Crash report for some ICU-52 (Debian 8) COLLATE and work_mem values
Hi,
With 10beta2 built on Debian 8 with
./configure --enable-debug --with-icu
and the ICU package currently in the "jessie" Debian repo:
$ dpkg -l 'libicu*'
...
ii  libicu-dev:amd64  52.1-8+deb8u  amd64  Development files for International Components for Unicode
ii  libicu52:amd64    52.1-8+deb8u  amd64  International Components for Unicode
ii  libicu52-dbg      52.1-8+deb8u  amd64  International Components for Unicode
I've got a table with 6.6 million unique small bits of text from different
Unicode alphabets:
Table "public.words_test"
Column | Type | Collation | Nullable | Default
----------+------+-----------+----------+---------
wordtext | text | | |
and found that running the following query on it consistently
provokes a SIGSEGV with certain collations:
SELECT count(distinct wordtext COLLATE :"collname") FROM words_test;
Some of the collations that crash:
az-Latn-AZ-u-co-search-x-icu
bs-Latn-BA-u-co-search-x-icu
bs-x-icu
cs-CZ-u-co-search-x-icu
de-BE-u-co-phonebk-x-icu
sr-Latn-XK-x-icu
zh-Hans-CN-u-co-big5han-x-icu
Trying all of them I had 146 crashes out of the 1741 ICU
entries in pg_collation created by initdb.
The size of the table is 291MB, and work_mem is set to 128MB.
Reducing the dataset tends to make the problem disappear: if I split
the table in halves based on row_number() to bisect on the data,
the queries on both parts pass without crashing.
Below is a backtrace obtained with collation "az-Latn-AZ-u-co-search-x-icu"
and work_mem set to 128MB. Cranking work_mem up to 512MB makes
the crash go away, but 300MB is not enough.
(By comparison, the same query with collation "en_US.utf8" or "fr-x-icu"
runs fine with work_mem at 4MB.)
#0 0x00007fc3c5017030 in ucol_getLatinOneContractionUTF8 (
coll=coll@entry=0x2824ef0, strength=strength@entry=0, CE=<optimized out>,
s=s@entry=0x383a3ec
"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200",
index=index@entry=0x7fff46afe290, len=len@entry=7) at ucol.cpp:8044
#1 0x00007fc3c502917c in ucol_strcollUseLatin1UTF8 (status=0x7fff46afe338,
tLen=7,
target=0x383a3ec
"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200",
sLen=6,
source=0x3839bec
"wuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\226\350\257\221\343\200\202\204u\347\233\230\357\274\214\272\346\211\213\344\272\206\344\270\200\344\272\233\343\200\202ct\342\224\200\342\224\230",
coll=0x2824ef0) at ucol.cpp:8153
#2 ucol_strcollUTF8_52 (coll=<optimized out>,
source=0x3839bec
"wuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\226\350\257\221\343\200\202\204u\347\233\230\357\274\214\272\346\211\213\344\272\206\344\270\200\344\272\233\343\200\202ct\342\224\200\342\224\230",
source@entry=0x3839be9
"reqwuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\226\350\257\221\343\200\202\204u\347\233\230\357\274\214\272\346\211\213\344\272\206\344\270\200\344\272\233\343\200\202ct\342\224\200\342\224\230",
sourceLength=<optimized out>, sourceLength@entry=9,
target=0x383a3ec
"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200",
target@entry=0x383a3e9
"requ\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200",
targetLength=7, targetLength@entry=10,
status=status@entry=0x7fff46afe338) at ucol.cpp:8770
#3 0x00000000007cc7b4 in varstrfastcmp_locale (x=58956776, y=58958824,
ssup=<optimized out>) at varlena.c:2139
#4 0x00000000008170b6 in ApplySortComparator (ssup=0x28171b8,
isNull2=<optimized out>, datum2=<optimized out>, isNull1=<optimized out>,
datum1=<optimized out>) at
../../../../src/include/utils/sortsupport.h:225
#5 comparetup_datum (a=0x2818ef8, b=0x2818f10, state=0x2816fa8)
at tuplesort.c:4341
#6 0x0000000000815623 in tuplesort_heap_replace_top (
state=state@entry=0x2816fa8, tuple=tuple@entry=0x7fff46afe410,
checkIndex=checkIndex@entry=0 '\000') at tuplesort.c:3510
#7 0x0000000000816d8c in tuplesort_gettuple_common (
state=state@entry=0x2816fa8, forward=forward@entry=1 '\001',
stup=stup@entry=0x7fff46afe460) at tuplesort.c:2082
#8 0x000000000081b176 in tuplesort_getdatum (state=0x2816fa8,
forward=forward@entry=1 '\001', val=val@entry=0x28130f0,
isNull=isNull@entry=0x2813409 "", abbrev=abbrev@entry=0x7fff46afe518)
at tuplesort.c:2205
#9 0x00000000005ea107 in process_ordered_aggregate_single (
pergroupstate=0x2812f38, pertrans=0x2812f98, aggstate=0x2811198)
at nodeAgg.c:1330
#10 finalize_aggregates (aggstate=aggstate@entry=0x2811198,
peraggs=peraggs@entry=0x2812588, pergroup=<optimized out>)
at nodeAgg.c:1736
#11 0x00000000005eaabd in agg_retrieve_direct (aggstate=0x2811198)
at nodeAgg.c:2464
#12 ExecAgg (node=node@entry=0x2811198) at nodeAgg.c:2117
#13 0x00000000005e2378 in ExecProcNode (node=node@entry=0x2811198)
at execProcnode.c:539
#14 0x00000000005ddf1e in ExecutePlan (execute_once=<optimized out>,
dest=0x27f7eb0, direction=<optimized out>, numberTuples=0,
sendTuples=<optimized out>, operation=CMD_SELECT,
use_parallel_mode=<optimized out>, planstate=0x2811198, estate=0x2810f88)
at execMain.c:1693
#15 standard_ExecutorRun (queryDesc=0x280d6d8, direction=<optimized out>,
count=0, execute_once=<optimized out>) at execMain.c:362
#16 0x00000000006f924c in PortalRunSelect (portal=portal@entry=0x280ef78,
forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,
dest=dest@entry=0x27f7eb0) at pquery.c:932
#17 0x00000000006fa5f0 in PortalRun (portal=0x280ef78,
count=9223372036854775807, isTopLevel=<optimized out>,
run_once=<optimized out>, dest=0x27f7eb0, altdest=0x27f7eb0,
completionTag=0x7fff46afe830 "") at pquery.c:773
#18 0x00000000006f67c3 in exec_simple_query (
query_string=0x383a3ec
"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\200")
at postgres.c:1099
#19 0x00000000006f842a in PostgresMain (argc=1, argv=0x27d5f68,
dbname=0x2752288 "mlists", username=0x276a298 "postgres")
at postgres.c:4090
#20 0x000000000047803f in BackendRun (port=0x274bf30) at postmaster.c:4357
#21 BackendStartup (port=0x274bf30) at postmaster.c:4029
#22 ServerLoop () at postmaster.c:1753
#23 0x0000000000692a82 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x2724330) at postmaster.c:1361
#24 0x0000000000478f8e in main (argc=3, argv=0x2724330) at main.c:228
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
Daniel Verite <daniel@manitou-mail.org> wrote:
SELECT count(distinct wordtext COLLATE :"collname") FROM words_test;
Some of the collations that crash:
az-Latn-AZ-u-co-search-x-icu
bs-Latn-BA-u-co-search-x-icu
bs-x-icu
cs-CZ-u-co-search-x-icu
de-BE-u-co-phonebk-x-icu
sr-Latn-XK-x-icu
zh-Hans-CN-u-co-big5han-x-icu
Trying all of them I had 146 crashes out of the 1741 ICU
entries in pg_collation created by initdb.
The size of the table is 291MB, and work_mem is set to 128MB.
Reducing the dataset tends to make the problem disappear: if I split
the table in halves based on row_number() to bisect on the data,
the queries on both parts pass without crashing.
I think that this sensitivity to work_mem exists because abbreviated
keys are used for quicksort operations that sort individual runs.
As work_mem is increased, and less merging is required, affected
codepaths are reached less frequently. You would probably find that the
problem appears more consistently if varstr_sortsupport() is modified so
that even ICU collations never use abbreviated keys; that would be a
matter of "abbreviate" always being set to false within that function.
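To make that concrete, here is a hedged sketch of the experiment (quoting
PG10's varstr_sortsupport() in src/backend/utils/adt/varlena.c from memory,
so the exact surrounding code may differ):

    /*
     * Stock PG10 disables abbreviation for non-C collations on platforms
     * where strxfrm() can't be trusted, but exempts ICU, roughly:
     *
     *   if (!collate_c && !(locale && locale->provider == COLLPROVIDER_ICU))
     *       abbreviate = false;
     *
     * The experiment is to drop the ICU exemption, so that abbreviated
     * keys are never used for any non-C collation:
     */
    if (!collate_c)
        abbreviate = false;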
I suggest using the new amcheck contrib module as part of this testing
(you'll need to use CREATE INDEX to have an index to perform
verification against). This will zero in on inconsistencies that may be
far more subtle than a hard crash. I wouldn't assume that abbreviated
key comparisons are correct here just because there is no hard crash.
Does the crash always have ucol_strcollUseLatin1UTF8() in its backtrace?
--
Peter Geoghegan
Peter Geoghegan wrote:
Does the crash always have ucol_strcollUseLatin1UTF8() in its backtrace?
AFAICS, yes.
The test that iterates over collations produces two kinds of core files:
some of them are 289MB, others are 17GB.
shared_buffers is only 128MB and work_mem 128MB,
so 289MB is not surprising, but 17GB seems excessive.
The box has 16GB of physical mem and 8GB of swap.
I haven't checked all core files because they exhaust the disk
space before completion of the test, but a typical backtrace for
the biggest ones looks like the following, with the segfaults
happening in memcpy:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __memcpy_sse2_unaligned ()
at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
(gdb) #0 __memcpy_sse2_unaligned ()
at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
#1 0x00007fc1db02be6b in memcpy (__len=8589934592, __src=0x7fbc6f4a2010,
__dest=<optimized out>) at
/usr/include/x86_64-linux-gnu/bits/string3.h:51
#2 ucol_CEBuf_Expand (ci=<optimized out>, status=0x7ffd90777128,
b=0x7ffd907751e0) at ucol.cpp:7009
#3 UCOL_CEBUF_PUT (status=0x7ffd90777128, ci=0x7ffd90776460, ce=1493173509,
b=0x7ffd907751e0) at ucol.cpp:7022
#4 ucol_strcollRegular (sColl=sColl@entry=0x7ffd90776460,
tColl=tColl@entry=0x7ffd90776610, status=status@entry=0x7ffd90777128)
at ucol.cpp:7163
#5 0x00007fc1db031177 in ucol_strcollRegularUTF8 (coll=0x1371af0,
source=source@entry=0x273d379 "콗喩zx㎍",
sourceLength=sourceLength@entry=11, target=<optimized out>,
targetLength=targetLength@entry=8, status=status@entry=0x7ffd90777128)
at ucol.cpp:8023
#6 0x00007fc1db032d36 in ucol_strcollUseLatin1UTF8 (status=<optimized out>,
tLen=<optimized out>, target=<optimized out>, sLen=<optimized out>,
source=<optimized out>, coll=<optimized out>) at ucol.cpp:8108
#7 ucol_strcollUTF8_52 (coll=<optimized out>,
source=source@entry=0x273d379 "콗喩zx㎍", sourceLength=<optimized out>,
sourceLength@entry=11, target=<optimized out>,
target@entry=0x273d409 "쳭喩zz", targetLength=targetLength@entry=8,
status=status@entry=0x7ffd90777128) at ucol.cpp:8770
#8 0x00000000007cc7b4 in varstrfastcmp_locale (x=41145208, y=41145352,
ssup=<optimized out>) at varlena.c:2139
#9 0x0000000000817138 in ApplySortAbbrevFullComparator (ssup=0x136bd98,
isNull2=<optimized out>, datum2=<optimized out>, isNull1=<optimized out>,
datum1=<optimized out>) at
../../../../src/include/utils/sortsupport.h:263
#10 comparetup_datum (a=0x7fc1c642d210, b=0x7fc1c642d228,
state=<optimized out>) at tuplesort.c:4350
#11 0x0000000000815aef in qsort_tuple (a=0x7fc1c642d210, n=<optimized out>,
n@entry=2505, cmp_tuple=cmp_tuple@entry=0x817030 <comparetup_datum>,
state=state@entry=0x136bb88) at qsort_tuple.c:104
#12 0x0000000000815c1f in qsort_tuple (a=0x7fc1c6413020, n=<optimized out>,
n@entry=5279, cmp_tuple=cmp_tuple@entry=0x817030 <comparetup_datum>,
state=state@entry=0x136bb88) at qsort_tuple.c:191
#13 0x0000000000815c1f in qsort_tuple (a=0x7fc1c63f27e8, n=<optimized out>,
n@entry=103353, cmp_tuple=cmp_tuple@entry=0x817030 <comparetup_datum>,
state=state@entry=0x136bb88) at qsort_tuple.c:191
#14 0x0000000000815c1f in qsort_tuple (a=0x7fc1c5f423c8, n=<optimized out>,
n@entry=503456, cmp_tuple=cmp_tuple@entry=0x817030 <comparetup_datum>,
state=state@entry=0x136bb88) at qsort_tuple.c:191
#15 0x0000000000815c1f in qsort_tuple (a=0x7fc1c34ae048, n=<optimized out>,
cmp_tuple=0x817030 <comparetup_datum>, state=state@entry=0x136bb88)
at qsort_tuple.c:191
#16 0x000000000081873b in tuplesort_sort_memtuples (
state=state@entry=0x136bb88) at tuplesort.c:3410
#17 0x0000000000818907 in dumpbatch (alltuples=0 '\000', state=0x136bb88)
at tuplesort.c:3087
#18 dumptuples (state=state@entry=0x136bb88,
alltuples=alltuples@entry=0 '\000') at tuplesort.c:2970
#19 0x0000000000818b81 in puttuple_common (state=state@entry=0x136bb88,
tuple=tuple@entry=0x7ffd90777470) at tuplesort.c:1719
#20 0x000000000081a6f1 in tuplesort_putdatum (state=0x136bb88,
val=<optimized out>, isNull=<optimized out>) at tuplesort.c:1558
#21 0x00000000005e91ec in advance_aggregates (
aggstate=aggstate@entry=0x13731f8, pergroup=pergroup@entry=0x1376008,
pergroups=0x0) at nodeAgg.c:1023
#22 0x00000000005eac03 in agg_retrieve_direct (aggstate=0x13731f8)
at nodeAgg.c:2402
#23 ExecAgg (node=node@entry=0x13731f8) at nodeAgg.c:2117
#24 0x00000000005e2378 in ExecProcNode (node=node@entry=0x13731f8)
at execProcnode.c:539
#25 0x00000000005ddf1e in ExecutePlan (execute_once=<optimized out>,
dest=0x132f368, direction=<optimized out>, numberTuples=0,
sendTuples=<optimized out>, operation=CMD_SELECT,
use_parallel_mode=<optimized out>, planstate=0x13731f8, estate=0x1372fe8)
at execMain.c:1693
#26 standard_ExecutorRun (queryDesc=0x1350938, direction=<optimized out>,
count=0, execute_once=<optimized out>) at execMain.c:362
#27 0x00000000006f924c in PortalRunSelect (portal=portal@entry=0x1362d98,
forward=forward@entry=1 '\001', count=0, count@entry=9223372036854775807,
dest=dest@entry=0x132f368) at pquery.c:932
#28 0x00000000006fa5f0 in PortalRun (portal=0x1362d98,
count=9223372036854775807, isTopLevel=<optimized out>,
run_once=<optimized out>, dest=0x132f368, altdest=0x132f368,
completionTag=0x7ffd90777990 "") at pquery.c:773
#29 0x00000000006f7da1 in exec_execute_message (max_rows=9223372036854775807,
portal_name=<optimized out>) at postgres.c:1984
#30 PostgresMain (argc=3, argv=0x7fffffffffffffff, dbname=0x12cb288 "mlists",
username=0x132f368 "") at postgres.c:4153
#31 0x000000000047803f in BackendRun (port=0x12c4f30) at postmaster.c:4357
#32 BackendStartup (port=0x12c4f30) at postmaster.c:4029
#33 ServerLoop () at postmaster.c:1753
#34 0x0000000000692a82 in PostmasterMain (argc=argc@entry=3,
argv=argv@entry=0x129d330) at postmaster.c:1361
#35 0x0000000000478f8e in main (argc=3, argv=0x129d330) at main.c:228
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On Tue, Aug 1, 2017 at 12:45 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
The test that iterates over collations produces two kinds of core files:
some of them are 289MB, others are 17GB.
shared_buffers is only 128MB and work_mem 128MB,
so 289MB is not surprising, but 17GB seems excessive.
The box has 16GB of physical mem and 8GB of swap.
I haven't checked all core files because they exhaust the disk
space before completion of the test, but a typical backtrace for
the biggest ones looks like the following, with the segfaults
happening in memcpy:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 __memcpy_sse2_unaligned ()
at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
(gdb) #0 __memcpy_sse2_unaligned ()
at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
#1 0x00007fc1db02be6b in memcpy (__len=8589934592, __src=0x7fbc6f4a2010,
__dest=<optimized out>) at
/usr/include/x86_64-linux-gnu/bits/string3.h:51
#2 ucol_CEBuf_Expand (ci=<optimized out>, status=0x7ffd90777128,
b=0x7ffd907751e0) at ucol.cpp:7009
#3 UCOL_CEBUF_PUT (status=0x7ffd90777128, ci=0x7ffd90776460, ce=1493173509,
b=0x7ffd907751e0) at ucol.cpp:7022
#4 ucol_strcollRegular (sColl=sColl@entry=0x7ffd90776460,
tColl=tColl@entry=0x7ffd90776610, status=status@entry=0x7ffd90777128)
at ucol.cpp:7163
#5 0x00007fc1db031177 in ucol_strcollRegularUTF8 (coll=0x1371af0,
source=source@entry=0x273d379 "콗喩zx㎍",
sourceLength=sourceLength@entry=11, target=<optimized out>,
targetLength=targetLength@entry=8, status=status@entry=0x7ffd90777128)
at ucol.cpp:8023
#6 0x00007fc1db032d36 in ucol_strcollUseLatin1UTF8 (status=<optimized out>,
tLen=<optimized out>, target=<optimized out>, sLen=<optimized out>,
source=<optimized out>, coll=<optimized out>) at ucol.cpp:8108
#7 ucol_strcollUTF8_52 (coll=<optimized out>,
source=source@entry=0x273d379 "콗喩zx㎍", sourceLength=<optimized out>,
sourceLength@entry=11, target=<optimized out>,
target@entry=0x273d409 "쳭喩zz", targetLength=targetLength@entry=8,
status=status@entry=0x7ffd90777128) at ucol.cpp:8770
Interesting. The "__len" argument to memcpy() is 8589934592 -- that's
2 ^ 33. (I'm not sure why it's the first memcpy() argument in the
stack trace, since it's supposed to be the last -- anyone seen that
before?)
Can you figure out what the optimized-out lengths are, by either
looking at registers within GDB, or building at a lower optimization
level?
Maybe this is a bug in ICU-52. For reference, here is ICU-52's
ucol_CEBuf_Expand() function:
static
void ucol_CEBuf_Expand(ucol_CEBuf *b, collIterate *ci, UErrorCode *status) {
    uint32_t  oldSize;
    uint32_t  newSize;
    uint32_t  *newBuf;

    ci->flags |= UCOL_ITER_ALLOCATED;
    oldSize = (uint32_t)(b->pos - b->buf);
    newSize = oldSize * 2;
    newBuf = (uint32_t *)uprv_malloc(newSize * sizeof(uint32_t));
    if(newBuf == NULL) {
        *status = U_MEMORY_ALLOCATION_ERROR;
    }
    else {
        uprv_memcpy(newBuf, b->buf, oldSize * sizeof(uint32_t));
        if (b->buf != b->localArray) {
            uprv_free(b->buf);
        }
        b->buf = newBuf;
        b->endp = b->buf + newSize;
        b->pos = b->buf + oldSize;
    }
}
If "oldSize * sizeof(uint32_t)" becomes what we see as "__len", as I
believe it does, then that must mean that oldSize is 2 ^ 31. *Not* 2 ^
31 - 1 (INT_MAX). I think that this could be an off-by-one bug, since
ucol_strcollUTF8()/ucol_strcollUTF8_52() accepts an int32 argument for
sourceLength and targetLength. I'm not very confident of this, but it
does make a certain amount of sense. It could be that everyone else is
passing -1 as sourceLength and targetLength arguments, anyway, to
indicate that the buffer is NUL-terminated, as required by regular
strcoll().
Note also that the docs say this of ucol_strcollUTF8(): "When input
string contains a malformed UTF-8 byte sequence, this function treats
these bytes as REPLACEMENT CHARACTER (U+FFFD)". I'm not sure that
that's a very sensible way for it to fail.
I'd be interested to see if anything changed when -1 was passed as
both sourceLength and targetLength to ucol_strcollUTF8(). You'd have
to build Postgres yourself to test this, but it would just work, since
we don't actually avoid NUL termination, even though in principle we
could with ICU.
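For reference, a minimal standalone sketch of such a call with
NUL-terminated inputs (the helper name is made up; error handling is
reduced to a bare check):

    #include <unicode/ucol.h>

    /* Compare two NUL-terminated UTF-8 strings.  Passing -1 lengths makes
     * ICU compute the lengths itself, as with plain strcoll(). */
    static int
    compare_nul_terminated_utf8(const UCollator *coll,
                                const char *a, const char *b)
    {
        UErrorCode status = U_ZERO_ERROR;
        UCollationResult r = ucol_strcollUTF8(coll, a, -1, b, -1, &status);

        if (U_FAILURE(status))
            return 0;           /* real code should report the error */
        return (int) r;
    }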
--
Peter Geoghegan
Peter Geoghegan wrote:
I'd be interested to see if anything changed when -1 was passed as
both sourceLength and targetLength to ucol_strcollUTF8(). You'd have
to build Postgres yourself to test this, but it would just work, since
we don't actually avoid NUL termination, even though in principle we
could with ICU.
It doesn't seem to change the outcome:
*** varlena.c~	2017-07-10 22:26:20.000000000 +0200
--- varlena.c	2017-08-02 12:57:25.997265001 +0200
***************
*** 2137,2144 ****
  		status = U_ZERO_ERROR;
  		result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
! 					  a1p, len1,
! 					  a2p, len2,
  					  &status);
  		if (U_FAILURE(status))
  			ereport(ERROR,
--- 2137,2144 ----
  		status = U_ZERO_ERROR;
  		result = ucol_strcollUTF8(sss->locale->info.icu.ucol,
! 					  a1p, -1, /* len1 */
! 					  a2p, -1, /*len2,*/
  					  &status);
  		if (U_FAILURE(status))
  			ereport(ERROR,
For the case where it ends up crashing in memcpy, the backtrace now
looks like this:
#0 __memcpy_sse2_unaligned ()
at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
#1 0x00007f0156812e6b in memcpy (__len=8589934592, __src=0x7efbeac89010,
__dest=<optimized out>) at
/usr/include/x86_64-linux-gnu/bits/string3.h:51
#2 ucol_CEBuf_Expand (ci=<optimized out>, status=0x7ffd49c4ca98,
b=0x7ffd49c4ab60) at ucol.cpp:7009
#3 UCOL_CEBUF_PUT (status=0x7ffd49c4ca98, ci=0x7ffd49c4bde0, ce=1493173509,
b=0x7ffd49c4ab60) at ucol.cpp:7022
#4 ucol_strcollRegular (sColl=sColl@entry=0x7ffd49c4bde0,
tColl=tColl@entry=0x7ffd49c4bf90, status=status@entry=0x7ffd49c4ca98)
at ucol.cpp:7163
#5 0x00007f0156818177 in ucol_strcollRegularUTF8 (coll=0x2286100,
source=source@entry=0x3651829 "쳭喩zz",
sourceLength=sourceLength@entry=-1, target=<optimized out>,
targetLength=targetLength@entry=-1, status=status@entry=0x7ffd49c4ca98)
at ucol.cpp:8023
#6 0x00007f0156819d36 in ucol_strcollUseLatin1UTF8 (status=<optimized out>,
tLen=<optimized out>, target=<optimized out>, sLen=<optimized out>,
source=<optimized out>, coll=<optimized out>) at ucol.cpp:8108
#7 ucol_strcollUTF8_52 (coll=<optimized out>,
source=source@entry=0x3651829 "쳭喩zz", sourceLength=<optimized out>,
sourceLength@entry=-1, target=<optimized out>,
target@entry=0x3651799 "콗喩zx㎍", targetLength=targetLength@entry=-1,
status=status@entry=0x7ffd49c4ca98) at ucol.cpp:8770
#8 0x00000000007cc7df in varstrfastcmp_locale (x=56956968, y=56956824,
ssup=<optimized out>) at varlena.c:2139
#9 0x0000000000817168 in ApplySortAbbrevFullComparator (ssup=0x2280338,
isNull2=<optimized out>, datum2=<optimized out>, isNull1=<optimized out>,
datum1=<optimized out>) at
../../../../src/include/utils/sortsupport.h:263
#10 comparetup_datum (a=0x7f0141c14228, b=0x7f0141c14240,
state=<optimized out>) at tuplesort.c:4350
#11 0x0000000000815b1f in qsort_tuple (a=0x7f0141c14210, n=<optimized out>,
n@entry=106, cmp_tuple=cmp_tuple@entry=0x817060 <comparetup_datum>,
state=state@entry=0x2280128) at qsort_tuple.c:104
.....
Anyway I'll try to build ICU-52 without optimization to avoid these
<optimized out> parameters.
Also running the same tests with a self-compiled ICU-59 versus Debian's
packaged ICU-52 results in *no crash at all*. So maybe the root cause of
these crashes is that a critical fix in ICU has not been backported?
I note that the list of ICU locales available in pg_collation is different,
though. When compiled with ICU-59, initdb produces "only" 581 collname
matching '%icu%', versus 1741 with ICU-52. The ICU version
is the only thing that differs between both cases.
But anyway in the list of collations that crash for me with ICU-52, a fair
number (~50) are still there and work in ICU-59, so I think the comparison
is still meaningful.
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On Wed, Aug 2, 2017 at 5:00 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
Anyway I'll try to build ICU-52 without optimization to avoid these
<optimized out> parameters.
Also running the same tests with a self-compiled ICU-59 versus Debian's
packaged ICU-52 results in *no crash at all*. So maybe the root cause of
these crashes is that a critical fix in ICU has not been backported?
That seems possible. Certainly, there have been a fair number of
bugfixes against this package, some as recently as April of this year:
http://metadata.ftp-master.debian.org/changelogs/main/i/icu/icu_52.1-8+deb8u5_changelog
I'm sure that you used a version that has the relevant fixes, but it's
not that hard to imagine that new bugs were introduced in the process
of backpatching upstream fixes from later major ICU releases. There
has only been one official ICU point release for 52, which was way
back in 2013. Unofficial backpatches are hazardous in their own way.
I note that the list of ICU locales available in pg_collation is different,
though. When compiled with ICU-59, initdb produces "only" 581 collname
matching '%icu%', versus 1741 with ICU-52. The ICU version
is the only thing that differs between both cases.
But anyway in the list of collations that crash for me with ICU-52, a fair
number (~50) are still there and work in ICU-59, so I think the comparison
is still meaningful.
I agree that it's a meaningful comparison. It now seems very likely
that this is a bug in ICU, and possibly this particular package. That
seemed likely before now anyway; there is not that much that could go
wrong with PostgreSQL's use of ICU that is peculiar to only a small
minority of available collations. Perhaps you can whittle down your
testcase into something minimal, for the Debian package maintainer.
--
Peter Geoghegan
On 8/2/17 08:00, Daniel Verite wrote:
I note that the list of ICU locales available in pg_collation is different,
though. When compiled with ICU-59, initdb produces "only" 581 collname
matching '%icu%', versus 1741 with ICU-52.
This is because the older version produces redundant locale names such
as de-BE-*, de-CH-*, de-DE-*, whereas the newer version omits those
because they are (presumably?) all equivalent to de-*.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter Geoghegan wrote:
it's not that hard to imagine that new bugs were introduced in the process
of backpatching upstream fixes from later major ICU releases
I've now tried with ICU-52.1 from upstream, built with
gcc -O0 -g to get cleaner stack traces.
The behavior is similar to Debian's precompiled lib in how and where
it crashes.
work_mem is at 128MB.
query #1:
select count(distinct wordtext collate "az-x-icu") from words_test;
The backend terminates with SIGSEGV and a 289MB core file:
at ucol.cpp:8033
8033 while(schar > (tchar = *(UCharOffset+offset))) { /* since the
contraction codepoints should be ordered, we skip all that are smaller */
#0 0x00007f7926e12341 in ucol_getLatinOneContractionUTF8 (coll=0x1b422c0,
strength=0, CE=4061217513,
s=0x2b56b5c
"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\220",
index=0x7ffde80dc978, len=7) at ucol.cpp:8033
#1 0x00007f7926e12845 in ucol_strcollUseLatin1UTF8 (coll=0x1b422c0,
source=0x2b56b5c
"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\220",
sLen=7,
target=0x2b5735c
"wuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\273\345\205\263\351\227\255\347\232\204\351\202\243\344\270\252\346\217\220\347\244\272\347\252\227\345\234\250\346\234\200\345\272\225\344\270\213\222\230",
tLen=6, status=0x7ffde80dcab8) at ucol.cpp:8104
#2 0x00007f7926e1487a in ucol_strcollUTF8_52 (coll=0x1b422c0,
source=0x2b56b5c
"u\364\217\273\252tectuablesow-what-it-is-about_tlsstnamek\224\257\346\214\201gb2312\357\274\214\202\232\204\344\271\237\351\201\207\345\210\260\350\277\207\344\270\200\344\272\233\345\205\266\345\256\203\351\227\256\351\242\230\357\274\214\345\233\240\344\270\272\346\262\241\346\234\211\350\277\231\344\270\252\351\227\256\351\242\230\344\270\245\351\207\215\357\274\214\350\277\230\347\256\227\344\270\215\351\224\231\343\200\202\200\342\224\220",
sourceLength=7,
target=0x2b5735c
"wuiredntnsookiemmand-tcl-testsuit-tp65374p65375\204\346\226\231\344\271\237\345\276\210\345\244\232\343\200\202\257debian\345\222\214redhat\344\272\206\357\274\214\344\270\244\344\270\252\215\263\345\264\251\346\272\203\343\200\202\273\345\205\263\351\227\255\347\232\204\351\202\243\344\270\252\346\217\220\347\244\272\347\252\227\345\234\250\346\234\200\345\272\225\344\270\213\222\230",
targetLength=6, status=0x7ffde80dcab8) at ucol.cpp:8759
#3 0x00000000007cc7b4 in varstrfastcmp_locale (x=45443928, y=45445976,
ssup=<optimized out>) at varlena.c:2139
#4 0x00000000008170b6 in ApplySortComparator (ssup=0x1b344a8,
isNull2=<optimized out>, datum2=<optimized out>, isNull1=<optimized out>,
datum1=<optimized out>) at
../../../../src/include/utils/sortsupport.h:225
#5 comparetup_datum (a=0x7ffde80dcb90, b=0x1b36218, state=0x1b34298)
at tuplesort.c:4341
#6 0x00000000008154d9 in tuplesort_heap_replace_top (
state=state@entry=0x1b34298, tuple=tuple@entry=0x7ffde80dcb90,
checkIndex=checkIndex@entry=0 '\000') at tuplesort.c:3512
...cut...
query #2:
select count(distinct wordtext collate "bs-Cyrl-BA-u-co-search-x-icu")
from words_test;
The backend terminates with SIGSEGV leaving a 17GB core file
(there's definitely a uint32 offset overflow from 2^31->0
in ucol_CEBuf_Expand just before the crash in memcpy)
#0 __memcpy_sse2_unaligned ()
at ../sysdeps/x86_64/multiarch/memcpy-sse2-unaligned.S:35
#1 0x00007fbad6951ec3 in ucol_CEBuf_Expand (b=0x7ffdef2d2390,
ci=0x7ffdef2d3560, status=0x7ffdef2d41e8) at ucol.cpp:6998
#2 0x00007fbad6951f68 in UCOL_CEBUF_PUT (b=0x7ffdef2d2390, ce=1493173509,
ci=0x7ffdef2d3560, status=0x7ffdef2d41e8) at ucol.cpp:7011
#3 0x00007fbad69525c7 in ucol_strcollRegular (sColl=0x7ffdef2d3560,
tColl=0x7ffdef2d3710, status=0x7ffdef2d41e8) at ucol.cpp:7152
#4 0x00007fbad6954080 in ucol_strcollRegularUTF8 (coll=0xf75610,
source=0x23318a9 "\354\275\227\345\226\251zx\343\216\215",
sourceLength=11, target=0x2331939 "\354\263\255\345\226\251zz",
targetLength=8, status=0x7ffdef2d41e8) at ucol.cpp:8012
#5 0x00007fbad6956845 in ucol_strcollUTF8_52 (coll=0xf75610,
source=0x23318a9 "\354\275\227\345\226\251zx\343\216\215",
sourceLength=11, target=0x2331939 "\354\263\255\345\226\251zz",
targetLength=8, status=0x7ffdef2d41e8) at ucol.cpp:8757
#6 0x00000000007cc7b4 in varstrfastcmp_locale (x=36903080, y=36903224,
ssup=<optimized out>) at varlena.c:2139
#7 0x0000000000817138 in ApplySortAbbrevFullComparator (ssup=0xf67888,
isNull2=<optimized out>, datum2=<optimized out>, isNull1=<optimized out>,
datum1=<optimized out>) at
../../../../src/include/utils/sortsupport.h:263
#8 comparetup_datum (a=0x7fbac1cd6210, b=0x7fbac1cd6228,
state=<optimized out>) at tuplesort.c:4350
#9 0x0000000000815aef in qsort_tuple (a=0x7fbac1cd6210, n=<optimized out>,
n@entry=2505, cmp_tuple=cmp_tuple@entry=0x817030 <comparetup_datum>,
state=state@entry=0xf67678) at qsort_tuple.c:104
#10 0x0000000000815c1f in qsort_tuple (a=0x7fbac1cbc020, n=<optimized out>,
n@entry=5279, cmp_tuple=cmp_tuple@entry=0x817030 <comparetup_datum>,
state=state@entry=0xf67678) at qsort_tuple.c:191
...cut...
Next I've tried with valgrind, invoked with:
valgrind \
--suppressions=$HOME/src/postgresql-10beta2/src/tools/valgrind.supp \
--trace-children=yes --track-origins=yes --error-limit=no \
--read-var-info=yes --log-file=/tmp/log-valgrind \
bin/postgres -D $PWD/data
(and turning off autovacuum because it causes some unrelated reports).
With query #1 the run goes to completion without crashing (hi heisenbug!),
but it reports many uses of uninitialised values under ucol_strcollUTF8().
The full log is attached in log-valgrind-1.txt.gz
With query #2 it ends up crashing after ~5hours and produces
the log in log-valgrind-2.txt.gz with some other entries than
case #1, but AFAICS still all about reading uninitialised values
in space allocated by datumCopy().
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
With query #2 it ends up crashing after ~5hours and produces
the log in log-valgrind-2.txt.gz with some other entries than
case #1, but AFAICS still all about reading uninitialised values
in space allocated by datumCopy().
Right. This part is really interesting to me:
==48827== Uninitialised value was created by a heap allocation
==48827== at 0x4C28C20: malloc (vg_replace_malloc.c:296)
==48827== by 0x80B597: AllocSetAlloc (aset.c:771)
==48827== by 0x810ADC: palloc (mcxt.c:862)
==48827== by 0x72BFEF: datumCopy (datum.c:171)
==48827== by 0x81A74C: tuplesort_putdatum (tuplesort.c:1515)
==48827== by 0x5E91EB: advance_aggregates (nodeAgg.c:1023)
If you actually go to datum.c:171, you see that that's a codepath for
pass-by-reference datatypes that lack a varlena header. Text is a
datatype that has a varlena header, though, so that's clearly wrong. I
don't know how this actually happened, but working back through the
relevant tuplesort_begin_datum() caller, initialize_aggregate(), does
suggest some things. (tuplesort_begin_datum() is where datumTypeLen is
determined for the entire datum tuplesort.)
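To unpack that reading: datumCopy() decides how many bytes to palloc()
and memcpy() based on typLen. Paraphrasing the PG10 sources from memory
(not line-exact), the pass-by-reference paths look roughly like:

    else if (typLen == -1)
    {
        /* varlena (e.g. text): the size comes from the varlena header */
        Size    sz = VARSIZE_ANY(vl);
        ...
    }
    else
    {
        /* pass-by-ref but not varlena: the size comes from datumGetSize(),
         * which for a fixed-length type is just typLen; datum.c:171 is in
         * this branch */
        Size    sz = datumGetSize(value, typByVal, typLen);
        ...
    }

On that reading, for a text datum to land in the second branch, the
tuplesort's datumTypeLen would have to be something other than -1, which
is why the get_typlenbyval() answer matters.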
I am once again only guessing, but I have to wonder if this is a
problem in commit b8d7f053. It seems likely that the problem begins
before tuplesort_begin_datum() is even called, which is the basis of
this suspicion. If the problem is within tuplesort, then that could
only be because get_typlenbyval() gives wrong answers, which seems
very unlikely.
--
Peter Geoghegan
On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
With query #1 the run goes to completion without crashing (hi heisenbug!),
but it reports many uses of uninitialised values under ucol_strcollUTF8().
The full log is attached in log-valgrind-1.txt.gz
With query #2 it ends up crashing after ~5hours and produces
the log in log-valgrind-2.txt.gz with some other entries than
case #1, but AFAICS still all about reading uninitialised values
in space allocated by datumCopy().
It would be nice if you could confirm whether or not Valgrind
complains when non-ICU collations are in use. It may just have been
that we get (un)lucky with ICU, where the undefined behavior happens
to result in a hard crash, more or less by accident.
--
Peter Geoghegan
Peter Geoghegan wrote:
It would be nice if you could confirm whether or not Valgrind
complains when non-ICU collations are in use.
It doesn't (with en_US.utf8), and also doesn't complain
for certain ICU collations such as "fr-x-icu", one of
the 1595 that don't segfault with this data and work_mem
at 128MB (vs the 146 that do segfault).
If someone wants to reproduce the problem, I've made
a custom dump (40MB) available at
http://www.manitou-mail.org/vrac/words_test.dump
query: select count(distinct wordtext collate "bs-x-icu") from words_test;
work_mem: 128MB or less, shared_buffers: 128MB
OS: debian 8 x86_64
ICU: 52.1, upstream or deb8
locale environment from which initdb was called:
$ locale -a
C
C.UTF-8
de_DE.utf8
en_US.utf8
français
french
fr_FR
fr_FR.iso88591
fr_FR.utf8
pl_PL
pl_PL.iso88592
polish
POSIX
Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
On Thu, Aug 03, 2017 at 11:42:25AM -0700, Peter Geoghegan wrote:
On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
With query #2 it ends up crashing after ~5hours and produces
the log in log-valgrind-2.txt.gz with some other entries than
case #1, but AFAICS still all about reading uninitialised values
in space allocated by datumCopy().
Right. This part is really interesting to me:
==48827== Uninitialised value was created by a heap allocation
==48827== at 0x4C28C20: malloc (vg_replace_malloc.c:296)
==48827== by 0x80B597: AllocSetAlloc (aset.c:771)
==48827== by 0x810ADC: palloc (mcxt.c:862)
==48827== by 0x72BFEF: datumCopy (datum.c:171)
==48827== by 0x81A74C: tuplesort_putdatum (tuplesort.c:1515)
==48827== by 0x5E91EB: advance_aggregates (nodeAgg.c:1023)
If you actually go to datum.c:171, you see that that's a codepath for
pass-by-reference datatypes that lack a varlena header. Text is a
datatype that has a varlena header, though, so that's clearly wrong. I
don't know how this actually happened, but working back through the
relevant tuplesort_begin_datum() caller, initialize_aggregate(), does
suggest some things. (tuplesort_begin_datum() is where datumTypeLen is
determined for the entire datum tuplesort.)
I am once again only guessing, but I have to wonder if this is a
problem in commit b8d7f053. It seems likely that the problem begins
before tuplesort_begin_datum() is even called, which is the basis of
this suspicion. If the problem is within tuplesort, then that could
only be because get_typlenbyval() gives wrong answers, which seems
very unlikely.
[Action required within three days. This is a generic notification.]
The above-described topic is currently a PostgreSQL 10 open item. Peter
(Eisentraut), since you committed the patch believed to have created it, you
own this open item. If some other commit is more relevant or if this does not
belong as a v10 open item, please let us know. Otherwise, please observe the
policy on open item ownership[1] and send a status update within three
calendar days of this message. Include a date for your subsequent status
update. Testers may discover new open items at any time, and I want to plan
to get them all fixed well in advance of shipping v10. Consequently, I will
appreciate your efforts toward speedy resolution. Thanks.
[1]: /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com
Adding -hackers.
On Sat, Aug 05, 2017 at 03:55:13PM -0700, Noah Misch wrote:
On Thu, Aug 03, 2017 at 11:42:25AM -0700, Peter Geoghegan wrote:
On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
With query #2 it ends up crashing after ~5hours and produces
the log in log-valgrind-2.txt.gz with some other entries than
case #1, but AFAICS still all about reading uninitialised values
in space allocated by datumCopy().
Right. This part is really interesting to me:
==48827== Uninitialised value was created by a heap allocation
==48827== at 0x4C28C20: malloc (vg_replace_malloc.c:296)
==48827== by 0x80B597: AllocSetAlloc (aset.c:771)
==48827== by 0x810ADC: palloc (mcxt.c:862)
==48827== by 0x72BFEF: datumCopy (datum.c:171)
==48827== by 0x81A74C: tuplesort_putdatum (tuplesort.c:1515)
==48827== by 0x5E91EB: advance_aggregates (nodeAgg.c:1023)
If you actually go to datum.c:171, you see that that's a codepath for
pass-by-reference datatypes that lack a varlena header. Text is a
datatype that has a varlena header, though, so that's clearly wrong. I
don't know how this actually happened, but working back through the
relevant tuplesort_begin_datum() caller, initialize_aggregate(), does
suggest some things. (tuplesort_begin_datum() is where datumTypeLen is
determined for the entire datum tuplesort.)
I am once again only guessing, but I have to wonder if this is a
problem in commit b8d7f053. It seems likely that the problem begins
before tuplesort_begin_datum() is even called, which is the basis of
this suspicion. If the problem is within tuplesort, then that could
only be because get_typlenbyval() gives wrong answers, which seems
very unlikely.
[Action required within three days. This is a generic notification.]
The above-described topic is currently a PostgreSQL 10 open item. Peter
(Eisentraut), since you committed the patch believed to have created it, you
own this open item. If some other commit is more relevant or if this does not
belong as a v10 open item, please let us know. Otherwise, please observe the
policy on open item ownership[1] and send a status update within three
calendar days of this message. Include a date for your subsequent status
update. Testers may discover new open items at any time, and I want to plan
to get them all fixed well in advance of shipping v10. Consequently, I will
appreciate your efforts toward speedy resolution. Thanks.
[1] /messages/by-id/20170404140717.GA2675809@tornado.leadboat.com
On 2017-08-03 11:42:25 -0700, Peter Geoghegan wrote:
On Thu, Aug 3, 2017 at 8:49 AM, Daniel Verite <daniel@manitou-mail.org> wrote:
With query #2 it ends up crashing after ~5hours and produces
the log in log-valgrind-2.txt.gz with some other entries than
case #1, but AFAICS still all about reading uninitialised values
in space allocated by datumCopy().
Right. This part is really interesting to me:
==48827== Uninitialised value was created by a heap allocation
==48827== at 0x4C28C20: malloc (vg_replace_malloc.c:296)
==48827== by 0x80B597: AllocSetAlloc (aset.c:771)
==48827== by 0x810ADC: palloc (mcxt.c:862)
==48827== by 0x72BFEF: datumCopy (datum.c:171)
==48827== by 0x81A74C: tuplesort_putdatum (tuplesort.c:1515)
==48827== by 0x5E91EB: advance_aggregates (nodeAgg.c:1023)
If you actually go to datum.c:171, you see that that's a codepath for
pass-by-reference datatypes that lack a varlena header. Text is a
datatype that has a varlena header, though, so that's clearly wrong. I
don't know how this actually happened, but working back through the
relevant tuplesort_begin_datum() caller, initialize_aggregate(), does
suggest some things. (tuplesort_begin_datum() is where datumTypeLen is
determined for the entire datum tuplesort.)
I am once again only guessing, but I have to wonder if this is a
problem in commit b8d7f053. It seems likely that the problem begins
before tuplesort_begin_datum() is even called, which is the basis of
this suspicion. If the problem is within tuplesort, then that could
only be because get_typlenbyval() gives wrong answers, which seems
very unlikely.
Not saying it's not the fault of b8d7f053 et al, but I don't quite see
how -- whether something is a varlena datum or not isn't really something
expression evaluation has any influence over, unless it happens from
within its code. That's the responsibility of the calling code, not of
the datum itself. So I don't quite understand how you got to b8d7f053?
- Andres
On Sat, Aug 5, 2017 at 4:03 PM, Andres Freund <andres@anarazel.de> wrote:
Not saying it's not the fault of b8d7f053 et al, but I don't quite see
how -- whether something is a varlena datum or not isn't really something
expression evaluation has any influence over, unless it happens from
within its code. That's the responsibility of the calling code, not of
the datum itself. So I don't quite understand how you got to b8d7f053?
It was a guess based on nothing more than what codepaths changed since
the last release.
While I think it's fair to assume for the time being that this is an
ICU bug, based on the fact that Valgrind won't complain when other
collations are used with the same data, I have a really hard time
figuring out why the datum tuplesort thinks that text is a
pass-by-reference type without a varlena header. That doesn't smell
like a bug in the ICU feature to me. Could an ICU bug really have
somehow made unrelated syscache lookups give wrong answers about
built-in types?
I'll try to find time to reproduce the problem next week. It should be
possible to work backwards from the weird issue within datumCopy(),
and figure out what the problem is.
--
Peter Geoghegan
"Daniel Verite" <daniel@manitou-mail.org> writes:
If someone wants to reproduce the problem, I've made
a custom dump (40MB) available at
http://www.manitou-mail.org/vrac/words_test.dump
Thanks for providing this test data. I've attempted, so far
unsuccessfully, to reproduce the crash. I'm using today's HEAD PG code,
in --enable-debug --enable-cassert configuration, all parameters default
except work_mem = 128MB. I see no crash with any ICU collation provided
by these configurations:
* RHEL 6's stock version of ICU 4.2, on x86_64
* Direct-from-upstream-source build of icu4c-57_1-src.tgz on
current macOS, also x86_64.
So while that isn't helping all that much, it seems to constrain the range
of affected ICU versions further than we knew before.
I'm quite disturbed though that the set of installed collations on these
two test cases seem to be entirely different both from each other and from
what you reported. The base collations look generally similar, but the
"keyword variant" versions are not comparable at all. Considering that
the entire reason we are interested in ICU in the first place is its
alleged cross-version collation behavior stability, this gives me the
exact opposite of a warm fuzzy feeling. We need to understand why it's
like that and what we can do to reduce the variation, or else we're just
buying our users enormous future pain. At least with the libc collations,
you can expect that if you have en_US.utf8 available today you will
probably still have en_US.utf8 available tomorrow. I am not seeing any
reason to believe that the same holds for ICU collations.
regards, tom lane
I wrote:
"Daniel Verite" <daniel@manitou-mail.org> writes:
If someone wants to reproduce the problem, I've made
a custom dump (40MB) available at
http://www.manitou-mail.org/vrac/words_test.dump
Thanks for providing this test data. I've attempted, so far
unsuccessfully, to reproduce the crash.
OK, so after installing ICU 52.1 from upstream sources, I can reproduce
these problems, and I think there's nothing to see here except ICU bugs.
In the case where Daniel reports huge core dumps, the problem seems to be
that ucol_strcollRegular() gets into an infinite loop here:
    // We fetch CEs until we hit a non ignorable primary or end.
    uint32_t sPrimary;
    do {
        // We get the next CE
        sOrder = ucol_IGetNextCE(coll, sColl, status);
        // Stuff it in the buffer
        UCOL_CEBUF_PUT(&sCEs, sOrder, sColl, status);
        // And keep just the primary part.
        sPrimary = sOrder & UCOL_PRIMARYMASK;
    } while(sPrimary == 0);
It keeps on stuffing tokens into "sCEs", which is an expansible buffer in
the same spirit as our StringInfos, and once that gets past 2^31 tokens
all hell breaks loose, because the expansion code is entirely unconcerned
about the possibility of integer overflow:
void ucol_CEBuf_Expand(ucol_CEBuf *b, collIterate *ci, UErrorCode *status) {
    uint32_t  oldSize;
    uint32_t  newSize;
    uint32_t  *newBuf;

    ci->flags |= UCOL_ITER_ALLOCATED;
    oldSize = (uint32_t)(b->pos - b->buf);
    newSize = oldSize * 2;
    newBuf = (uint32_t *)uprv_malloc(newSize * sizeof(uint32_t));
    if(newBuf == NULL) {
        *status = U_MEMORY_ALLOCATION_ERROR;
    }
    else {
        uprv_memcpy(newBuf, b->buf, oldSize * sizeof(uint32_t));
        ...
The buffer size seems to always be a power of 2. After enough iterations
oldSize is 2^31, so newSize overflows to zero, and it allocates a
zero-length newBuf and proceeds to copy ~8GB of data there. No surprise
that we get a SIGSEGV, but before that happens it'll have trashed a ton of
memory. I suspect whatever problems valgrind is reporting are just false
positives induced by that massive memory stomp.
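The wraparound is easy to demonstrate in isolation (a standalone program,
nothing ICU-specific; on a 64-bit box the numbers match the
__len=8589934592 seen in the backtraces):

    #include <stdint.h>
    #include <stdio.h>

    int
    main(void)
    {
        uint32_t oldSize = UINT32_C(1) << 31;   /* 2^31 CEs buffered */
        uint32_t newSize = oldSize * 2;         /* 2^32 wraps to 0 */

        /* malloc() is asked for zero bytes ... */
        printf("malloc size: %zu\n", (size_t) newSize * sizeof(uint32_t));
        /* ... while memcpy() is asked to copy 2^33 bytes */
        printf("memcpy size: %zu\n", (size_t) oldSize * sizeof(uint32_t));
        return 0;
    }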
So there are two separate bugs there --- one is that that loop fails to
terminate, for reasons that are doubtless specific to particular collation
data, and the other is that the expansible-buffer code is not robust.
The other class of failures amounts to this loop iterating till it falls
off the end of memory:
    while(schar > (tchar = *(UCharOffset+offset))) { /* since the contraction codepoints should be ordered, we skip all that are smaller */
        offset++;
    }
which is unsurprising, because (in my core dump) schar is 1113834 which is
larger than any possible UChar value, so the loop cannot terminate except
by crashing. The crash occurred while trying to process this string:
buf2 = 0x1614d20 "requ\364\217\273\252te",
and I do not think it's coincidence that that multibyte character
there corresponds to U+10FEEA or decimal 1113834. Apparently
they've got some bugs with dealing with characters beyond U+FFFF,
at least in certain locales.
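The decoding is easy to check by hand (standalone; the four bytes are
taken from the string above):

    #include <stdio.h>

    int
    main(void)
    {
        /* \364\217\273\252 is F4 8F BB AA, a 4-byte UTF-8 sequence */
        const unsigned char b[4] = {0364, 0217, 0273, 0252};

        /* 4-byte UTF-8 decoding: 3 + 6 + 6 + 6 payload bits */
        unsigned cp = ((b[0] & 0x07u) << 18) |
                      ((b[1] & 0x3Fu) << 12) |
                      ((b[2] & 0x3Fu) << 6)  |
                       (b[3] & 0x3Fu);

        printf("U+%X = %u\n", cp, cp);  /* prints U+10FEEA = 1113834 */
        return 0;
    }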
In short then, ICU 52.1 is just too buggy to contemplate using.
I haven't compared versions to see when these issues were introduced
or fixed. But I'm now thinking that trying to support old ICU
versions is a mistake, and we'd better serve our users by telling
them not to use ICU before some-version-after-52.
regards, tom lane
On Sat, Aug 5, 2017 at 8:26 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
I'm quite disturbed though that the set of installed collations on these
two test cases seem to be entirely different both from each other and from
what you reported. The base collations look generally similar, but the
"keyword variant" versions are not comparable at all. Considering that
the entire reason we are interested in ICU in the first place is its
alleged cross-version collation behavior stability, this gives me the
exact opposite of a warm fuzzy feeling. We need to understand why it's
like that and what we can do to reduce the variation, or else we're just
buying our users enormous future pain. At least with the libc collations,
you can expect that if you have en_US.utf8 available today you will
probably still have en_US.utf8 available tomorrow. I am not seeing any
reason to believe that the same holds for ICU collations.
+1. That seems like something that is important to get right up-front.
--
Peter Geoghegan
On Sun, Aug 6, 2017 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
So there are two separate bugs there --- one is that that loop fails to
terminate, for reasons that are doubtless specific to particular collation
data, and the other is that the expansible-buffer code is not robust.
Interesting.
In short then, ICU 52.1 is just too buggy to contemplate using.
I haven't compared versions to see when these issues were introduced
or fixed. But I'm now thinking that trying to support old ICU
versions is a mistake, and we'd better serve our users by telling
them not to use ICU before some-version-after-52.
Distributions generally back-patch fixes to older ICU versions. What
upstream source did you use, exactly?
It's possible that this bug could be fixed by the Debian maintainer.
This is supposed to be a very stable version of the library. It has
received security updates fairly recently [1]. Maybe this is actually
a regression caused by a badly handled backpatch, like the infamous
random number bug that only appeared in Debian's OpenSSL. Many
important software packages will have a dependency on the Debian 52.1
ICU package, and it really shouldn't be a liability to support it.
(Granted, it does appear to be one right now.)
[1]: http://metadata.ftp-master.debian.org/changelogs/main/i/icu/icu_52.1-8+deb8u5_changelog
--
Peter Geoghegan
Peter Geoghegan <pg@bowt.ie> writes:
On Sun, Aug 6, 2017 at 1:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
In short then, ICU 52.1 is just too buggy to contemplate using.
Distributions generally back-patch fixes to older ICU versions. What
upstream source did you use, exactly?
I went to http://www.icu-project.org/ and downloaded icu4c-52_1-src.tgz.
All the file dates therein seem to be 2013-10-04.
Debian, for one, is evidently not trying very hard in that direction,
since not only are the bugs still there but the line numbers I saw in
my backtraces agreed with Daniel's, indicating they've not changed
much of anything at all in ucol.cpp.
regards, tom lane