jsonb format is pessimal for toast compression
I looked into the issue reported in bug #11109. The problem appears to be
that jsonb's on-disk format is designed in such a way that the leading
portion of any JSON array or object will be fairly incompressible, because
it consists mostly of a strictly-increasing series of integer offsets.
This interacts poorly with the code in pglz_compress() that gives up if
it's found nothing compressible in the first first_success_by bytes of a
value-to-be-compressed. (first_success_by is 1024 in the default set of
compression parameters.)
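For reference, here is a minimal standalone sketch of the early-exit rule
being described (a paraphrase for illustration only, not the actual
pg_lzcompress.c code; the brute-force match search and the names are
stand-ins for the real history-table machinery):

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define FIRST_SUCCESS_BY 1024   /* threshold in the default strategy */
#define MIN_MATCH 4             /* only multi-byte repeats are usable */

/*
 * Returns false if the early-exit rule fires: once FIRST_SUCCESS_BY input
 * bytes have gone by without a single match, the whole datum is stored
 * uncompressed.
 */
static bool
would_keep_compressing(const unsigned char *src, size_t slen)
{
    bool    found_match = false;

    for (size_t i = 0; i + MIN_MATCH <= slen; i++)
    {
        /* the cheap per-byte early-exit test */
        if (!found_match && i >= FIRST_SUCCESS_BY)
            return false;       /* give up; store uncompressed */

        /*
         * Stand-in for the history-table lookup: does this 4-byte window
         * occur anywhere earlier in the input?  (The real code hashes the
         * "next 4 characters" into a history table, as quoted below.)
         */
        for (size_t j = 0; j < i; j++)
        {
            if (memcmp(src + j, src + i, MIN_MATCH) == 0)
            {
                found_match = true;
                break;
            }
        }
    }
    return true;
}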
As an example, here's gdb's report of the bitwise representation of the
example JSON value in the bug thread:
0x2ab85ac: 0x20000005 0x00000004 0x50003098 0x0000309f
0x2ab85bc: 0x000030ae 0x000030b8 0x000030cf 0x000030da
0x2ab85cc: 0x000030df 0x000030ee 0x00003105 0x6b6e756a
0x2ab85dc: 0x400000de 0x00000034 0x00000068 0x0000009c
0x2ab85ec: 0x000000d0 0x00000104 0x00000138 0x0000016c
0x2ab85fc: 0x000001a0 0x000001d4 0x00000208 0x0000023c
0x2ab860c: 0x00000270 0x000002a4 0x000002d8 0x0000030c
0x2ab861c: 0x00000340 0x00000374 0x000003a8 0x000003dc
0x2ab862c: 0x00000410 0x00000444 0x00000478 0x000004ac
0x2ab863c: 0x000004e0 0x00000514 0x00000548 0x0000057c
0x2ab864c: 0x000005b0 0x000005e4 0x00000618 0x0000064c
0x2ab865c: 0x00000680 0x000006b4 0x000006e8 0x0000071c
0x2ab866c: 0x00000750 0x00000784 0x000007b8 0x000007ec
0x2ab867c: 0x00000820 0x00000854 0x00000888 0x000008bc
0x2ab868c: 0x000008f0 0x00000924 0x00000958 0x0000098c
0x2ab869c: 0x000009c0 0x000009f4 0x00000a28 0x00000a5c
0x2ab86ac: 0x00000a90 0x00000ac4 0x00000af8 0x00000b2c
0x2ab86bc: 0x00000b60 0x00000b94 0x00000bc8 0x00000bfc
0x2ab86cc: 0x00000c30 0x00000c64 0x00000c98 0x00000ccc
0x2ab86dc: 0x00000d00 0x00000d34 0x00000d68 0x00000d9c
0x2ab86ec: 0x00000dd0 0x00000e04 0x00000e38 0x00000e6c
0x2ab86fc: 0x00000ea0 0x00000ed4 0x00000f08 0x00000f3c
0x2ab870c: 0x00000f70 0x00000fa4 0x00000fd8 0x0000100c
0x2ab871c: 0x00001040 0x00001074 0x000010a8 0x000010dc
0x2ab872c: 0x00001110 0x00001144 0x00001178 0x000011ac
0x2ab873c: 0x000011e0 0x00001214 0x00001248 0x0000127c
0x2ab874c: 0x000012b0 0x000012e4 0x00001318 0x0000134c
0x2ab875c: 0x00001380 0x000013b4 0x000013e8 0x0000141c
0x2ab876c: 0x00001450 0x00001484 0x000014b8 0x000014ec
0x2ab877c: 0x00001520 0x00001554 0x00001588 0x000015bc
0x2ab878c: 0x000015f0 0x00001624 0x00001658 0x0000168c
0x2ab879c: 0x000016c0 0x000016f4 0x00001728 0x0000175c
0x2ab87ac: 0x00001790 0x000017c4 0x000017f8 0x0000182c
0x2ab87bc: 0x00001860 0x00001894 0x000018c8 0x000018fc
0x2ab87cc: 0x00001930 0x00001964 0x00001998 0x000019cc
0x2ab87dc: 0x00001a00 0x00001a34 0x00001a68 0x00001a9c
0x2ab87ec: 0x00001ad0 0x00001b04 0x00001b38 0x00001b6c
0x2ab87fc: 0x00001ba0 0x00001bd4 0x00001c08 0x00001c3c
0x2ab880c: 0x00001c70 0x00001ca4 0x00001cd8 0x00001d0c
0x2ab881c: 0x00001d40 0x00001d74 0x00001da8 0x00001ddc
0x2ab882c: 0x00001e10 0x00001e44 0x00001e78 0x00001eac
0x2ab883c: 0x00001ee0 0x00001f14 0x00001f48 0x00001f7c
0x2ab884c: 0x00001fb0 0x00001fe4 0x00002018 0x0000204c
0x2ab885c: 0x00002080 0x000020b4 0x000020e8 0x0000211c
0x2ab886c: 0x00002150 0x00002184 0x000021b8 0x000021ec
0x2ab887c: 0x00002220 0x00002254 0x00002288 0x000022bc
0x2ab888c: 0x000022f0 0x00002324 0x00002358 0x0000238c
0x2ab889c: 0x000023c0 0x000023f4 0x00002428 0x0000245c
0x2ab88ac: 0x00002490 0x000024c4 0x000024f8 0x0000252c
0x2ab88bc: 0x00002560 0x00002594 0x000025c8 0x000025fc
0x2ab88cc: 0x00002630 0x00002664 0x00002698 0x000026cc
0x2ab88dc: 0x00002700 0x00002734 0x00002768 0x0000279c
0x2ab88ec: 0x000027d0 0x00002804 0x00002838 0x0000286c
0x2ab88fc: 0x000028a0 0x000028d4 0x00002908 0x0000293c
0x2ab890c: 0x00002970 0x000029a4 0x000029d8 0x00002a0c
0x2ab891c: 0x00002a40 0x00002a74 0x00002aa8 0x00002adc
0x2ab892c: 0x00002b10 0x00002b44 0x00002b78 0x00002bac
0x2ab893c: 0x00002be0 0x00002c14 0x00002c48 0x00002c7c
0x2ab894c: 0x00002cb0 0x00002ce4 0x00002d18 0x32343231
0x2ab895c: 0x74653534 0x74656577 0x33746577 0x77673534
0x2ab896c: 0x74657274 0x33347477 0x72777120 0x20717771
0x2ab897c: 0x65727771 0x20777120 0x66647372 0x73616b6c
0x2ab898c: 0x33353471 0x71772035 0x72777172 0x71727771
0x2ab899c: 0x77203277 0x72777172 0x71727771 0x33323233
0x2ab89ac: 0x6b207732 0x20657773 0x73616673 0x73207372
0x2ab89bc: 0x64736664 0x32343231 0x74653534 0x74656577
0x2ab89cc: 0x33746577 0x77673534 0x74657274 0x33347477
0x2ab89dc: 0x72777120 0x20717771 0x65727771 0x20777120
0x2ab89ec: 0x66647372 0x73616b6c 0x33353471 0x71772035
0x2ab89fc: 0x72777172 0x71727771 0x77203277 0x72777172
0x2ab8a0c: 0x71727771 0x33323233 0x6b207732 0x20657773
0x2ab8a1c: 0x73616673 0x73207372 0x64736664 0x32343231
0x2ab8a2c: 0x74653534 0x74656577 0x33746577 0x77673534
0x2ab8a3c: 0x74657274 0x33347477 0x72777120 0x20717771
0x2ab8a4c: 0x65727771 0x20777120 0x66647372 0x73616b6c
0x2ab8a5c: 0x33353471 0x71772035 0x72777172 0x71727771
0x2ab8a6c: 0x77203277 0x72777172 0x71727771 0x33323233
0x2ab8a7c: 0x6b207732 0x20657773 0x73616673 0x73207372
0x2ab8a8c: 0x64736664 0x32343231 0x74653534 0x74656577
0x2ab8a9c: 0x33746577 0x77673534 0x74657274 0x33347477
0x2ab8aac: 0x72777120 0x20717771 0x65727771 0x20777120
0x2ab8abc: 0x66647372 0x73616b6c 0x33353471 0x71772035
0x2ab8acc: 0x72777172 0x71727771 0x77203277 0x72777172
0x2ab8adc: 0x71727771 0x33323233 0x6b207732 0x20657773
0x2ab8aec: 0x73616673 0x73207372 0x64736664 0x32343231
0x2ab8afc: 0x74653534 0x74656577 0x33746577 0x77673534
...
0x2abb61c: 0x74657274 0x33347477 0x72777120 0x20717771
0x2abb62c: 0x65727771 0x20777120 0x66647372 0x73616b6c
0x2abb63c: 0x33353471 0x71772035 0x72777172 0x71727771
0x2abb64c: 0x77203277 0x72777172 0x71727771 0x33323233
0x2abb65c: 0x6b207732 0x20657773 0x73616673 0x73207372
0x2abb66c: 0x64736664 0x537a6962 0x41706574 0x73756220
0x2abb67c: 0x73656e69 0x74732073 0x45617065 0x746e6576
0x2abb68c: 0x656d6954 0x34313032 0x2d38302d 0x32203730
0x2abb69c: 0x33323a31 0x2e33333a 0x62393434 0x6f4c7a69
0x2abb6ac: 0x69746163 0x61506e6f 0x74736972 0x736e6172
0x2abb6bc: 0x69746361 0x61446e6f 0x30326574 0x302d3431
0x2abb6cc: 0x37302d38 0x3a313220 0x333a3332 0x34342e33
There is plenty of compressible data once we get into the repetitive
strings in the payload part --- but that starts at offset 944, and up to
that point there is nothing that pg_lzcompress can get a handle on. There
are, by definition, no sequences of 4 or more repeated bytes in that area.
I think in principle pg_lzcompress could decide to compress the 3-byte
sequences consisting of the high-order 24 bits of each offset; but it
doesn't choose to do so, probably because of the way its lookup hash table
works:
* pglz_hist_idx -
*
* Computes the history table slot for the lookup by the next 4
* characters in the input.
*
* NB: because we use the next 4 characters, we are not guaranteed to
* find 3-character matches; they very possibly will be in the wrong
* hash list. This seems an acceptable tradeoff for spreading out the
* hash keys more.
For jsonb header data, the "next 4 characters" are *always* different, so
only a chance hash collision can result in a match. There is therefore a
pretty good chance that no compression will occur before it gives up
because of first_success_by.
I'm not sure if there is any easy fix for this. We could possibly change
the default first_success_by value, but I think that'd just be postponing
the problem to larger jsonb objects/arrays, and it would hurt performance
for genuinely incompressible data. A somewhat painful, but not yet
out-of-the-question, alternative is to change the jsonb on-disk
representation. Perhaps the JEntry array could be defined as containing
element lengths instead of element ending offsets. Not sure though if
that would break binary searching for JSON object keys.
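To make the contrast concrete, here is a toy illustration (not the actual
JEntry code; the element size of 52 bytes is borrowed from the dump above,
where the offsets step by 0x34):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/*
 * With N equal-sized elements, the ending-offset array is strictly
 * increasing, so every 4-byte word is distinct; the length array is the
 * same word repeated N times, which any LZ-style compressor handles
 * trivially.
 */
int
main(void)
{
    enum { N = 8, ELEM_LEN = 52 };
    uint32_t    end_offset[N];
    uint32_t    length[N];
    uint32_t    cursor = 0;
    int         i;

    for (i = 0; i < N; i++)
    {
        cursor += ELEM_LEN;
        end_offset[i] = cursor;     /* 0x34, 0x68, 0x9c, ... never repeats */
        length[i] = ELEM_LEN;       /* 0x34, 0x34, 0x34, ... repeats       */
    }

    for (i = 0; i < N; i++)
        printf("entry %d: end_offset=0x%08" PRIx32 "  length=0x%08" PRIx32 "\n",
               i, end_offset[i], length[i]);
    return 0;
}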
regards, tom lane
Apologies if this is a ridiculous suggestion, but I believe that swapping
out the compression algorithm (for Snappy, for example) has been discussed
in the past. I wonder if that algorithm is sufficiently different that it
would produce a better result, and if that might not be preferable to some
of the other options.
On Thu, Aug 7, 2014 at 11:17 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[snip]
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
I looked into the issue reported in bug #11109. The problem appears to be
that jsonb's on-disk format is designed in such a way that the leading
portion of any JSON array or object will be fairly incompressible, because
it consists mostly of a strictly-increasing series of integer offsets.
This interacts poorly with the code in pglz_compress() that gives up if
it's found nothing compressible in the first first_success_by bytes of a
value-to-be-compressed. (first_success_by is 1024 in the default set of
compression parameters.)
I haven't looked at this in any detail, so take this with a grain of
salt, but what about teaching pglz_compress about using an offset
farther into the data, if the incoming data is quite a bit larger than
1k? This is just a test to see if it's worthwhile to keep going, no? I
wonder if this might even be able to be provided as a type-specific
option, to avoid changing the behavior for types other than jsonb in
this regard.
(I'm imagining a boolean saying "pick a random sample", or perhaps a
function which can be called that'll return "here's where you wanna test
if this thing is gonna compress at all")
I'm rather disinclined to change the on-disk format because of this
specific test, that feels a bit like the tail wagging the dog to me,
especially as I do hope that some day we'll figure out a way to use a
better compression algorithm than pglz.
Thanks,
Stephen
On Fri, Aug 8, 2014 at 10:48 AM, Stephen Frost <sfrost@snowman.net> wrote:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
I looked into the issue reported in bug #11109. The problem appears to be
that jsonb's on-disk format is designed in such a way that the leading
portion of any JSON array or object will be fairly incompressible, because
it consists mostly of a strictly-increasing series of integer offsets.
This interacts poorly with the code in pglz_compress() that gives up if
it's found nothing compressible in the first first_success_by bytes of a
value-to-be-compressed. (first_success_by is 1024 in the default set of
compression parameters.)
I haven't looked at this in any detail, so take this with a grain of
salt, but what about teaching pglz_compress about using an offset
farther into the data, if the incoming data is quite a bit larger than
1k? This is just a test to see if it's worthwhile to keep going, no? I
wonder if this might even be able to be provided as a type-specific
option, to avoid changing the behavior for types other than jsonb in
this regard.
+1 for offset. Or sample the data in the beginning, middle and end.
Obviously one could always come up with a worst case, but.
(I'm imagining a boolean saying "pick a random sample", or perhaps a
function which can be called that'll return "here's where you wanna test
if this thing is gonna compress at all")
I'm rather disinclined to change the on-disk format because of this
specific test, that feels a bit like the tail wagging the dog to me,
especially as I do hope that some day we'll figure out a way to use a
better compression algorithm than pglz.
Thanks,
Stephen
--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company
On 08/07/2014 11:17 PM, Tom Lane wrote:
I looked into the issue reported in bug #11109. The problem appears to be
that jsonb's on-disk format is designed in such a way that the leading
portion of any JSON array or object will be fairly incompressible, because
it consists mostly of a strictly-increasing series of integer offsets.
This interacts poorly with the code in pglz_compress() that gives up if
it's found nothing compressible in the first first_success_by bytes of a
value-to-be-compressed. (first_success_by is 1024 in the default set of
compression parameters.)
[snip]
There is plenty of compressible data once we get into the repetitive
strings in the payload part --- but that starts at offset 944, and up to
that point there is nothing that pg_lzcompress can get a handle on. There
are, by definition, no sequences of 4 or more repeated bytes in that area.
I think in principle pg_lzcompress could decide to compress the 3-byte
sequences consisting of the high-order 24 bits of each offset; but it
doesn't choose to do so, probably because of the way its lookup hash table
works:
* pglz_hist_idx -
*
* Computes the history table slot for the lookup by the next 4
* characters in the input.
*
* NB: because we use the next 4 characters, we are not guaranteed to
* find 3-character matches; they very possibly will be in the wrong
* hash list. This seems an acceptable tradeoff for spreading out the
* hash keys more.
For jsonb header data, the "next 4 characters" are *always* different, so
only a chance hash collision can result in a match. There is therefore a
pretty good chance that no compression will occur before it gives up
because of first_success_by.
I'm not sure if there is any easy fix for this. We could possibly change
the default first_success_by value, but I think that'd just be postponing
the problem to larger jsonb objects/arrays, and it would hurt performance
for genuinely incompressible data. A somewhat painful, but not yet
out-of-the-question, alternative is to change the jsonb on-disk
representation. Perhaps the JEntry array could be defined as containing
element lengths instead of element ending offsets. Not sure though if
that would break binary searching for JSON object keys.
Ouch.
Back when this structure was first presented at pgCon 2013, I wondered
if we shouldn't extract the strings into a dictionary, because of key
repetition, and convinced myself that this shouldn't be necessary
because in significant cases TOAST would take care of it.
Maybe we should have pglz_compress() look at the *last* 1024 bytes if it
can't find anything worth compressing in the first, for values larger
than a certain size.
It's worth noting that this is a fairly pathological case. AIUI the
example you constructed has an array with 100k string elements. I don't
think that's typical. So I suspect that unless I've misunderstood the
statement of the problem we're going to find that almost all the jsonb
we will be storing is still compressible.
cheers
andrew
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
I looked into the issue reported in bug #11109. The problem appears to be
that jsonb's on-disk format is designed in such a way that the leading
portion of any JSON array or object will be fairly incompressible, because
it consists mostly of a strictly-increasing series of integer offsets.
This interacts poorly with the code in pglz_compress() that gives up if
it's found nothing compressible in the first first_success_by bytes of a
value-to-be-compressed. (first_success_by is 1024 in the default set of
compression parameters.)
I haven't looked at this in any detail, so take this with a grain of
salt, but what about teaching pglz_compress about using an offset
farther into the data, if the incoming data is quite a bit larger than
1k? This is just a test to see if it's worthwhile to keep going, no?
Well, the point of the existing approach is that it's a *nearly free*
test to see if it's worthwhile to keep going; there's just one if-test
added in the outer loop of the compression code. (cf commit ad434473ebd2,
which added that along with some other changes.) AFAICS, what we'd have
to do to do it as you suggest would be to execute compression on some subset
of the data and then throw away that work entirely. I do not find that
attractive, especially when for most datatypes there's no particular
reason to look at one subset instead of another.
I'm rather disinclined to change the on-disk format because of this
specific test, that feels a bit like the tail wagging the dog to me,
especially as I do hope that some day we'll figure out a way to use a
better compression algorithm than pglz.
I'm unimpressed by that argument too, for a number of reasons:
1. The real problem here is that jsonb is emitting quite a bit of
fundamentally-nonrepetitive data, even when the user-visible input is very
repetitive. That's a compression-unfriendly transformation by anyone's
measure. Assuming that some future replacement for pg_lzcompress() will
nonetheless be able to compress the data strikes me as mostly wishful
thinking. Besides, we'd more than likely have a similar early-exit rule
in any substitute implementation, so that we'd still be at risk even if
it usually worked.
2. Are we going to ship 9.4 without fixing this? I definitely don't see
replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
jsonb is still within the bounds of reason.
Considering all the hype that's built up around jsonb, shipping a design
with a fundamental performance handicap doesn't seem like a good plan
to me. We could perhaps band-aid around it by using different compression
parameters for jsonb, although that would require some painful API changes
since toast_compress_datum() doesn't know what datatype it's operating on.
regards, tom lane
Andrew Dunstan <andrew@dunslane.net> writes:
On 08/07/2014 11:17 PM, Tom Lane wrote:
I looked into the issue reported in bug #11109. The problem appears to be
that jsonb's on-disk format is designed in such a way that the leading
portion of any JSON array or object will be fairly incompressible, because
it consists mostly of a strictly-increasing series of integer offsets.
Ouch.
Back when this structure was first presented at pgCon 2013, I wondered
if we shouldn't extract the strings into a dictionary, because of key
repetition, and convinced myself that this shouldn't be necessary
because in significant cases TOAST would take care of it.
That's not really the issue here, I think. The problem is that a
relatively minor aspect of the representation, namely the choice to store
a series of offsets rather than a series of lengths, produces
nonrepetitive data even when the original input is repetitive.
Maybe we should have pglz_compress() look at the *last* 1024 bytes if it
can't find anything worth compressing in the first, for values larger
than a certain size.
Not possible with anything like the current implementation, since it's
just an on-the-fly status check not a trial compression.
It's worth noting that this is a fairly pathological case. AIUI the
example you constructed has an array with 100k string elements. I don't
think that's typical. So I suspect that unless I've misunderstood the
statement of the problem we're going to find that almost all the jsonb
we will be storing is still compressible.
Actually, the 100K-string example I constructed *did* compress. Larry's
example that's not compressing is only about 12kB. AFAICS, the threshold
for trouble is in the vicinity of 256 array or object entries (resulting
in a 1kB JEntry array). That doesn't seem especially high. There is a
probabilistic component as to whether the early-exit case will actually
fire, since any chance hash collision will probably result in some 3-byte
offset prefix getting compressed. But the fact that a beta tester tripped
over this doesn't leave me with a warm feeling about the odds that it
won't happen much in the field.
regards, tom lane
On 08/08/2014 11:18 AM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 08/07/2014 11:17 PM, Tom Lane wrote:
I looked into the issue reported in bug #11109. The problem appears to be
that jsonb's on-disk format is designed in such a way that the leading
portion of any JSON array or object will be fairly incompressible, because
it consists mostly of a strictly-increasing series of integer offsets.
Back when this structure was first presented at pgCon 2013, I wondered
if we shouldn't extract the strings into a dictionary, because of key
repetition, and convinced myself that this shouldn't be necessary
because in significant cases TOAST would take care of it.
That's not really the issue here, I think. The problem is that a
relatively minor aspect of the representation, namely the choice to store
a series of offsets rather than a series of lengths, produces
nonrepetitive data even when the original input is repetitive.
It would certainly be worth validating that changing this would fix the
problem.
I don't know how invasive that would be - I suspect (without looking
very closely) not terribly much.
2. Are we going to ship 9.4 without fixing this? I definitely don't see
replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
jsonb is still within the bounds of reason.
Considering all the hype that's built up around jsonb, shipping a design
with a fundamental performance handicap doesn't seem like a good plan
to me. We could perhaps band-aid around it by using different compression
parameters for jsonb, although that would require some painful API changes
since toast_compress_datum() doesn't know what datatype it's operating on.
Yeah, it would be a bit painful, but after all finding out this sort of
thing is why we have betas.
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
On 08/08/2014 11:18 AM, Tom Lane wrote:
That's not really the issue here, I think. The problem is that a
relatively minor aspect of the representation, namely the choice to store
a series of offsets rather than a series of lengths, produces
nonrepetitive data even when the original input is repetitive.
It would certainly be worth validating that changing this would fix the
problem.
I don't know how invasive that would be - I suspect (without looking
very closely) not terribly much.
I took a quick look and saw that this wouldn't be that easy to get around.
As I'd suspected upthread, there are places that do random access into a
JEntry array, such as the binary search in findJsonbValueFromContainer().
If we have to add up all the preceding lengths to locate the corresponding
value part, we lose the performance advantages of binary search. AFAICS
that's applied directly to the on-disk representation. I'd thought
perhaps there was always a transformation step to build a pointer list,
but nope.
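To spell out the concern (a simplified sketch, not the actual
findJsonbValueFromContainer() code; SketchEntry stands in for the packed
JEntry word): with stored ending offsets each binary-search probe can locate
an entry's data in O(1), while with stored lengths it has to sum everything
that precedes it:

#include <stddef.h>
#include <stdint.h>

typedef uint32_t SketchEntry;   /* stand-in; real JEntry packs type bits too */

/* Offset-based layout: start of entry i's data is just the previous end. */
static size_t
data_start_from_offsets(const SketchEntry *end_offsets, int i)
{
    return (i == 0) ? 0 : end_offsets[i - 1];   /* O(1) per probe */
}

/* Length-based layout: must accumulate every preceding length. */
static size_t
data_start_from_lengths(const SketchEntry *lengths, int i)
{
    size_t  start = 0;

    for (int j = 0; j < i; j++)                 /* O(i) per probe */
        start += lengths[j];
    return start;
}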
regards, tom lane
On Fri, Aug 8, 2014 at 8:02 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
I'm rather disinclined to change the on-disk format because of this
specific test, that feels a bit like the tail wagging the dog to me,
especially as I do hope that some day we'll figure out a way to use a
better compression algorithm than pglz.
I'm unimpressed by that argument too, for a number of reasons:
1. The real problem here is that jsonb is emitting quite a bit of
fundamentally-nonrepetitive data, even when the user-visible input is very
repetitive. That's a compression-unfriendly transformation by anyone's
measure. Assuming that some future replacement for pg_lzcompress() will
nonetheless be able to compress the data strikes me as mostly wishful
thinking. Besides, we'd more than likely have a similar early-exit rule
in any substitute implementation, so that we'd still be at risk even if
it usually worked.
Would an answer be to switch the location of the jsonb "header" data to the
end of the field as opposed to the beginning of the field? That would allow
pglz to see what it wants to see early on and go to work when possible?
Add an offset at the top of the field to show where to look - but then it
would be the same in terms of functionality outside of that? Or pretty
close?
John
John W Higgins <wishdev@gmail.com> writes:
Would an answer be to switch the location of the jsonb "header" data to the
end of the field as opposed to the beginning of the field? That would allow
pglz to see what it wants to see early on and go to work when possible?
Hm, might work. Seems a bit odd, but it would make pglz_compress happier.
OTOH, the big-picture issue here is that jsonb is generating
noncompressible data in the first place. Putting it somewhere else in the
Datum doesn't change the fact that we're going to have bloated storage,
even if we dodge the early-exit problem. (I suspect the compression
disadvantage vs text/plain-json that I showed yesterday is coming largely
from that offset array.) But I don't currently see how to avoid that and
still preserve the fast binary-search key lookup property, which is surely
a nice thing to have.
regards, tom lane
On 08/08/2014 12:04 PM, John W Higgins wrote:
Would an answer be to switch the location of the jsonb "header" data
to the end of the field as opposed to the beginning of the field? That
would allow pglz to see what it wants to see early on and go to work
when possible?
Add an offset at the top of the field to show where to look - but then
it would be the same in terms of functionality outside of that? Or
pretty close?
That might make building up jsonb structures piece by piece as we do
difficult.
cheers
andrew
On 08/08/2014 11:54 AM, Tom Lane wrote:
Andrew Dunstan <andrew@dunslane.net> writes:
On 08/08/2014 11:18 AM, Tom Lane wrote:
That's not really the issue here, I think. The problem is that a
relatively minor aspect of the representation, namely the choice to store
a series of offsets rather than a series of lengths, produces
nonrepetitive data even when the original input is repetitive.
It would certainly be worth validating that changing this would fix the
problem.
I don't know how invasive that would be - I suspect (without looking
very closely) not terribly much.
I took a quick look and saw that this wouldn't be that easy to get around.
As I'd suspected upthread, there are places that do random access into a
JEntry array, such as the binary search in findJsonbValueFromContainer().
If we have to add up all the preceding lengths to locate the corresponding
value part, we lose the performance advantages of binary search. AFAICS
that's applied directly to the on-disk representation. I'd thought
perhaps there was always a transformation step to build a pointer list,
but nope.
It would be interesting to know what the performance hit would be if we
calculated the offsets/pointers on the fly, especially if we could cache
it somehow. The main benefit of binary search is in saving on
comparisons, especially of strings, ISTM, and that could still be
available - this would just be a bit of extra arithmetic.
cheers
andrew
value-to-be-compressed. (first_success_by is 1024 in the default set of
compression parameters.)
Curious idea: we could swap the JEntry array and the values: values at the
beginning of the type will be caught by pg_lzcompress. But we will need to
know the offset of the JEntry array, so the header will grow to 8 bytes
(actually, it will be a varlena header!)
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
Curious idea: we could swap the JEntry array and the values: values at the
beginning of the type will be caught by pg_lzcompress. But we will need to
know the offset of the JEntry array, so the header will grow to 8 bytes
(actually, it will be a varlena header!)
Maybe I wasn't clear: the jsonb type will start with the string collection
instead of the JEntry array; the JEntry array will be placed at the end of
the object/array. So pg_lzcompress will find repeatable 4-byte pieces in
the first 1024 bytes of the jsonb.
--
Teodor Sigaev E-mail: teodor@sigaev.ru
WWW: http://www.sigaev.ru/
On Fri, Aug 8, 2014 at 11:02:26AM -0400, Tom Lane wrote:
2. Are we going to ship 9.4 without fixing this? I definitely don't see
replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
jsonb is still within the bounds of reason.
FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4
betas and report the problem JSONB columns.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 08/08/2014 08:02 AM, Tom Lane wrote:
2. Are we going to ship 9.4 without fixing this? I definitely don't see
replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
jsonb is still within the bounds of reason.
Considering all the hype that's built up around jsonb, shipping a design
with a fundamental performance handicap doesn't seem like a good plan
to me. We could perhaps band-aid around it by using different compression
parameters for jsonb, although that would require some painful API changes
since toast_compress_datum() doesn't know what datatype it's operating on.
I would rather ship late than ship a noncompressible JSONB.
Once we ship 9.4, many users are going to load 100's of GB into JSONB
fields. Even if we fix the compressibility issue in 9.5, those users
won't be able to fix the compression without rewriting all their data,
which could be prohibitive. And we'll be in a position where we have
to support the 9.4 JSONB format/compression technique for years so that
users aren't blocked from upgrading.
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
On Fri, Aug 8, 2014 at 9:14 PM, Teodor Sigaev <teodor@sigaev.ru> wrote:
Curious idea: we could swap the JEntry array and the values: values at the
beginning of the type will be caught by pg_lzcompress. But we will need to
know the offset of the JEntry array, so the header will grow to 8 bytes
(actually, it will be a varlena header!)
Maybe I wasn't clear: the jsonb type will start with the string collection
instead of the JEntry array; the JEntry array will be placed at the end of
the object/array. So pg_lzcompress will find repeatable 4-byte pieces in
the first 1024 bytes of the jsonb.
Another idea I have is that storing an offset in each JEntry is not necessary
to get the benefit of binary search. Namely, what if we store offsets in every
8th JEntry and lengths in the others? The speed of binary search will be about
the same: the only overhead is calculating offsets within an 8-entry chunk.
But the lengths will probably repeat.
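A sketch of how a lookup might work under that scheme (purely illustrative;
the STRIDE name and the layout are assumptions, not existing code): entries
at indexes divisible by 8 hold absolute ending offsets, the rest hold
lengths, and reconstructing any entry's ending offset costs at most 7
additions, so binary-search probes stay cheap while most of the stored words
become repetitive lengths:

#include <stdint.h>

#define STRIDE 8    /* every 8th entry stores an absolute offset */

/*
 * entries[i] holds an absolute ending offset when i % STRIDE == 0,
 * otherwise the length of element i.  Walk forward from the nearest
 * stored offset: at most STRIDE - 1 additions per lookup.
 */
static uint32_t
entry_end_offset(const uint32_t *entries, int i)
{
    int         base = i - (i % STRIDE);
    uint32_t    off = entries[base];        /* absolute offset */

    for (int j = base + 1; j <= i; j++)
        off += entries[j];                  /* lengths in between */
    return off;
}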
------
With best regards,
Alexander Korotkov.
On Fri, Aug 8, 2014 at 7:35 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
I took a quick look and saw that this wouldn't be that easy to get around.
As I'd suspected upthread, there are places that do random access into a
JEntry array, such as the binary search in findJsonbValueFromContainer().
If we have to add up all the preceding lengths to locate the corresponding
value part, we lose the performance advantages of binary search. AFAICS
that's applied directly to the on-disk representation. I'd thought
perhaps there was always a transformation step to build a pointer list,
but nope.
It would be interesting to know what the performance hit would be if we
calculated the offsets/pointers on the fly, especially if we could cache it
somehow. The main benefit of binary search is in saving on comparisons,
especially of strings, ISTM, and that could still be available - this would
just be a bit of extra arithmetic.
I don't think binary search is the main problem here. Objects are
usually reasonably sized, while arrays are more likely to be huge. To
make matters worse, jsonb -> int goes from O(1) to O(n).
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
On 08/08/2014 06:17 AM, Tom Lane wrote:
I looked into the issue reported in bug #11109. The problem appears to be
that jsonb's on-disk format is designed in such a way that the leading
portion of any JSON array or object will be fairly incompressible, because
it consists mostly of a strictly-increasing series of integer offsets.
How hard and how expensive would it be to teach pg_lzcompress to
apply a delta filter on suitable data? That way, instead of the integers
themselves, their deltas would be fed to the "real" compressor.
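For illustration, a delta filter over 32-bit words might look like this (a
standalone sketch of the general idea, not anything pg_lzcompress currently
does); applied to the strictly-increasing offsets shown upthread it yields
a run of identical words, which an LZ-style compressor handles easily:

#include <stddef.h>
#include <stdint.h>

/*
 * Replace each word by its difference from the previous word.  A strictly
 * increasing offset array with a constant stride becomes a run of identical
 * values; the transform is trivially invertible by a running sum.
 */
static void
delta_filter_u32(uint32_t *words, size_t nwords)
{
    uint32_t    prev = 0;

    for (size_t i = 0; i < nwords; i++)
    {
        uint32_t    cur = words[i];

        words[i] = cur - prev;  /* 0x34, 0x68, 0x9c, ... -> 0x34, 0x34, ... */
        prev = cur;
    }
}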
--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
On Fri, Aug 8, 2014 at 12:41 PM, Ants Aasma <ants@cybertec.at> wrote:
I don't think binary search is the main problem here. Objects are
usually reasonably sized, while arrays are more likely to be huge. To
make matters worse, jsonb -> int goes from O(1) to O(n).
I don't think it's true that arrays are more likely to be huge. That
regression would be bad, but jsonb -> int is not the most compelling
operator by far. The indexable operators (in particular, @>) don't
support subscripting arrays like that, and with good reason.
--
Peter Geoghegan
On Fri, Aug 8, 2014 at 12:06 PM, Josh Berkus <josh@agliodbs.com> wrote:
One we ship 9.4, many users are going to load 100's of GB into JSONB
fields. Even if we fix the compressability issue in 9.5, those users
won't be able to fix the compression without rewriting all their data,
which could be prohibitive. And we'll be in a position where we have
to support the 9.4 JSONB format/compression technique for years so that
users aren't blocked from upgrading.
FWIW, if we take the delicious JSON data as representative, a table
storing that data as jsonb is 1374 MB in size. Whereas an equivalent
table with the data typed using the original json datatype (but with
white space differences more or less ignored, because it was created
using a jsonb -> json cast), the same data is 1352 MB.
Larry's complaint is valid; this is a real problem, and I'd like to
fix it before 9.4 is out. However, let us not lose sight of the fact
that JSON data is usually a poor target for TOAST compression. With
idiomatic usage, redundancy is very much more likely to appear across
rows, and not within individual Datums. Frankly, we aren't doing a
very good job there, and doing better requires an alternative
strategy.
--
Peter Geoghegan
I was not complaining; I think JSONB is awesome.
But I am one of those people who would like to put 100's of GB (or more)
JSON files into Postgres and I am concerned about file size and possible
future changes to the format.
On Fri, Aug 8, 2014 at 7:10 PM, Peter Geoghegan <pg@heroku.com> wrote:
[snip]
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
I looked into the issue reported in bug #11109. The problem appears to be
that jsonb's on-disk format is designed in such a way that the leading
portion of any JSON array or object will be fairly incompressible, because
it consists mostly of a strictly-increasing series of integer offsets.
This interacts poorly with the code in pglz_compress() that gives up if
it's found nothing compressible in the first first_success_by bytes of a
value-to-be-compressed. (first_success_by is 1024 in the default set of
compression parameters.)
I haven't looked at this in any detail, so take this with a grain of
salt, but what about teaching pglz_compress about using an offset
farther into the data, if the incoming data is quite a bit larger than
1k? This is just a test to see if it's worthwhile to keep going, no?
Well, the point of the existing approach is that it's a *nearly free*
test to see if it's worthwhile to keep going; there's just one if-test
added in the outer loop of the compression code. (cf commit ad434473ebd2,
which added that along with some other changes.) AFAICS, what we'd have
to do to do it as you suggest would be to execute compression on some subset
of the data and then throw away that work entirely. I do not find that
attractive, especially when for most datatypes there's no particular
reason to look at one subset instead of another.
Ah, I see- we were using the first block as it means we can reuse the
work done on it if we decide to continue with the compression. Makes
sense. We could possibly arrange to have the amount attempted depend on
the data type, but you point out that we can't do that without teaching
lower components about types, which is less than ideal.
What about considering how large the object is when we are analyzing if
it compresses well overall? That is- for a larger object, make a larger
effort to compress it. There's clearly a pessimistic case which could
arise from that, but it may be better than the current situation.
There's a clear risk that such an algorithm may well be very type
specific, meaning that we make things worse for some types (eg: byteas
which end up never compressing well, where we'd likely spend more CPU time
trying than we do today).
1. The real problem here is that jsonb is emitting quite a bit of
fundamentally-nonrepetitive data, even when the user-visible input is very
repetitive. That's a compression-unfriendly transformation by anyone's
measure. Assuming that some future replacement for pg_lzcompress() will
nonetheless be able to compress the data strikes me as mostly wishful
thinking. Besides, we'd more than likely have a similar early-exit rule
in any substitute implementation, so that we'd still be at risk even if
it usually worked.
I agree that jsonb ends up being nonrepetitive in part, which is why
I've been trying to push the discussion in the direction of making it
more likely for the highly-compressible data to be considered rather
than the start of the jsonb object. I don't care for our compression
algorithm having to be catered to in this regard in general though as
the exact same problem could, and likely does, exist in some real life
bytea-using PG implementations.
I disagree that another algorithm wouldn't be able to manage better on
this data than pglz. pglz, from my experience, is notoriously bad at
certain data sets which other algorithms are not as poorly impacted by.
2. Are we going to ship 9.4 without fixing this? I definitely don't see
replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
jsonb is still within the bounds of reason.
I'd really hate to ship 9.4 without a fix for this, but I have a similar
hard time with shipping 9.4 without the binary search component..
Considering all the hype that's built up around jsonb, shipping a design
with a fundamental performance handicap doesn't seem like a good plan
to me. We could perhaps band-aid around it by using different compression
parameters for jsonb, although that would require some painful API changes
since toast_compress_datum() doesn't know what datatype it's operating on.
I don't like the idea of shipping with this handicap either.
Perhaps another option would be a new storage type which basically says
"just compress it, no matter what"? We'd be able to make that the
default for jsonb columns too, no? Again- I'll admit this is shooting
from the hip, but I wanted to get these thoughts out and I won't have
much more time tonight.
Thanks!
Stephen
* Bruce Momjian (bruce@momjian.us) wrote:
On Fri, Aug 8, 2014 at 11:02:26AM -0400, Tom Lane wrote:
2. Are we going to ship 9.4 without fixing this? I definitely don't see
replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
jsonb is still within the bounds of reason.
FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4
betas and report the problem JSONB columns.
That is *not* a good solution..
Thanks,
Stephen
* Josh Berkus (josh@agliodbs.com) wrote:
On 08/08/2014 08:02 AM, Tom Lane wrote:
2. Are we going to ship 9.4 without fixing this? I definitely don't see
replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
jsonb is still within the bounds of reason.
Considering all the hype that's built up around jsonb, shipping a design
with a fundamental performance handicap doesn't seem like a good plan
to me. We could perhaps band-aid around it by using different compression
parameters for jsonb, although that would require some painful API changes
since toast_compress_datum() doesn't know what datatype it's operating on.
I would rather ship late than ship a noncompressible JSONB.
Once we ship 9.4, many users are going to load 100's of GB into JSONB
fields. Even if we fix the compressibility issue in 9.5, those users
won't be able to fix the compression without rewriting all their data,
which could be prohibitive. And we'll be in a position where we have
to support the 9.4 JSONB format/compression technique for years so that
users aren't blocked from upgrading.
Would you accept removing the binary-search capability from jsonb just
to make it compressible? I certainly wouldn't. I'd hate to ship late
also, but I'd be willing to support that if we can find a good solution
to keep both compressability and binary-search (and provided it doesn't
delay us many months..).
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes:
What about considering how large the object is when we are analyzing if
it compresses well overall?
Hmm, yeah, that's a possibility: we could redefine the limit at which
we bail out in terms of a fraction of the object size instead of a fixed
limit. However, that risks expending a large amount of work before we
bail, if we have a very large incompressible object --- which is not
exactly an unlikely case. Consider for example JPEG images stored as
bytea, which I believe I've heard of people doing. Another issue is
that it's not real clear that that fixes the problem for any fractional
size we'd want to use. In Larry's example of a jsonb value that fails
to compress, the header size is 940 bytes out of about 12K, so we'd be
needing to trial-compress about 10% of the object before we reach
compressible data --- and I doubt his example is worst-case.
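Sketched out, the fraction-based variant under discussion would look roughly
like this (illustrative only; the 10% figure is just an example parameter,
and the real test lives inline in the pglz compression loop rather than in
a separate helper):

#include <stdbool.h>
#include <stddef.h>

/*
 * Bail out of compression after a fraction of the input rather than a
 * fixed 1024 bytes.  As noted above, for a large incompressible datum
 * (e.g. a JPEG stored as bytea) this spends proportionally more effort
 * before giving up.
 */
static bool
should_give_up(bool found_match, size_t bytes_scanned, size_t total_len)
{
    const size_t    fixed_threshold = 1024;                 /* current rule */
    const size_t    fractional_threshold = total_len / 10;  /* example: 10% */
    size_t          threshold = (fractional_threshold > fixed_threshold)
                                ? fractional_threshold
                                : fixed_threshold;

    return !found_match && bytes_scanned >= threshold;
}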
1. The real problem here is that jsonb is emitting quite a bit of
fundamentally-nonrepetitive data, even when the user-visible input is very
repetitive. That's a compression-unfriendly transformation by anyone's
measure.
I disagree that another algorithm wouldn't be able to manage better on
this data than pglz. pglz, from my experience, is notoriously bad at
certain data sets which other algorithms are not as poorly impacted by.
Well, I used to be considered a compression expert, and I'm going to
disagree with you here. It's surely possible that other algorithms would
be able to get some traction where pglz fails to get any, but that doesn't
mean that presenting them with hard-to-compress data in the first place is
a good design decision. There is no scenario in which data like this is
going to be friendly to a general-purpose compression algorithm. It'd
be necessary to have explicit knowledge that the data consists of an
increasing series of four-byte integers to be able to do much with it.
And then such an assumption would break down once you got past the
header ...
Perhaps another option would be a new storage type which basically says
"just compress it, no matter what"? We'd be able to make that the
default for jsonb columns too, no?
Meh. We could do that, but it would still require adding arguments to
toast_compress_datum() that aren't there now. In any case, this is a
band-aid solution; and as Josh notes, once we ship 9.4 we are going to
be stuck with jsonb's on-disk representation pretty much forever.
regards, tom lane
On 08/08/2014 08:45 PM, Tom Lane wrote:
Perhaps another option would be a new storage type which basically says
"just compress it, no matter what"? We'd be able to make that the
default for jsonb columns too, no?
Meh. We could do that, but it would still require adding arguments to
toast_compress_datum() that aren't there now. In any case, this is a
band-aid solution; and as Josh notes, once we ship 9.4 we are going to
be stuck with jsonb's on-disk representation pretty much forever.
Yeah, and almost any other solution is likely to mean non-jsonb users
potentially paying a penalty for fixing this for jsonb. So if we can
adjust the jsonb layout to fix this problem I think we should do so.
cheers
andrew
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Stephen Frost <sfrost@snowman.net> writes:
What about considering how large the object is when we are analyzing if
it compresses well overall?
Hmm, yeah, that's a possibility: we could redefine the limit at which
we bail out in terms of a fraction of the object size instead of a fixed
limit. However, that risks expending a large amount of work before we
bail, if we have a very large incompressible object --- which is not
exactly an unlikely case. Consider for example JPEG images stored as
bytea, which I believe I've heard of people doing. Another issue is
that it's not real clear that that fixes the problem for any fractional
size we'd want to use. In Larry's example of a jsonb value that fails
to compress, the header size is 940 bytes out of about 12K, so we'd be
needing to trial-compress about 10% of the object before we reach
compressible data --- and I doubt his example is worst-case.
Agreed- I tried to allude to that in my prior mail, there's clearly a
concern that we'd make things worse in certain situations. Then again,
at least for that case, we could recommend changing the storage type to
EXTERNAL.
1. The real problem here is that jsonb is emitting quite a bit of
fundamentally-nonrepetitive data, even when the user-visible input is very
repetitive. That's a compression-unfriendly transformation by anyone's
measure.
I disagree that another algorithm wouldn't be able to manage better on
this data than pglz. pglz, from my experience, is notoriously bad at
certain data sets which other algorithms are not as poorly impacted by.
Well, I used to be considered a compression expert, and I'm going to
disagree with you here. It's surely possible that other algorithms would
be able to get some traction where pglz fails to get any, but that doesn't
mean that presenting them with hard-to-compress data in the first place is
a good design decision. There is no scenario in which data like this is
going to be friendly to a general-purpose compression algorithm. It'd
be necessary to have explicit knowledge that the data consists of an
increasing series of four-byte integers to be able to do much with it.
And then such an assumption would break down once you got past the
header ...
I've wondered previously whether we, perhaps, missed the boat pretty
badly by not providing an explicitly versioned per-type compression
capability, such that we wouldn't be stuck with one compression
algorithm for all types, and would be able to version compression types
in a way that would allow us to change them over time, provided the
newer code always understands how to decode X-4 (or whatever) versions
back.
I do agree that it'd be great to represent every type in a highly
compressible way for the sake of the compression algorithm, but
I've not seen any good suggestions for how to make that happen and I've
got a hard time seeing how we could completely change the jsonb storage
format, retain the capabilities it has today, make it highly
compressible, and get 9.4 out this calendar year.
I expect we could trivially add padding into the jsonb header to make it
compress better, for the sake of this particular check, but then we're
going to always be compressing jsonb, even when the user data isn't
actually terribly good for compression, spending a good bit of CPU time
while we're at it.
Perhaps another option would be a new storage type which basically says
"just compress it, no matter what"? We'd be able to make that the
default for jsonb columns too, no?
Meh. We could do that, but it would still require adding arguments to
toast_compress_datum() that aren't there now. In any case, this is a
band-aid solution; and as Josh notes, once we ship 9.4 we are going to
be stuck with jsonb's on-disk representation pretty much forever.
I agree that we need to avoid changing jsonb's on-disk representation.
Have I missed where a good suggestion has been made about how to do that
which preserves the binary-search capabilities and doesn't make the code
much more difficult? Trying to move the header to the end just for the
sake of this doesn't strike me as a good solution as it'll make things
quite a bit more complicated. Is there a way we could interleave the
likely-compressible user data in with the header instead? I've not
looked, but it seems like that's the only reasonable approach to address
this issue in this manner. If that's simply done, then great, but it
strikes me as unlikely to be.
I'll just throw out a bit of a counter-point to all this as well, though: we
don't try to focus on making our on-disk representation of data,
generally, very compressible even though there are filesystems, such as
ZFS, which might benefit from certain rearrangements of our on-disk
formats (no, I don't have any specific recommendations in this vein, but
I certainly don't see anyone else asking after it or asking for us to be
concerned about it). Compression is great and I'd hate to see us have a
format that won't just work with it even though it might be beneficial in
many cases, but I feel the fault here is with the compression algorithm
and the decisions made as part of that operation and not really with
this particular data structure.
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes:
I agree that we need to avoid changing jsonb's on-disk representation.
... post-release, I assume you mean.
Have I missed where a good suggestion has been made about how to do that
which preserves the binary-search capabilities and doesn't make the code
much more difficult?
We don't have one yet, but we've only been thinking about this for a few
hours.
Trying to move the header to the end just for the
sake of this doesn't strike me as a good solution as it'll make things
quite a bit more complicated. Is there a way we could interleave the
likely-compressible user data in with the header instead?
Yeah, I was wondering about that too, but I don't immediately see how to
do it without some sort of preprocessing step when we read the object
(which'd be morally equivalent to converting a series of lengths into a
pointer array). Binary search isn't going to work if the items it's
searching in aren't all the same size.
Having said that, I am not sure that a preprocessing step is a
deal-breaker. It'd be O(N), but with a pretty darn small constant factor,
and for plausible sizes of objects I think the binary search might still
dominate. Worth investigation perhaps.
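For concreteness, here is a minimal sketch (not actual jsonb code; the entry
layout, mask, and names are assumptions for illustration) of what that
one-time preprocessing pass might look like, turning per-element lengths
back into the offset array that binary search wants:

#include <stdint.h>

#define SKETCH_LENMASK 0x0FFFFFFF       /* assume low 28 bits hold a length */

static void
build_offsets(const uint32_t *entries, uint32_t *offsets, int nelems)
{
    uint32_t    off = 0;
    int         i;

    for (i = 0; i < nelems; i++)
    {
        offsets[i] = off;                   /* start offset of element i */
        off += entries[i] & SKETCH_LENMASK; /* advance by the stored length */
    }
}

One pass over N entries, a couple of instructions per entry, which is the
"pretty darn small constant factor" referred to above.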
regards, tom lane
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Stephen Frost <sfrost@snowman.net> writes:
I agree that we need to avoid changing jsonb's on-disk representation.
... post-release, I assume you mean.
Yes.
Have I missed where a good suggestion has been made about how to do that
which preserves the binary-search capabilities and doesn't make the code
much more difficult?
We don't have one yet, but we've only been thinking about this for a few
hours.
Fair enough.
Trying to move the header to the end just for the
sake of this doesn't strike me as a good solution as it'll make things
quite a bit more complicated. Is there a way we could interleave the
likely-compressible user data in with the header instead?
Yeah, I was wondering about that too, but I don't immediately see how to
do it without some sort of preprocessing step when we read the object
(which'd be morally equivalent to converting a series of lengths into a
pointer array). Binary search isn't going to work if the items it's
searching in aren't all the same size.
Having said that, I am not sure that a preprocessing step is a
deal-breaker. It'd be O(N), but with a pretty darn small constant factor,
and for plausible sizes of objects I think the binary search might still
dominate. Worth investigation perhaps.
For my part, I'm less concerned about a preprocessing step which happens
when we store the data and more concerned about ensuring that we're able
to extract data quickly. Perhaps that's simply because I'm used to
writes being more expensive than reads, but I'm not alone in that
regard either. I doubt I'll have time in the next couple of weeks to
look into this and if we're going to want this change for 9.4, we really
need someone working on it sooner rather than later. (To the crowd:) do we
have any takers for this investigation?
Thanks,
Stephen
On Sat, Aug 9, 2014 at 6:15 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Stephen Frost <sfrost@snowman.net> writes:
What about considering how large the object is when we are analyzing if
it compresses well overall?
Hmm, yeah, that's a possibility: we could redefine the limit at which
we bail out in terms of a fraction of the object size instead of a fixed
limit. However, that risks expending a large amount of work before we
bail, if we have a very large incompressible object --- which is not
exactly an unlikely case. Consider for example JPEG images stored as
bytea, which I believe I've heard of people doing. Another issue is
that it's not real clear that that fixes the problem for any fractional
size we'd want to use. In Larry's example of a jsonb value that fails
to compress, the header size is 940 bytes out of about 12K, so we'd be
needing to trial-compress about 10% of the object before we reach
compressible data --- and I doubt his example is worst-case.
1. The real problem here is that jsonb is emitting quite a bit of
fundamentally-nonrepetitive data, even when the user-visible input is very
repetitive. That's a compression-unfriendly transformation by anyone's
measure.
I disagree that another algorithm wouldn't be able to manage better on
this data than pglz. pglz, from my experience, is notoriously bad at
certain data sets which other algorithms are not as poorly impacted by.
Well, I used to be considered a compression expert, and I'm going to
disagree with you here. It's surely possible that other algorithms would
be able to get some traction where pglz fails to get any,
During my previous work in this area, I have seen that some algorithms
use skipping logic, which can be useful when incompressible data is
followed by compressible data (or in general). One such technique could
be: if we don't find any match in the first 4 bytes, skip 4 bytes; if we
don't find a match in the next 8 bytes, skip 8 bytes; and keep doing the
same until we find the first match, at which point we go back to the
beginning of the data. We could follow this logic until we have actually
examined a total of first_success_by bytes. There can be caveats in this
particular skipping scheme, but I just wanted to mention the skipping
idea in general as a way to reduce the number of situations where we bail
out even though there is a lot of compressible data.
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Fri, Aug 8, 2014 at 08:25:04PM -0400, Stephen Frost wrote:
* Bruce Momjian (bruce@momjian.us) wrote:
On Fri, Aug 8, 2014 at 11:02:26AM -0400, Tom Lane wrote:
2. Are we going to ship 9.4 without fixing this? I definitely don't see
replacing pg_lzcompress as being on the agenda for 9.4, whereas changing
jsonb is still within the bounds of reason.
FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4
betas and report the problem JSONB columns.
That is *not* a good solution..
If you change the JSONB binary format, and we can't read the old format,
that is the only option.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
akapila wrote:
On Sat, Aug 9, 2014 at 6:15 AM, Tom Lane <tgl@.pa> wrote:
Stephen Frost <sfrost@> writes:
What about considering how large the object is when we are analyzing if
it compresses well overall?
Hmm, yeah, that's a possibility: we could redefine the limit at which
we bail out in terms of a fraction of the object size instead of a fixed
limit. However, that risks expending a large amount of work before we
bail, if we have a very large incompressible object --- which is not
exactly an unlikely case. Consider for example JPEG images stored as
bytea, which I believe I've heard of people doing. Another issue is
that it's not real clear that that fixes the problem for any fractional
size we'd want to use. In Larry's example of a jsonb value that fails
to compress, the header size is 940 bytes out of about 12K, so we'd be
needing to trial-compress about 10% of the object before we reach
compressible data --- and I doubt his example is worst-case.
1. The real problem here is that jsonb is emitting quite a bit of
fundamentally-nonrepetitive data, even when the user-visible input is very
repetitive. That's a compression-unfriendly transformation by anyone's
measure.
I disagree that another algorithm wouldn't be able to manage better on
this data than pglz. pglz, from my experience, is notoriously bad at
certain data sets which other algorithms are not as poorly impacted by.
Well, I used to be considered a compression expert, and I'm going to
disagree with you here. It's surely possible that other algorithms would
be able to get some traction where pglz fails to get any,
During my previous work in this area, I had seen that some algorithms
use skipping logic which can be useful for incompressible data followed
by compressible data or in general as well.
Random thought from the sideline...
This particular data type has the novel (within PostgreSQL) design of both a
(feature oriented - and sizeable) header and a payload. Is there some way
to add that model into the storage system so that, at a higher level,
separate attempts are made to compress each section, and then the compressed
(or not) results are written out adjacently, with a small header
indicating the length of the stored header and other meta data like whether
each part is compressed and even the type that data represents? For reading
back into memory the header-payload generic type is populated and then the
header and payload decompressed - as needed - then the two parts are fed
into the appropriate type constructor that understands and accepts the two
pieces.
Just hoping to spark an idea here - I don't know enough about the internals
to even guess how close I am to something feasible.
David J.
Bruce,
* Bruce Momjian (bruce@momjian.us) wrote:
On Fri, Aug 8, 2014 at 08:25:04PM -0400, Stephen Frost wrote:
* Bruce Momjian (bruce@momjian.us) wrote:
FYI, pg_upgrade could be taught to refuse to upgrade from earlier 9.4
betas and report the problem JSONB columns.
That is *not* a good solution..
If you change the JSONB binary format, and we can't read the old format,
that is the only option.
Apologies- I had thought you were suggesting this for a 9.4 -> 9.5
conversion, not for just 9.4beta to 9.4. Adding that to pg_upgrade to
address folks upgrading from betas would certainly be fine.
Thanks,
Stephen
Tom Lane <tgl@sss.pgh.pa.us> wrote:
Stephen Frost <sfrost@snowman.net> writes:
Trying to move the header to the end just for the sake of this
doesn't strike me as a good solution as it'll make things quite
a bit more complicated.
Why is that? How much harder would it be to add a single offset
field to the front to point to the part we're shifting to the end?
It is not all that unusual to put a directory at the end, like in
the .zip file format.
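Purely as an illustration of that layout (not a concrete proposal for the
actual jsonb header; the struct and field names here are invented), the
only front-of-datum addition would be a single offset pointing at the
JEntry directory stored after the user data:

#include <stdint.h>

typedef struct SketchJsonbContainer
{
    uint32_t    header;         /* count and flag bits, as today */
    uint32_t    dir_offset;     /* where the trailing JEntry directory starts */
    /* compressible key/value data follows immediately ... */
    /* ... and the JEntry array (the "directory") lives at dir_offset */
} SketchJsonbContainer;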
Is there a way we could interleave the likely-compressible user
data in with the header instead?
Yeah, I was wondering about that too, but I don't immediately see
how to do it without some sort of preprocessing step when we read
the object (which'd be morally equivalent to converting a series
of lengths into a pointer array).
That sounds far more complex and fragile than just moving the
indexes to the end.
--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Kevin Grittner <kgrittn@ymail.com> writes:
Stephen Frost <sfrost@snowman.net> writes:
Trying to move the header to the end just for the sake of this
doesn't strike me as a good solution as it'll make things quite
a bit more complicated.
Why is that? How much harder would it be to add a single offset
field to the front to point to the part we're shifting to the end?
It is not all that unusual to put a directory at the end, like in
the .zip file format.
Yeah, I was wondering that too. Arguably, directory-at-the-end would
be easier to work with for on-the-fly creation, not that we do any
such thing at the moment. I think the main thing that's bugging Stephen
is that doing that just to make pglz_compress happy seems like a kluge
(and I have to agree).
Here's a possibly more concrete thing to think about: we may very well
someday want to support JSONB object field or array element extraction
without reading all blocks of a large toasted JSONB value, if the value is
stored external without compression. We already went to the trouble of
creating analogous logic for substring extraction from a long uncompressed
text or bytea value, so I think this is a plausible future desire. With
the current format you could imagine grabbing the first TOAST chunk, and
then if you see the header is longer than that you can grab the remainder
of the header without any wasted I/O, and for the array-subscripting case
you'd now have enough info to fetch the element value from the body of
the JSONB without any wasted I/O. With directory-at-the-end you'd
have to read the first chunk just to get the directory pointer, and this
would most likely not give you any of the directory proper; but at least
you'd know exactly how big the directory is before you go to read it in.
The former case is probably slightly better. However, if you're doing an
object key lookup not an array element fetch, neither of these formats are
really friendly at all, because each binary-search probe probably requires
bringing in one or two toast chunks from the body of the JSONB value so
you can look at the key text. I'm not sure if there's a way to redesign
the format to make that less painful/expensive --- but certainly, having
the key texts scattered through the JSONB value doesn't seem like a great
thing from this standpoint.
regards, tom lane
On Sat, Aug 9, 2014 at 3:51 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Kevin Grittner <kgrittn@ymail.com> writes:
Stephen Frost <sfrost@snowman.net> writes:
Trying to move the header to the end just for the sake of this
doesn't strike me as a good solution as it'll make things quite
a bit more complicated.
Why is that? How much harder would it be to add a single offset
field to the front to point to the part we're shifting to the end?
It is not all that unusual to put a directory at the end, like in
the .zip file format.
Yeah, I was wondering that too. Arguably, directory-at-the-end would
be easier to work with for on-the-fly creation, not that we do any
such thing at the moment. I think the main thing that's bugging Stephen
is that doing that just to make pglz_compress happy seems like a kluge
(and I have to agree).
Here's a possibly more concrete thing to think about: we may very well
someday want to support JSONB object field or array element extraction
without reading all blocks of a large toasted JSONB value, if the value is
stored external without compression. We already went to the trouble of
creating analogous logic for substring extraction from a long uncompressed
text or bytea value, so I think this is a plausible future desire. With
the current format you could imagine grabbing the first TOAST chunk, and
then if you see the header is longer than that you can grab the remainder
of the header without any wasted I/O, and for the array-subscripting case
you'd now have enough info to fetch the element value from the body of
the JSONB without any wasted I/O. With directory-at-the-end you'd
have to read the first chunk just to get the directory pointer, and this
would most likely not give you any of the directory proper; but at least
you'd know exactly how big the directory is before you go to read it in.
The former case is probably slightly better. However, if you're doing an
object key lookup not an array element fetch, neither of these formats are
really friendly at all, because each binary-search probe probably requires
bringing in one or two toast chunks from the body of the JSONB value so
you can look at the key text. I'm not sure if there's a way to redesign
the format to make that less painful/expensive --- but certainly, having
the key texts scattered through the JSONB value doesn't seem like a great
thing from this standpoint.
I think that's a good point.
On the general topic, I don't think it's reasonable to imagine that
we're going to come up with a single heuristic that works well for
every kind of input data. What pglz is doing - assuming that if the
beginning of the data is incompressible then the rest probably is too
- is fundamentally reasonable, notwithstanding the fact that it
doesn't happen to work out well for JSONB. We might be able to tinker
with that general strategy in some way that seems to fix this case and
doesn't appear to break others, but there's some risk in that, and
there's no obvious reason in my mind why PGLZ should be required to fly
blind. So I think it would be a better idea to arrange some method by
which JSONB (and perhaps other data types) can provide compression
hints to pglz.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
... I think it would be a better idea to arrange some method by
which JSONB (and perhaps other data types) can provide compression
hints to pglz.
I agree with that as a long-term goal, but not sure if it's sane to
push into 9.4.
What we could conceivably do now is (a) add a datatype OID argument to
toast_compress_datum, and (b) hard-wire the selection of a different
compression-parameters struct if it's JSONBOID. The actual fix would
then be to increase the first_success_by field of this alternate struct.
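In code form, that hack might look roughly like the sketch below. The field
names follow pg_lzcompress.h's PGLZ_Strategy (9.4-era location); the values
other than first_success_by are placeholder defaults, not a tested proposal:

#include <limits.h>
#include "utils/pg_lzcompress.h"        /* declares PGLZ_Strategy */

static const PGLZ_Strategy jsonb_compress_strategy = {
    32,             /* min_input_size */
    INT_MAX,        /* max_input_size */
    25,             /* min_comp_rate */
    INT_MAX,        /* first_success_by: never give up early */
    128,            /* match_size_good */
    10              /* match_size_drop */
};

/*
 * and in toast_compress_datum(Datum value, Oid typoid), roughly:
 *
 *     strategy = (typoid == JSONBOID) ? &jsonb_compress_strategy
 *                                     : PGLZ_strategy_default;
 */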
I had been worrying about API-instability risks associated with (a),
but on reflection it seems unlikely that any third-party code calls
toast_compress_datum directly, and anyway it's not something we'd
be back-patching to before 9.4.
The main objection to (b) is that it wouldn't help for domains over jsonb
(and no, I don't want to insert a getBaseType call there to fix that).
A longer-term solution would be to make this some sort of type property
that domains could inherit, like typstorage is already. (Somebody
suggested dealing with this by adding more typstorage values, but
I don't find that attractive; typstorage is known in too many places.)
We'd need some thought about exactly what we want to expose, since
the specific knobs that pglz_compress has today aren't necessarily
good for the long term.
This is all kinda ugly really, but since I'm not hearing brilliant
ideas for redesigning jsonb's storage format, maybe this is the
best we can do for now.
regards, tom lane
On Mon, Aug 11, 2014 at 12:07 PM, Robert Haas <robertmhaas@gmail.com> wrote:
I think that's a good point.
I think that there may be something to be said for the current layout.
Having adjacent keys and values could take better advantage of CPU
cache characteristics. I've heard of approaches to improving B-Tree
locality that forced keys and values to be adjacent on individual
B-Tree pages [1], for example. I've heard of this more than once. And
FWIW, I believe based on earlier research of user requirements in this
area that very large jsonb datums are not considered all that
compelling. Document database systems have considerable limitations
here.
On the general topic, I don't think it's reasonable to imagine that
we're going to come up with a single heuristic that works well for
every kind of input data. What pglz is doing - assuming that if the
beginning of the data is incompressible then the rest probably is too
- is fundamentally reasonable, notwithstanding the fact that it
doesn't happen to work out well for JSONB. We might be able to tinker
with that general strategy in some way that seems to fix this case and
doesn't appear to break others, but there's some risk in that, and
there's no obvious reason in my mind why PGLZ should be required to fly
blind. So I think it would be a better idea to arrange some method by
which JSONB (and perhaps other data types) can provide compression
hints to pglz.
If there is to be any effort to make jsonb a more effective target for
compression, I imagine that that would have to target redundancy
between JSON documents. With idiomatic usage, we can expect plenty of
it.
[1]: http://www.vldb.org/conf/1999/P7.pdf , "We also forced each key
and child pointer to be adjacent to each other physically"
--
Peter Geoghegan
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... I think it would be a better idea to arrange some method by
which JSONB (and perhaps other data types) can provide compression
hints to pglz.
I agree with that as a long-term goal, but not sure if it's sane to
push into 9.4.
Agreed.
What we could conceivably do now is (a) add a datatype OID argument to
toast_compress_datum, and (b) hard-wire the selection of a different
compression-parameters struct if it's JSONBOID. The actual fix would
then be to increase the first_success_by field of this alternate struct.
Isn't the offset-to-compressible-data variable though, depending on the
number of keys, etc? Would we be increasing first_success_by based off
of some function which inspects the object?
I had been worrying about API-instability risks associated with (a),
but on reflection it seems unlikely that any third-party code calls
toast_compress_datum directly, and anyway it's not something we'd
be back-patching to before 9.4.
Agreed.
The main objection to (b) is that it wouldn't help for domains over jsonb
(and no, I don't want to insert a getBaseType call there to fix that).
While not ideal, that seems like an acceptable compromise for 9.4 to me.
A longer-term solution would be to make this some sort of type property
that domains could inherit, like typstorage is already. (Somebody
suggested dealing with this by adding more typstorage values, but
I don't find that attractive; typstorage is known in too many places.)
Think that was me, and having it be something which domains can inherit
makes sense. Would we be able to use this approach to introduce
compression algorithms specific to a type (and to domains inherited from
that type), perhaps? Or even get to a point where we could have a chunk-based
compression scheme for certain types of objects (such as JSONB) where we
keep track of which keys exist at which points in the compressed object,
allowing us to skip to the specific chunk which contains the requested
key, similar to what we do with uncompressed data?
We'd need some thought about exactly what we want to expose, since
the specific knobs that pglz_compress has today aren't necessarily
good for the long term.
Agreed.
This is all kinda ugly really, but since I'm not hearing brilliant
ideas for redesigning jsonb's storage format, maybe this is the
best we can do for now.
This would certainly be an improvement over what's going on now, and I
love the idea of possibly being able to expand this in the future to do
more. What I'd hate to see is having all of this and only ever using it
to say "skip ahead another 1k for JSONB".
Thanks,
Stephen
Stephen Frost <sfrost@snowman.net> writes:
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
What we could conceivably do now is (a) add a datatype OID argument to
toast_compress_datum, and (b) hard-wire the selection of a different
compression-parameters struct if it's JSONBOID. The actual fix would
then be to increase the first_success_by field of this alternate struct.
Isn't the offset-to-compressible-data variable though, depending on the
number of keys, etc? Would we be increasing first_success_by based off
of some function which inspects the object?
Given that this is a short-term hack, I'd be satisfied with setting it
to INT_MAX.
If we got more ambitious, we could consider improving the cutoff logic
so that it gives up at "x% of the object or n bytes, whichever comes
first"; but I'd want to see some hard evidence that that was useful
before adding any more cycles to pglz_compress.
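If someone does want to experiment with that, the rule is simple enough to
express; a sketch follows (the 10% and 8kB numbers are placeholders, not
proposed values, and this is not the actual pglz code path):

#include <stdbool.h>

static inline bool
should_give_up(int bytes_scanned, int source_len, bool found_any_match)
{
    int     limit = source_len / 10;    /* x% of the object ... */

    if (limit > 8192)
        limit = 8192;                   /* ... or n bytes, whichever is less */

    return !found_any_match && bytes_scanned >= limit;
}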
regards, tom lane
* Peter Geoghegan (pg@heroku.com) wrote:
If there is to be any effort to make jsonb a more effective target for
compression, I imagine that that would have to target redundancy
between JSON documents. With idiomatic usage, we can expect plenty of
it.
While I certainly agree, that's a rather different animal to address and
doesn't hold a lot of relevance to the current problem. Or, to put it
another way, I don't think anyone is going to be surprised that two rows
containing the same data (even if they're inserted in the same
transaction and have the same visibility information) are compressed
together in some fashion.
We've got a clear example of someone, quite reasonably, expecting their
JSONB object to be compressed using the normal TOAST mechanism, and
we're failing to do that in cases where it's actually a win to do so.
That's the focus of this discussion and what needs to be addressed
before 9.4 goes out.
Thanks,
Stephen
On Mon, Aug 11, 2014 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote:
We've got a clear example of someone, quite reasonably, expecting their
JSONB object to be compressed using the normal TOAST mechanism, and
we're failing to do that in cases where it's actually a win to do so.
That's the focus of this discussion and what needs to be addressed
before 9.4 goes out.
Sure. I'm not trying to minimize that. We should fix it, certainly.
However, it does bear considering that JSON data, with each document
stored in a row, is not an effective target for TOAST compression in
general, even as text.
--
Peter Geoghegan
On Fri, Aug 8, 2014 at 10:50 PM, Hannu Krosing <hannu@2ndquadrant.com> wrote:
How hard and how expensive would it be to teach pg_lzcompress to
apply a delta filter on suitable data? So that instead of integers, their
deltas will be fed to the "real" compressor
Has anyone given this more thought? I know this might not be 9.4
material, but to me it sounds like the most promising approach, if
it's workable. This isn't a made-up thing; the 7z and LZMA formats
also have an optional delta filter.
Of course with JSONB the problem is figuring out which parts to apply
the delta filter to, and which parts not.
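For illustration, a delta filter in the spirit of the 7z/LZMA one, applied
to 32-bit words (a standalone sketch, not a patch against pg_lzcompress; a
real filter would also need to know which byte ranges to transform):

#include <stddef.h>
#include <stdint.h>

static void
delta_encode_u32(uint32_t *words, size_t nwords)
{
    uint32_t    prev = 0;
    size_t      i;

    for (i = 0; i < nwords; i++)
    {
        uint32_t    cur = words[i];

        words[i] = cur - prev;          /* store the difference */
        prev = cur;
    }
}

static void
delta_decode_u32(uint32_t *words, size_t nwords)
{
    uint32_t    acc = 0;
    size_t      i;

    for (i = 0; i < nwords; i++)
    {
        acc += words[i];                /* undo the difference */
        words[i] = acc;
    }
}

A strictly increasing offset array with a roughly constant stride turns into
a long run of near-identical words, which a general-purpose compressor
handles happily.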
This would also help with integer arrays, containing for example
foreign key values to a serial column. There's bound to be some
redundancy, as nearby serial values are likely to end up close
together. In one of my past projects we used to store large arrays of
integer fkeys, deliberately sorted for duplicate elimination.
For an ideal-case comparison, intar2 could compress to about the same size
as intar1 when compressed with a 4-byte-wide delta filter:
create table intar1 as select array(select 1::int from
generate_series(1,1000000)) a;
create table intar2 as select array(select generate_series(1,1000000)::int) a;
In PostgreSQL 9.3 the sizes are:
select pg_column_size(a) from intar1;
45810
select pg_column_size(a) from intar2;
4000020
So a factor of 87 difference.
Regards,
Marti
On Mon, Aug 11, 2014 at 3:35 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Robert Haas <robertmhaas@gmail.com> writes:
... I think it would be a better idea to arrange some method by
which JSONB (and perhaps other data types) can provide compression
hints to pglz.
I agree with that as a long-term goal, but not sure if it's sane to
push into 9.4.
What we could conceivably do now is (a) add a datatype OID argument to
toast_compress_datum, and (b) hard-wire the selection of a different
compression-parameters struct if it's JSONBOID. The actual fix would
then be to increase the first_success_by field of this alternate struct.
I think it would be perfectly sane to do that for 9.4. It may not be
perfect, but neither is what we have now.
A longer-term solution would be to make this some sort of type property
that domains could inherit, like typstorage is already. (Somebody
suggested dealing with this by adding more typstorage values, but
I don't find that attractive; typstorage is known in too many places.)
We'd need some thought about exactly what we want to expose, since
the specific knobs that pglz_compress has today aren't necessarily
good for the long term.
What would really be ideal here is if the JSON code could inform the
toast compression code "this many initial bytes are likely
incompressible, just pass them through without trying, and then start
compressing at byte N", where N is the byte following the TOC. But I
don't know that there's a reasonable way to implement that.
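Nothing like that exists today, but the shape of such a hint could be as
small as a single number passed alongside the datum; a purely hypothetical
sketch (the struct and field names are invented):

#include <stdint.h>

typedef struct CompressionHint
{
    int32_t     skip_prefix_bytes;  /* store this many leading bytes verbatim,
                                     * start compressing at that offset */
} CompressionHint;

/* jsonb could fill this in with the size of its JEntry header/TOC,
 * and the toaster would run pglz only on the remainder. */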
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Aug 11, 2014 at 01:44:05PM -0700, Peter Geoghegan wrote:
On Mon, Aug 11, 2014 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote:
We've got a clear example of someone, quite reasonably, expecting their
JSONB object to be compressed using the normal TOAST mechanism, and
we're failing to do that in cases where it's actually a win to do so.
That's the focus of this discussion and what needs to be addressed
before 9.4 goes out.
Sure. I'm not trying to minimize that. We should fix it, certainly.
However, it does bear considering that JSON data, with each document
stored in a row is not an effective target for TOAST compression in
general, even as text.
Seems we have two issues:
1) the header makes testing for compression likely to fail
2) use of pointers rather than offsets reduces compression potential
I understand we are focusing on #1, but how much does compression reduce
the storage size with and without #2? Seems we need to know that answer
before deciding if it is worth reducing the ability to do fast lookups
with #2.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Tue, Aug 12, 2014 at 8:00 PM, Bruce Momjian <bruce@momjian.us> wrote:
On Mon, Aug 11, 2014 at 01:44:05PM -0700, Peter Geoghegan wrote:
On Mon, Aug 11, 2014 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote:
We've got a clear example of someone, quite reasonably, expecting their
JSONB object to be compressed using the normal TOAST mechanism, and
we're failing to do that in cases where it's actually a win to do so.
That's the focus of this discussion and what needs to be addressed
before 9.4 goes out.
Sure. I'm not trying to minimize that. We should fix it, certainly.
However, it does bear considering that JSON data, with each document
stored in a row is not an effective target for TOAST compression in
general, even as text.
Seems we have two issues:
1) the header makes testing for compression likely to fail
2) use of pointers rather than offsets reduces compression potential
I do think the best solution for 2 is what's been proposed already, to
do delta-coding of the pointers in chunks (ie, 1 pointer, 15 deltas,
repeat).
But it does make binary search quite more complex.
Alternatively, it could be somewhat compressed as follows:
Segment = 1 pointer head, 15 deltas
Pointer head = pointers[0]
delta[i] = pointers[i] - pointers[0] for i in 1..15
(delta to segment head, not previous value)
Now, you can have 4 types of segments (8, 16, 32, or 64 bits), according
to the size of the deltas. You achieve between 8x and 1x compression, and
even when 1x (no compression), you make it easier for pglz to find
something compressible.
Accessing it is also simple, if you have a segment index (tough part here).
Replace the 15 for something that makes such segment index very compact ;)
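A simplified sketch of that decode path, with 16-entry segments and only the
16-bit delta variant shown (the real idea would pick 8/16/32/64-bit deltas
per segment; names here are illustrative):

#include <stdint.h>

#define SEG_SIZE 16

typedef struct OffsetSegment
{
    uint32_t    head;                   /* absolute offset of entry 0 */
    uint16_t    delta[SEG_SIZE - 1];    /* entry i = head + delta[i - 1] */
} OffsetSegment;

static uint32_t
segment_get_offset(const OffsetSegment *segs, int index)
{
    const OffsetSegment *seg = &segs[index / SEG_SIZE];
    int         pos = index % SEG_SIZE;

    return (pos == 0) ? seg->head : seg->head + seg->delta[pos - 1];
}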
Bruce Momjian <bruce@momjian.us> writes:
Seems we have two issues:
1) the header makes testing for compression likely to fail
2) use of pointers rather than offsets reduces compression potential
I understand we are focusing on #1, but how much does compression reduce
the storage size with and without #2? Seems we need to know that answer
before deciding if it is worth reducing the ability to do fast lookups
with #2.
That's a fair question. I did a very very simple hack to replace the item
offsets with item lengths -- turns out that that mostly requires removing
some code that changes lengths to offsets ;-). I then loaded up Larry's
example of a noncompressible JSON value, and compared pg_column_size()
which is just about the right thing here since it reports datum size after
compression. Remembering that the textual representation is 12353 bytes:
json: 382 bytes
jsonb, using offsets: 12593 bytes
jsonb, using lengths: 406 bytes
So this confirms my suspicion that the choice of offsets not lengths
is what's killing compressibility. If it used lengths, jsonb would be
very nearly as compressible as the original text.
Hack attached in case anyone wants to collect more thorough statistics.
We'd not actually want to do it like this because of the large expense
of recomputing the offsets on-demand all the time. (It does pass the
regression tests, for what that's worth.)
regards, tom lane
Attachments:
jsonb-with-lengths-hack.patch (text/x-diff)
diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index 04f35bf..2297504 100644
*** a/src/backend/utils/adt/jsonb_util.c
--- b/src/backend/utils/adt/jsonb_util.c
*************** convertJsonbArray(StringInfo buffer, JEn
*** 1378,1385 ****
errmsg("total size of jsonb array elements exceeds the maximum of %u bytes",
JENTRY_POSMASK)));
- if (i > 0)
- meta = (meta & ~JENTRY_POSMASK) | totallen;
copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry));
metaoffset += sizeof(JEntry);
}
--- 1378,1383 ----
*************** convertJsonbObject(StringInfo buffer, JE
*** 1430,1437 ****
errmsg("total size of jsonb array elements exceeds the maximum of %u bytes",
JENTRY_POSMASK)));
- if (i > 0)
- meta = (meta & ~JENTRY_POSMASK) | totallen;
copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry));
metaoffset += sizeof(JEntry);
--- 1428,1433 ----
*************** convertJsonbObject(StringInfo buffer, JE
*** 1445,1451 ****
errmsg("total size of jsonb array elements exceeds the maximum of %u bytes",
JENTRY_POSMASK)));
- meta = (meta & ~JENTRY_POSMASK) | totallen;
copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry));
metaoffset += sizeof(JEntry);
}
--- 1441,1446 ----
*************** uniqueifyJsonbObject(JsonbValue *object)
*** 1592,1594 ****
--- 1587,1600 ----
object->val.object.nPairs = res + 1 - object->val.object.pairs;
}
}
+
+ uint32
+ jsonb_get_offset(const JEntry *ja, int index)
+ {
+ uint32 off = 0;
+ int i;
+
+ for (i = 0; i < index; i++)
+ off += JBE_LEN(ja, i);
+ return off;
+ }
diff --git a/src/include/utils/jsonb.h b/src/include/utils/jsonb.h
index 5f2594b..c9b18e1 100644
*** a/src/include/utils/jsonb.h
--- b/src/include/utils/jsonb.h
*************** typedef uint32 JEntry;
*** 153,162 ****
* Macros for getting the offset and length of an element. Note multiple
* evaluations and access to prior array element.
*/
! #define JBE_ENDPOS(je_) ((je_) & JENTRY_POSMASK)
! #define JBE_OFF(ja, i) ((i) == 0 ? 0 : JBE_ENDPOS((ja)[i - 1]))
! #define JBE_LEN(ja, i) ((i) == 0 ? JBE_ENDPOS((ja)[i]) \
! : JBE_ENDPOS((ja)[i]) - JBE_ENDPOS((ja)[i - 1]))
/*
* A jsonb array or object node, within a Jsonb Datum.
--- 153,163 ----
* Macros for getting the offset and length of an element. Note multiple
* evaluations and access to prior array element.
*/
! #define JBE_LENFLD(je_) ((je_) & JENTRY_POSMASK)
! #define JBE_OFF(ja, i) jsonb_get_offset(ja, i)
! #define JBE_LEN(ja, i) JBE_LENFLD((ja)[i])
!
! extern uint32 jsonb_get_offset(const JEntry *ja, int index);
/*
* A jsonb array or object node, within a Jsonb Datum.
I wrote:
That's a fair question. I did a very very simple hack to replace the item
offsets with item lengths -- turns out that that mostly requires removing
some code that changes lengths to offsets ;-). I then loaded up Larry's
example of a noncompressible JSON value, and compared pg_column_size()
which is just about the right thing here since it reports datum size after
compression. Remembering that the textual representation is 12353 bytes:
json: 382 bytes
jsonb, using offsets: 12593 bytes
jsonb, using lengths: 406 bytes
Oh, one more result: if I leave the representation alone, but change
the compression parameters to set first_success_by to INT_MAX, this
value takes up 1397 bytes. So that's better, but still more than a
3X penalty compared to using lengths. (Admittedly, this test value
probably is an outlier compared to normal practice, since it's a hundred
or so repetitions of the same two strings.)
regards, tom lane
On 08/13/2014 09:01 PM, Tom Lane wrote:
I wrote:
That's a fair question. I did a very very simple hack to replace the item
offsets with item lengths -- turns out that that mostly requires removing
some code that changes lengths to offsets ;-). I then loaded up Larry's
example of a noncompressible JSON value, and compared pg_column_size()
which is just about the right thing here since it reports datum size after
compression. Remembering that the textual representation is 12353 bytes:
json: 382 bytes
jsonb, using offsets: 12593 bytes
jsonb, using lengths: 406 bytes
Oh, one more result: if I leave the representation alone, but change
the compression parameters to set first_success_by to INT_MAX, this
value takes up 1397 bytes. So that's better, but still more than a
3X penalty compared to using lengths. (Admittedly, this test value
probably is an outlier compared to normal practice, since it's a hundred
or so repetitions of the same two strings.)
What does changing to lengths do to the speed of other operations?
cheers
andrew
Andrew Dunstan <andrew@dunslane.net> writes:
On 08/13/2014 09:01 PM, Tom Lane wrote:
That's a fair question. I did a very very simple hack to replace the item
offsets with item lengths -- turns out that that mostly requires removing
some code that changes lengths to offsets ;-).
What does changing to lengths do to the speed of other operations?
This was explicitly *not* an attempt to measure the speed issue. To do
a fair trial of that, you'd have to work a good bit harder, methinks.
Examining each of N items would involve O(N^2) work with the patch as
posted, but presumably you could get it down to less in most or all
cases --- in particular, sequential traversal could be done with little
added cost. But it'd take a lot more hacking.
regards, tom lane
On 08/14/2014 04:01 AM, Tom Lane wrote:
I wrote:
That's a fair question. I did a very very simple hack to replace the item
offsets with item lengths -- turns out that that mostly requires removing
some code that changes lengths to offsets ;-). I then loaded up Larry's
example of a noncompressible JSON value, and compared pg_column_size()
which is just about the right thing here since it reports datum size after
compression. Remembering that the textual representation is 12353 bytes:
json: 382 bytes
jsonb, using offsets: 12593 bytes
jsonb, using lengths: 406 bytes
Oh, one more result: if I leave the representation alone, but change
the compression parameters to set first_success_by to INT_MAX, this
value takes up 1397 bytes. So that's better, but still more than a
3X penalty compared to using lengths. (Admittedly, this test value
probably is an outlier compared to normal practice, since it's a hundred
or so repetitions of the same two strings.)
For comparison, here's a patch that implements the scheme that Alexander
Korotkov suggested, where we store an offset every 8th element, and a
length in the others. It compresses Larry's example to 525 bytes.
Increasing the "stride" from 8 to 16 entries, it compresses to 461 bytes.
A nice thing about this patch is that it's on-disk compatible with the
current format, hence initdb is not required.
(The current comments claim that the first element in an array always
has the JENTRY_ISFIRST flags set; that is wrong, there is no such flag.
I removed the flag in commit d9daff0e, but apparently failed to update
the comment and the accompanying JBE_ISFIRST macro. Sorry about that,
will fix. This patch uses the bit that used to be JENTRY_ISFIRST to mark
entries that store a length instead of an end offset.).
- Heikki
Attachments:
jsonb-with-offsets-and-lengths-1.patch (text/x-diff)
diff --git a/src/backend/utils/adt/jsonb_util.c b/src/backend/utils/adt/jsonb_util.c
index 04f35bf..47b2998 100644
--- a/src/backend/utils/adt/jsonb_util.c
+++ b/src/backend/utils/adt/jsonb_util.c
@@ -1378,8 +1378,10 @@ convertJsonbArray(StringInfo buffer, JEntry *pheader, JsonbValue *val, int level
errmsg("total size of jsonb array elements exceeds the maximum of %u bytes",
JENTRY_POSMASK)));
- if (i > 0)
+ if (i % JBE_STORE_LEN_STRIDE == 0)
meta = (meta & ~JENTRY_POSMASK) | totallen;
+ else
+ meta |= JENTRY_HAS_LEN;
copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry));
metaoffset += sizeof(JEntry);
}
@@ -1430,11 +1432,14 @@ convertJsonbObject(StringInfo buffer, JEntry *pheader, JsonbValue *val, int leve
errmsg("total size of jsonb array elements exceeds the maximum of %u bytes",
JENTRY_POSMASK)));
- if (i > 0)
+ if (i % JBE_STORE_LEN_STRIDE == 0)
meta = (meta & ~JENTRY_POSMASK) | totallen;
+ else
+ meta |= JENTRY_HAS_LEN;
copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry));
metaoffset += sizeof(JEntry);
+ /* put value */
convertJsonbValue(buffer, &meta, &pair->value, level);
len = meta & JENTRY_POSMASK;
totallen += len;
@@ -1445,7 +1450,7 @@ convertJsonbObject(StringInfo buffer, JEntry *pheader, JsonbValue *val, int leve
errmsg("total size of jsonb array elements exceeds the maximum of %u bytes",
JENTRY_POSMASK)));
- meta = (meta & ~JENTRY_POSMASK) | totallen;
+ meta |= JENTRY_HAS_LEN;
copyToBuffer(buffer, metaoffset, (char *) &meta, sizeof(JEntry));
metaoffset += sizeof(JEntry);
}
@@ -1592,3 +1597,39 @@ uniqueifyJsonbObject(JsonbValue *object)
object->val.object.nPairs = res + 1 - object->val.object.pairs;
}
}
+
+uint32
+jsonb_get_offset(const JEntry *ja, int index)
+{
+ uint32 off = 0;
+ int i;
+
+ /*
+ * Each absolute entry contains the *end* offset. Start offset of this
+ * entry is equal to the end offset of the previous entry.
+ */
+ for (i = index - 1; i >= 0; i--)
+ {
+ off += JBE_POSFLD(ja[i]);
+ if (!JBE_HAS_LEN(ja[i]))
+ break;
+ }
+ return off;
+}
+
+uint32
+jsonb_get_length(const JEntry *ja, int index)
+{
+ uint32 off;
+ uint32 len;
+
+ if (JBE_HAS_LEN(ja[index]))
+ len = JBE_POSFLD(ja[index]);
+ else
+ {
+ off = jsonb_get_offset(ja, index);
+ len = JBE_POSFLD(ja[index]) - off;
+ }
+
+ return len;
+}
diff --git a/src/include/utils/jsonb.h b/src/include/utils/jsonb.h
index 5f2594b..dae6512 100644
--- a/src/include/utils/jsonb.h
+++ b/src/include/utils/jsonb.h
@@ -102,12 +102,12 @@ typedef struct JsonbValue JsonbValue;
* to JB_FSCALAR | JB_FARRAY.
*
* To encode the length and offset of the variable-length portion of each
- * node in a compact way, the JEntry stores only the end offset within the
- * variable-length portion of the container node. For the first JEntry in the
- * container's JEntry array, that equals to the length of the node data. For
- * convenience, the JENTRY_ISFIRST flag is set. The begin offset and length
- * of the rest of the entries can be calculated using the end offset of the
- * previous JEntry in the array.
+ * node in a compact way, the JEntry stores either the length of the element,
+ * or its end offset within the variable-length portion of the container node.
+ * Entries that store a length are marked with the JENTRY_HAS_LEN flag, other
+ * entries store an end offset. The begin offset and length of each entry
+ * can be calculated by scanning backwards to the previous entry storing an
+ * end offset, and adding up the lengths of the elements in between.
*
* Overall, the Jsonb struct requires 4-bytes alignment. Within the struct,
* the variable-length portion of some node types is aligned to a 4-byte
@@ -121,7 +121,8 @@ typedef struct JsonbValue JsonbValue;
/*
* Jentry format.
*
- * The least significant 28 bits store the end offset of the entry (see
+ * The least significant 28 bits store the end offset or the length of the
+ * entry, depending on whether JENTRY_HAS_LEN flag is set (see
* JBE_ENDPOS, JBE_OFF, JBE_LEN macros below). The next three bits
* are used to store the type of the entry. The most significant bit
* is set on the first entry in an array of JEntrys.
@@ -139,8 +140,10 @@ typedef uint32 JEntry;
#define JENTRY_ISNULL 0x40000000
#define JENTRY_ISCONTAINER 0x50000000 /* array or object */
+#define JENTRY_HAS_LEN 0x80000000
+
/* Note possible multiple evaluations */
-#define JBE_ISFIRST(je_) (((je_) & JENTRY_ISFIRST) != 0)
+#define JBE_HAS_LEN(je_) (((je_) & JENTRY_HAS_LEN) != 0)
#define JBE_ISSTRING(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISSTRING)
#define JBE_ISNUMERIC(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISNUMERIC)
#define JBE_ISCONTAINER(je_) (((je_) & JENTRY_TYPEMASK) == JENTRY_ISCONTAINER)
@@ -150,13 +153,20 @@ typedef uint32 JEntry;
#define JBE_ISBOOL(je_) (JBE_ISBOOL_TRUE(je_) || JBE_ISBOOL_FALSE(je_))
/*
- * Macros for getting the offset and length of an element. Note multiple
- * evaluations and access to prior array element.
+ * Macros for getting the offset and length of an element.
*/
-#define JBE_ENDPOS(je_) ((je_) & JENTRY_POSMASK)
-#define JBE_OFF(ja, i) ((i) == 0 ? 0 : JBE_ENDPOS((ja)[i - 1]))
-#define JBE_LEN(ja, i) ((i) == 0 ? JBE_ENDPOS((ja)[i]) \
- : JBE_ENDPOS((ja)[i]) - JBE_ENDPOS((ja)[i - 1]))
+#define JBE_POSFLD(je_) ((je_) & JENTRY_POSMASK)
+#define JBE_OFF(ja, i) jsonb_get_offset(ja, i)
+#define JBE_LEN(ja, i) jsonb_get_length(ja, i)
+
+/*
+ * Store an absolute end offset every JBE_STORE_LEN_STRIDE elements (for an
+ * array) or key/value pairs (for an object). Others are stored as lengths.
+ */
+#define JBE_STORE_LEN_STRIDE 8
+
+extern uint32 jsonb_get_offset(const JEntry *ja, int index);
+extern uint32 jsonb_get_length(const JEntry *ja, int index);
/*
* A jsonb array or object node, within a Jsonb Datum.
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
For comparison, here's a patch that implements the scheme that Alexander
Korotkov suggested, where we store an offset every 8th element, and a
length in the others. It compresses Larry's example to 525 bytes.
Increasing the "stride" from 8 to 16 entries, it compresses to 461 bytes.
A nice thing about this patch is that it's on-disk compatible with the
current format, hence initdb is not required.
TBH, I think that's about the only nice thing about it :-(. It's
conceptually a mess. And while I agree that this way avoids creating
a big-O performance issue for large arrays/objects, I think the micro
performance is probably going to be not so good. The existing code is
based on the assumption that JBE_OFF() and JBE_LEN() are negligibly cheap;
but with a solution like this, it's guaranteed that one or the other is
going to be not-so-cheap.
I think if we're going to do anything to the representation at all,
we need to refactor the calling code; at least fixing the JsonbIterator
logic so that it tracks the current data offset rather than expecting to
be able to compute it at no cost.
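That refactoring amounts to something like the following sketch, which keeps
a running offset instead of recomputing it per element (jbe_length() stands
in for whatever macro reads the stored length; this is not the actual
iterator code):

#include <stdint.h>

extern uint32_t jbe_length(const uint32_t *entries, int i);     /* assumed */

static void
walk_elements(const uint32_t *entries, const char *data, int nelems)
{
    uint32_t    cur_offset = 0;
    int         i;

    for (i = 0; i < nelems; i++)
    {
        const char *elem = data + cur_offset;       /* element i's data */
        uint32_t    len = jbe_length(entries, i);

        /* ... process elem/len here ... */
        (void) elem;

        cur_offset += len;      /* track the offset instead of recomputing it */
    }
}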
The difficulty in arguing about this is that unless we have an agreed-on
performance benchmark test, it's going to be a matter of unsupported
opinions whether one solution is faster than another. Have we got
anything that stresses key lookup and/or array indexing?
regards, tom lane
On Wed, Aug 13, 2014 at 09:01:43PM -0400, Tom Lane wrote:
I wrote:
That's a fair question. I did a very very simple hack to replace the item
offsets with item lengths -- turns out that that mostly requires removing
some code that changes lengths to offsets ;-). I then loaded up Larry's
example of a noncompressible JSON value, and compared pg_column_size()
which is just about the right thing here since it reports datum size after
compression. Remembering that the textual representation is 12353 bytes:
json: 382 bytes
jsonb, using offsets: 12593 bytes
jsonb, using lengths: 406 bytes
Oh, one more result: if I leave the representation alone, but change
the compression parameters to set first_success_by to INT_MAX, this
value takes up 1397 bytes. So that's better, but still more than a
3X penalty compared to using lengths. (Admittedly, this test value
probably is an outlier compared to normal practice, since it's a hundred
or so repetitions of the same two strings.)
Uh, can we get compression for actual documents, rather than duplicate
strings?
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
Bruce Momjian <bruce@momjian.us> writes:
Uh, can we get compression for actual documents, rather than duplicate
strings?
[ shrug... ] What's your proposed set of "actual documents"?
I don't think we have any corpus of JSON docs that are all large
enough to need compression.
This gets back to the problem of what test case are we going to consider
while debating what solution to adopt.
regards, tom lane
On Thu, Aug 14, 2014 at 12:22:46PM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Uh, can we get compression for actual documents, rather than duplicate
strings?
[ shrug... ] What's your proposed set of "actual documents"?
I don't think we have any corpus of JSON docs that are all large
enough to need compression.
This gets back to the problem of what test case are we going to consider
while debating what solution to adopt.
Uh, we just need one 12k JSON document from somewhere. Clearly this
is something we can easily get.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On Thu, Aug 14, 2014 at 11:52 AM, Bruce Momjian <bruce@momjian.us> wrote:
On Thu, Aug 14, 2014 at 12:22:46PM -0400, Tom Lane wrote:
Bruce Momjian <bruce@momjian.us> writes:
Uh, can we get compression for actual documents, rather than duplicate
strings?
[ shrug... ] What's your proposed set of "actual documents"?
I don't think we have any corpus of JSON docs that are all large
enough to need compression.This gets back to the problem of what test case are we going to consider
while debating what solution to adopt.
Uh, we just need one 12k JSON document from somewhere. Clearly this
is something we can easily get.
it's trivial to make a large json[b] document:
select length(to_json(array(select row(a.*) from pg_attribute a))::TEXT);
select
Bruce Momjian <bruce@momjian.us> writes:
On Thu, Aug 14, 2014 at 12:22:46PM -0400, Tom Lane wrote:
This gets back to the problem of what test case are we going to consider
while debating what solution to adopt.
Uh, we just need one 12k JSON document from somewhere. Clearly this
is something we can easily get.
I would put little faith in a single document as being representative.
To try to get some statistics about a real-world case, I looked at the
delicio.us dataset that someone posted awhile back (1252973 JSON docs).
These have a minimum length (in text representation) of 604 bytes and
a maximum length of 5949 bytes, which means that they aren't going to
tell us all that much about large JSON docs, but this is better than
no data at all.
Since documents of only a couple hundred bytes aren't going to be subject
to compression, I made a table of four columns each containing the same
JSON data, so that each row would be long enough to force the toast logic
to try to do something. (Note that none of these documents are anywhere
near big enough to hit the refuses-to-compress problem.) Given that,
I get the following statistics for pg_column_size():
                              min    max     avg
JSON (text) representation    382    1155    526.5
HEAD's JSONB representation   493    1485    695.1
all-lengths representation    440    1257    615.3
So IOW, on this dataset the existing JSONB representation creates about
32% bloat compared to just storing the (compressed) user-visible text,
and switching to all-lengths would about halve that penalty.
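Roughly, that four-column setup looks like the following sketch (the wide_docs
and raw_docs names are invented; raw_docs(doc text) is assumed to hold the
delicio.us documents):
CREATE TEMP TABLE wide_docs (j1 jsonb, j2 jsonb, j3 jsonb, j4 jsonb);
-- repeat the same document across four columns so each row exceeds the
-- TOAST threshold and the compressor actually gets invoked
INSERT INTO wide_docs
SELECT doc::jsonb, doc::jsonb, doc::jsonb, doc::jsonb FROM raw_docs;
SELECT min(pg_column_size(j1)) AS min_size,
       max(pg_column_size(j1)) AS max_size,
       round(avg(pg_column_size(j1)), 1) AS avg_size
FROM wide_docs;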
Maybe this is telling us it's not worth changing the representation,
and we should just go do something about the first_success_by threshold
and be done. I'm hesitant to draw such conclusions on the basis of a
single use-case though, especially one that doesn't really have that
much use for compression in the first place. Do we have other JSON
corpuses to look at?
regards, tom lane
On Thu, Aug 14, 2014 at 10:57 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Maybe this is telling us it's not worth changing the representation,
and we should just go do something about the first_success_by threshold
and be done. I'm hesitant to draw such conclusions on the basis of a
single use-case though, especially one that doesn't really have that
much use for compression in the first place. Do we have other JSON
corpuses to look at?
Yes. Pavel posted some representative JSON data a while back:
http://pgsql.cz/data/data.dump.gz (it's a plain dump)
--
Peter Geoghegan
On Thu, Aug 14, 2014 at 01:57:14PM -0400, Tom Lane wrote:
Maybe this is telling us it's not worth changing the representation,
and we should just go do something about the first_success_by threshold
and be done. I'm hesitant to draw such conclusions on the basis of a
single use-case though, especially one that doesn't really have that
much use for compression in the first place. Do we have other JSON
corpuses to look at?
Yes, that is what I was expecting --- once the whitespace and syntax
sugar is gone in JSONB, I was unclear how much compression would help.
--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com
+ Everyone has their own god. +
On 08/14/2014 11:13 AM, Bruce Momjian wrote:
On Thu, Aug 14, 2014 at 01:57:14PM -0400, Tom Lane wrote:
Maybe this is telling us it's not worth changing the representation,
and we should just go do something about the first_success_by threshold
and be done. I'm hesitant to draw such conclusions on the basis of a
single use-case though, especially one that doesn't really have that
much use for compression in the first place. Do we have other JSON
corpuses to look at?
Yes, that is what I was expecting --- once the whitespace and syntax
sugar is gone in JSONB, I was unclear how much compression would help.
I thought the destruction case was when we have enough top-level keys
that the offsets are more than 1K total, though, yes?
So we need to test that set ...
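A sketch of one such test set (the many_keys name and the 400-key size are
invented; this assumes 4-byte offsets in the jsonb header and the default
first_success_by of 1024, so the leading run of monotonically increasing
offsets is well past the point where pglz gives up):
CREATE TEMP TABLE many_keys AS
SELECT d::jsonb AS doc_jsonb, d AS doc_text
FROM (SELECT json_object_agg('key_' || i, md5(i::text))::text AS d
      FROM generate_series(1, 400) AS i) s;
-- if pglz bails out on the jsonb datum, the gap here should be dramatic
SELECT pg_column_size(doc_jsonb) AS jsonb_size,
       pg_column_size(doc_text)  AS text_size
FROM many_keys;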
--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com
I did a quick test on the same bookmarks dataset to compare the performance of
9.4beta2 and 9.4beta2+patch.
The query was the same one we used in the pgcon presentation:
SELECT count(*) FROM jb WHERE jb @> '{"tags":[{"term":"NYC"}]}'::jsonb;
                 table size | time (ms)
9.4beta2:        1374 MB    | 1160
9.4beta2+patch:  1373 MB    | 1213
Yes, performance degrades, but not by much. There is also a small win in table
size, but the bookmarks are not big, so it's hard to draw conclusions about
compression.
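A rough psql recipe for reproducing this kind of measurement, assuming the jb
table and its index from the pgcon setup already exist:
\timing on
SELECT pg_size_pretty(pg_total_relation_size('jb')) AS table_size;
SELECT count(*) FROM jb WHERE jb @> '{"tags":[{"term":"NYC"}]}'::jsonb;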
Oleg
I attached a json file of approximately 513K. It contains two repetitions
of a single json structure. The values are quasi-random. It might make a
decent test case of meaningfully sized data.
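A possible way to use it, sketched as psql commands (the /tmp/random.json path
is assumed, as is that the unzipped file parses as a single JSON document):
-- read the unzipped file into a psql variable, then store it both ways
\set content `cat /tmp/random.json`
CREATE TEMP TABLE attach_test (j json, b jsonb);
INSERT INTO attach_test VALUES (:'content'::json, :'content'::jsonb);
SELECT pg_column_size(j) AS json_size, pg_column_size(b) AS jsonb_size
FROM attach_test;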
best
Attachments:
random.json.zip (application/zip; binary attachment contents not reproduced here)