HEAD seems to generate larger WAL for GIN indexes
Hi,
I executed the following statements in HEAD and 9.3, and compared
the size of the WAL generated by data insertion into a GIN index.
---------------------
CREATE EXTENSION pg_trgm;
CREATE TABLE hoge (col1 text);
CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
(FASTUPDATE = off);
CHECKPOINT;
SELECT pg_switch_xlog();
SELECT pg_switch_xlog();
SELECT pg_current_xlog_location();
INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
SELECT pg_current_xlog_location();
---------------------
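The WAL volume can be computed from the two reported locations (on 9.2 and later, SELECT pg_xlog_location_diff(end, start) does this server-side). A minimal client-side sketch; the helper names here are mine:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN string like '0/02A7F458' into an absolute byte
    position in the WAL stream: the part before the slash is the high
    32 bits, the part after it the low 32 bits."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def wal_bytes_between(start: str, end: str) -> int:
    """Bytes of WAL generated between two pg_current_xlog_location() calls."""
    return lsn_to_bytes(end) - lsn_to_bytes(start)
```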
The resulting WAL sizes were:
960 MB (9.3)
2113 MB (HEAD)
The WAL generated by HEAD was more than twice as large as that in 9.3.
The source code of the GIN index has recently changed dramatically.
Is the increase in GIN-related WAL intentional, or a bug?
Regards,
--
Fujii Masao
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 03/15/2014 08:40 PM, Fujii Masao wrote:
Hi,
I executed the following statements in HEAD and 9.3, and compared
the size of WAL which were generated by data insertion in GIN index.
---------------------
CREATE EXTENSION pg_trgm;
CREATE TABLE hoge (col1 text);
CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
(FASTUPDATE = off);
CHECKPOINT;
SELECT pg_switch_xlog();
SELECT pg_switch_xlog();
SELECT pg_current_xlog_location();
INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
SELECT pg_current_xlog_location();
---------------------
The results of WAL size are
960 MB (9.3)
2113 MB (HEAD)
The WAL size in HEAD was more than two times bigger than that in 9.3.
Recently the source code of GIN index has been changed dramatically.
Is the increase in GIN-related WAL intentional or a bug?
It was somewhat expected. Updating individual items on the new-format
GIN pages requires decompressing and recompressing the page, and the
recompressed posting lists need to be WAL-logged, which generates much
larger WAL records.
That said, I didn't expect the difference to be quite that big when
you're appending to the end of the table. When the new entries go to the
end of the posting lists, you only need to recompress and WAL-log the
last posting list, which is at most 256 bytes long. But I guess that's
still a lot more WAL than in the old format.
That could be optimized, but I figured we can live with it, thanks to
the fastupdate feature. Fastupdate allows amortizing that cost over
several insertions. But of course, you explicitly disabled that...
- Heikki
On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:
That could be optimized, but I figured we can live with it, thanks to the
fastupdate feature. Fastupdate allows amortizing that cost over several
insertions. But of course, you explicitly disabled that...
Let me know if you want me to write patch addressing this issue.
--
With best regards,
Alexander Korotkov.
On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
That said, I didn't expect the difference to be quite that big when you're
appending to the end of the table. When the new entries go to the end of the
posting lists, you only need to recompress and WAL-log the last posting
list, which is max 256 bytes long. But I guess that's still a lot more WAL
than in the old format.
I ran "pg_xlogdump | grep Gin" and checked the size of GIN-related WAL,
and found that its maximum seems to be more than 256B. Am I missing something?
What I observed is
[In HEAD]
At first, the size of GIN-related WAL is gradually increasing up to about 1400B.
rmgr: Gin len (rec/tot): 48/ 80, tx: 1813,
lsn: 0/020020D8, prev 0/02000070, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: F
rmgr: Gin len (rec/tot): 56/ 88, tx: 1813,
lsn: 0/02002440, prev 0/020023F8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
rmgr: Gin len (rec/tot): 64/ 96, tx: 1813,
lsn: 0/020044D8, prev 0/02004490, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
...
rmgr: Gin len (rec/tot): 1376/ 1408, tx: 1813,
lsn: 0/02A7EE90, prev 0/02A7E910, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 2 isdata: F isleaf: T isdelete: T
rmgr: Gin len (rec/tot): 1392/ 1424, tx: 1813,
lsn: 0/02A7F458, prev 0/02A7F410, bkp: 0000, desc: Create posting
tree, node: 1663/12945/16441 blkno: 4
Then the size drops to about 100B and gradually increases
again up to 320B.
rmgr: Gin len (rec/tot): 116/ 148, tx: 1813,
lsn: 0/02A7F9E8, prev 0/02A7F458, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
1372 (compressed)
rmgr: Gin len (rec/tot): 40/ 72, tx: 1813,
lsn: 0/02A7FA80, prev 0/02A7F9E8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 3 isdata: F isleaf: T isdelete: T
...
rmgr: Gin len (rec/tot): 118/ 150, tx: 1813,
lsn: 0/02A83BA0, prev 0/02A83B58, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
1374 (compressed)
...
rmgr: Gin len (rec/tot): 288/ 320, tx: 1813,
lsn: 0/02AEDE28, prev 0/02AEDCE8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 14 isdata: T isleaf: T unmodified: 1280
length: 1544 (compressed)
Then the size drops to 66B and gradually increases again up to 320B.
This pattern of increase and decrease in WAL size seems to continue.
[In 9.3]
At first, the size of GIN-related WAL is gradually increasing up to about 2700B.
rmgr: Gin len (rec/tot): 52/ 84, tx: 1812,
lsn: 0/02000430, prev 0/020003D8, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 1 offset: 11 nitem: 1 isdata: F isleaf T
isdelete F updateBlkno:4294967295
rmgr: Gin len (rec/tot): 60/ 92, tx: 1812,
lsn: 0/020004D0, prev 0/02000488, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 1 offset: 1 nitem: 1 isdata: F isleaf T
isdelete T updateBlkno:4294967295
...
rmgr: Gin len (rec/tot): 2740/ 2772, tx: 1812,
lsn: 0/026D1670, prev 0/026D0B98, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 5 offset: 2 nitem: 1 isdata: F isleaf T
isdelete T updateBlkno:4294967295
rmgr: Gin len (rec/tot): 2714/ 2746, tx: 1812,
lsn: 0/026D21A8, prev 0/026D2160, bkp: 0000, desc: Create posting
tree, node: 1663/12896/16441 blkno: 6
The size decreases to 66B and then never changes.
rmgr: Gin len (rec/tot): 34/ 66, tx: 1812,
lsn: 0/026D9F00, prev 0/026D9EB8, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 6 offset: 451 nitem: 1 isdata: T isleaf T
isdelete F updateBlkno:4294967295
rmgr: Gin len (rec/tot): 34/ 66, tx: 1812,
lsn: 0/026D9F48, prev 0/026D9F00, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 7 offset: 451 nitem: 1 isdata: T isleaf T
isdelete F updateBlkno:4294967295
rmgr: Gin len (rec/tot): 34/ 66, tx: 1812,
lsn: 0/026D9F90, prev 0/026D9F48, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 8 offset: 451 nitem: 1 isdata: T isleaf T
isdelete F updateBlkno:4294967295
...
This difference in GIN-related WAL size seems to be why HEAD generates
more than twice as much WAL. Unfortunately, the gap in WAL size will
keep growing :(
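For anyone repeating this comparison, the per-record totals in the pg_xlogdump output above can be summed with a short script. This sketch assumes the line format quoted above; the helper name is mine:

```python
import re

# Matches the "len (rec/tot): <rec>/ <tot>" field of a pg_xlogdump line,
# as in the records quoted above.
LEN_RE = re.compile(r"len \(rec/tot\):\s*(\d+)/\s*(\d+)")

def total_gin_wal(lines):
    """Sum the total sizes of all Gin resource-manager records in
    pg_xlogdump output."""
    total = 0
    for line in lines:
        if not line.lstrip().startswith("rmgr: Gin"):
            continue
        m = LEN_RE.search(line)
        if m:
            total += int(m.group(2))  # group 2 is the "tot" size
    return total
```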
That could be optimized, but I figured we can live with it, thanks to the
fastupdate feature. Fastupdate allows amortizing that cost over several
insertions. But of course, you explicitly disabled that...

Let me know if you want me to write patch addressing this issue.
Yeah, I really want you to address this problem! That would definitely be
useful for every user who disables the FASTUPDATE option for some reason.
Regards,
--
Fujii Masao
Fujii Masao wrote:
Yeah, I really want you to address this problem! That would definitely be
useful for every user who disables the FASTUPDATE option for some reason.
Users that disable FASTUPDATE, in my experience, do so because their
stock work_mem is way too high and GIN searches become too slow due to
having to scan too large a list. I think it might make sense to invest
a modest amount of time in getting FASTUPDATE to be sized completely
differently from today -- perhaps base it on a hardcoded factor of
BLCKSZ, rather than work_mem. Or, if we really need to make it
configurable, then let it have its own parameter.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 03/17/2014 03:20 PM, Fujii Masao wrote:
On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
I ran "pg_xlogdump | grep Gin" and checked the size of GIN-related WAL,
and then found its max seems more than 256B. Am I missing something?
What I observed is
[In HEAD]
At first, the size of GIN-related WAL is gradually increasing up to about 1400B.
rmgr: Gin len (rec/tot): 48/ 80, tx: 1813,
lsn: 0/020020D8, prev 0/02000070, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: F
rmgr: Gin len (rec/tot): 56/ 88, tx: 1813,
lsn: 0/02002440, prev 0/020023F8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
rmgr: Gin len (rec/tot): 64/ 96, tx: 1813,
lsn: 0/020044D8, prev 0/02004490, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
...
rmgr: Gin len (rec/tot): 1376/ 1408, tx: 1813,
lsn: 0/02A7EE90, prev 0/02A7E910, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 2 isdata: F isleaf: T isdelete: T
rmgr: Gin len (rec/tot): 1392/ 1424, tx: 1813,
lsn: 0/02A7F458, prev 0/02A7F410, bkp: 0000, desc: Create posting
tree, node: 1663/12945/16441 blkno: 4
This corresponds to the stage where the items are stored in-line in the
entry-tree. After it reaches a certain size, a posting tree is created.
Then the size decreases to about 100B and is gradually increasing
again up to 320B.
rmgr: Gin len (rec/tot): 116/ 148, tx: 1813,
lsn: 0/02A7F9E8, prev 0/02A7F458, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
1372 (compressed)
rmgr: Gin len (rec/tot): 40/ 72, tx: 1813,
lsn: 0/02A7FA80, prev 0/02A7F9E8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 3 isdata: F isleaf: T isdelete: T
...
rmgr: Gin len (rec/tot): 118/ 150, tx: 1813,
lsn: 0/02A83BA0, prev 0/02A83B58, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
1374 (compressed)
...
rmgr: Gin len (rec/tot): 288/ 320, tx: 1813,
lsn: 0/02AEDE28, prev 0/02AEDCE8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 14 isdata: T isleaf: T unmodified: 1280
length: 1544 (compressed)
Then the size decreases to 66B and is gradually increasing again up to 320B.
This increase and decrease of WAL size seems to continue.
Here the new items are appended to posting tree pages. This is where the
maximum of 256 bytes I mentioned applies: 256 bytes is the max size of
one compressed posting list, and the WAL record containing it includes
some other data too, which adds up to that 320 bytes.
[In 9.3]
At first, the size of GIN-related WAL is gradually increasing up to about 2700B.
rmgr: Gin len (rec/tot): 52/ 84, tx: 1812,
lsn: 0/02000430, prev 0/020003D8, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 1 offset: 11 nitem: 1 isdata: F isleaf T
isdelete F updateBlkno:4294967295
rmgr: Gin len (rec/tot): 60/ 92, tx: 1812,
lsn: 0/020004D0, prev 0/02000488, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 1 offset: 1 nitem: 1 isdata: F isleaf T
isdelete T updateBlkno:4294967295
...
rmgr: Gin len (rec/tot): 2740/ 2772, tx: 1812,
lsn: 0/026D1670, prev 0/026D0B98, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 5 offset: 2 nitem: 1 isdata: F isleaf T
isdelete T updateBlkno:4294967295
rmgr: Gin len (rec/tot): 2714/ 2746, tx: 1812,
lsn: 0/026D21A8, prev 0/026D2160, bkp: 0000, desc: Create posting
tree, node: 1663/12896/16441 blkno: 6
The size decreases to 66B and then is never changed.
The same mechanism applies on 9.3, but the insertions into posting tree
pages are constant-size.
That could be optimized, but I figured we can live with it, thanks to the
fastupdate feature. Fastupdate allows amortizing that cost over several
insertions. But of course, you explicitly disabled that...

Let me know if you want me to write patch addressing this issue.

Yeah, I really want you to address this problem! That would definitely be
useful for every user who disables the FASTUPDATE option for some reason.
Ok, let's think about it a little bit. I think there are three fairly
simple ways to address this:
1. The GIN data leaf "recompress" record contains an offset called
"unmodifiedlength", and the data that comes after that offset.
Currently, the record is written so that unmodifiedlength points to the
end of the last compressed posting list stored on the page that was not
modified, followed by all the modified ones. The straightforward way to
cut down the WAL record size would be to be more fine-grained than that,
and for the posting lists that were modified, only store the difference
between the old and new version.
To make this approach work well for random insertions, not just
appending to the end, we would also need to make the logic in
leafRepackItems a bit smarter so that it would not re-encode all the
posting lists, after the first modified one.
2. Instead of storing the new compressed posting list in the WAL record,
store only the new item pointers added to the page. WAL replay would
then have to duplicate the work done in the main insertion code path:
find the right posting lists to insert to, decode them, add the new
items, and re-encode.
The upside of that would be that the WAL format would be very compact.
It would be quite simple to implement - you just need to call the same
functions we use in the main insertion codepath to insert the new items.
It could be more expensive, CPU-wise, to replay the records, however.
This record format would be higher-level, in the sense that we would not
store the physical copy of the compressed posting list as it was formed
originally. The same work would be done at WAL replay. As the code
stands, it will produce exactly the same result, but that's not
guaranteed if we make bugfixes to the code later, and a master and
standby are running different minor version. There's not necessarily
anything wrong with that, but it's something to keep in mind.
3. Just reduce the GinPostingListSegmentMaxSize constant from 256, to
say 128. That would halve the typical size of a WAL record that appends
to the end. However, it would not help with insertions in the middle of
a posting list, only appends to the end, and it would bloat the pages
somewhat, as you would waste more space on the posting list headers.
I'm leaning towards option 2. Alexander, what do you think?
- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
2. Instead of storing the new compressed posting list in the WAL record,
store only the new item pointers added to the page. WAL replay would
then have to duplicate the work done in the main insertion code path:
find the right posting lists to insert to, decode them, add the new
items, and re-encode.
That sounds fairly dangerous ... is any user-defined code involved in
those decisions?
This record format would be higher-level, in the sense that we would not
store the physical copy of the compressed posting list as it was formed
originally. The same work would be done at WAL replay. As the code
stands, it will produce exactly the same result, but that's not
guaranteed if we make bugfixes to the code later, and a master and
standby are running different minor version. There's not necessarily
anything wrong with that, but it's something to keep in mind.
Version skew would be a hazard too, all right. I think it's important
that WAL replay be a pretty mechanical, predictable process.
regards, tom lane
On 03/17/2014 04:33 PM, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
2. Instead of storing the new compressed posting list in the WAL record,
store only the new item pointers added to the page. WAL replay would
then have to duplicate the work done in the main insertion code path:
find the right posting lists to insert to, decode them, add the new
items, and re-encode.
That sounds fairly dangerous ... is any user-defined code involved in
those decisions?
No.
This record format would be higher-level, in the sense that we would not
store the physical copy of the compressed posting list as it was formed
originally. The same work would be done at WAL replay. As the code
stands, it will produce exactly the same result, but that's not
guaranteed if we make bugfixes to the code later, and a master and
standby are running different minor version. There's not necessarily
anything wrong with that, but it's something to keep in mind.
Version skew would be a hazard too, all right. I think it's important
that WAL replay be a pretty mechanical, predictable process.
Yeah. One particular point to note is that if in one place we do the
more "high level" thing and have WAL replay re-encode the page as it
sees fit, then we can *not* rely on the page being byte-by-byte
identical in other places. Like, in vacuum, where items are deleted.
Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct
the page, instead of making a physical copy of the modified parts. And
_bt_restore_page even inserts the items physically in different order
than the normal codepath does. So for good or bad, there is some
precedent for this.
The imminent danger I see is if we change the logic on how the items are
divided into posting lists, and end up in a situation where a master
server adds an item to a page, and it just fits, but with the
compression logic the standby version has, it cannot make it fit. As an
escape hatch for that, we could have the WAL replay code try the
compression again, with a larger max. posting list size, if it doesn't
fit at first. And/or always leave something like 10 bytes of free space
on every data page to make up for small differences in the logic.
- Heikki
On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct the
page, instead of making a physical copy of the modified parts. And
_bt_restore_page even inserts the items physically in different order than
the normal codepath does. So for good or bad, there is some precedent for
this.
Yikes.
The imminent danger I see is if we change the logic on how the items are
divided into posting lists, and end up in a situation where a master server
adds an item to a page, and it just fits, but with the compression logic the
standby version has, it cannot make it fit. As an escape hatch for that, we
could have the WAL replay code try the compression again, with a larger max.
posting list size, if it doesn't fit at first. And/or always leave something
like 10 bytes of free space on every data page to make up for small
differences in the logic.
That scares the crap out of me. I don't see any intrinsic problem
with relying on the existing page contents to figure out how to roll
forward, as PageAddItem does; after all, we do FPIs precisely so that
the page is in a known good state when we start. However, I really
think we ought to try hard to make this deterministic in terms of what
the resulting state of the page is; anything else seems like it's
playing with fire, and I bet we'll get burned sooner rather than
later.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct the
page, instead of making a physical copy of the modified parts. And
_bt_restore_page even inserts the items physically in different order than
the normal codepath does. So for good or bad, there is some precedent for
this.
Yikes.
Yeah. I think it's arguably a bug that _bt_restore_page works like that,
even though we've not been burnt up to now.
The imminent danger I see is if we change the logic on how the items are
divided into posting lists, and end up in a situation where a master server
adds an item to a page, and it just fits, but with the compression logic the
standby version has, it cannot make it fit. As an escape hatch for that, we
could have the WAL replay code try the compression again, with a larger max.
posting list size, if it doesn't fit at first. And/or always leave something
like 10 bytes of free space on every data page to make up for small
differences in the logic.
That scares the crap out of me.
Likewise. Saving some WAL space is *not* worth this kind of risk.
regards, tom lane
On 03/17/2014 05:35 PM, Tom Lane wrote:
Robert Haas <robertmhaas@gmail.com> writes:
On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
The imminent danger I see is if we change the logic on how the items are
divided into posting lists, and end up in a situation where a master server
adds an item to a page, and it just fits, but with the compression logic the
standby version has, it cannot make it fit. As an escape hatch for that, we
could have the WAL replay code try the compression again, with a larger max.
posting list size, if it doesn't fit at first. And/or always leave something
like 10 bytes of free space on every data page to make up for small
differences in the logic.
That scares the crap out of me.
Likewise. Saving some WAL space is *not* worth this kind of risk.
One fairly good compromise would be to include only the new items, not
the whole modified compressed lists, and let the replay logic do the
re-encoding of the posting lists, but also include the cutoff points of
each posting list in the WAL record. That way the replay code would have
no freedom in how it decides to split the items into compressed lists;
that would be fully specified by the WAL record.
Here's a refresher for those who didn't follow the development of the
new page format: The data page basically contains a list of
ItemPointers. The items are compressed, to save disk space. However, to
make random access faster, all the items on the page are not compressed
as one big list. Instead, the big array of items is split into roughly
equal chunks, and each chunk is compressed separately. The chunks are
stored on the page one after another. (The chunks are called "posting
lists" in the code; the struct is called GinPostingListData.)
The compression is completely deterministic (each item is stored as a
varbyte-encoded delta from the previous item), but there are no hard
rules on how the items on the page ought to be divided into the posting
lists. Currently, the code tries to maintain a max size of 256 bytes per
list - but it will cope with any size it finds on disk. This is where
the danger lies, where we could end up with a different physical page
after WAL replay, if we just include the new items in the WAL record.
The WAL replay might decide to split the items into posting lists
differently than was originally done. (as the code stands, it would
always make the same decision, completely deterministically, but that
might change in a minor version if we're not careful)
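To make the encoding concrete, here is an illustrative Python sketch of the delta-plus-varbyte scheme described above: 7 data bits per byte, high bit set when more bytes follow. This is not PostgreSQL's actual on-disk format, which first packs each ItemPointer's block and offset numbers into a single integer (see ginpostinglist.c):

```python
def encode_deltas(items):
    """Varbyte-encode a sorted list of integers as deltas from the
    previous item. Illustrative sketch only, not the real GIN encoding."""
    out = bytearray()
    prev = 0
    for item in items:
        delta = item - prev
        prev = item
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)  # continuation bit: more bytes follow
            else:
                out.append(byte)         # last byte of this delta
                break
    return bytes(out)

def decode_deltas(data):
    """Inverse of encode_deltas: rebuild the original sorted items."""
    items, value, shift, prev = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7                   # continuation: accumulate more bits
        else:
            prev += value                # delta complete; add to running item
            items.append(prev)
            value, shift = 0, 0
    return items
```

Note how densely packed items (small deltas) take one byte each, which is why the compression is deterministic but the page layout still depends on where the segment boundaries are drawn.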
We can tie WAL replay's hands here by including in the WAL record the
list of items that form each posting list. That adds some bloat,
compared to only including the new items, but not too much. (And we
still only need to do that for posting lists following the first
modified one.)
Alexander, would you like to give that a shot, or will I?
- Heikki
On Mon, Mar 17, 2014 at 10:44 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Users that disable FASTUPDATE, in my experience, do so because their
stock work_mem is way too high and GIN searches become too slow due to
having to scan too large a list.
Yes. Another reason I've heard from users is that a GIN index with
FASTUPDATE=off is likely to be smaller than one with FASTUPDATE=on.
I think it might make sense to invest
a modest amount of time in getting FASTUPDATE to be sized completely
differently from today -- perhaps base it on a hardcoded factor of
BLCKSZ, rather than work_mem. Or, if we really need to make it
configurable, then let it have its own parameter.
I prefer to have the parameter. When users create multiple GIN indexes
for various uses, they might want different pending-list thresholds for
each index, so a GIN index parameter might be better than a GUC.
Regards,
--
Fujii Masao
On 15/03/14 20:27, Heikki Linnakangas wrote:
That said, I didn't expect the difference to be quite that big when
you're appending to the end of the table. When the new entries go to
the end of the posting lists, you only need to recompress and WAL-log
the last posting list, which is max 256 bytes long. But I guess that's
still a lot more WAL than in the old format.
That could be optimized, but I figured we can live with it, thanks to
the fastupdate feature. Fastupdate allows amortizing that cost over
several insertions. But of course, you explicitly disabled that...
In a concurrent update environment, fastupdate as it is in 9.2 is not
really useful. You may be able to batch up insertions, but you have no
control over who ends up paying the debt. Doubling the amount of WAL
from GIN indexing would be pretty tough for us: in 9.2 we generate
roughly 1TB of WAL per day, and keep it for some weeks to be able to do
PITR. The WAL is mainly due to GIN index updates as new data is added
and needs to be searchable by users. We do run gzip, which cuts it down
to 25-30% before we keep it around for long, but doubling this is going
to be a migration challenge.
If fast-update could be made to work in an environment where we have
both users searching and manually updating the index, and 4+ backend
processes updating the index concurrently, then it would be a real
benefit.
The GIN index currently contains 70+ million records, with an average
tsvector of 124 terms.
--
Jesper .. trying to add some real-world info.
I came up with the attached patch to reduce the WAL volume of GIN
insertions. It became fairly large, but I guess that's not too
surprising, as the old WAL-logging method was basically to dump the
whole page to the WAL record. This is now a lot more fine-grained and
smarter. I separated constructing the WAL record from copying the
changes back to the disk page, which IMHO is a readability improvement
even though it's more code.
There are two parts to this patch:
* leafRepackItems has been rewritten. The previous coding basically
searched for the first modified item, and decoded and re-encoded
everything on the page after that. Now it tries harder to avoid
re-encoding segments that are still reasonably sized (between 128 and
384 bytes, with the target for new segments being 256 bytes). This ought
to make random updates faster as a bonus, but I didn't performance-test
that.
* Track more carefully which segments on the page have been modified.
The in-memory structure used to manipulate a page now keeps an action
code for each segment, indicating if the segment is completely new,
deleted, or replaced with new content, or if just some new items have
been added to it. These same actions are WAL-logged, and replayed in the
redo routine.
This brings the WAL volume back to the same ballpark as 9.3. Or better,
depending on the operation.
Fujii, Alexander, how does this look to you?
- Heikki
Attachments:
gin-more-compact-wal-1.patch (text/x-diff, +641 -211)
On Thu, Mar 20, 2014 at 1:12 PM, Jesper Krogh <jesper@krogh.cc> wrote:
[...]
Should we try to install some hack around fastupdate for 9.4? I fear
the divergence between reasonable values of work_mem and reasonable
sizes for that list is only going to continue to get bigger. I'm sure
there's somebody out there who has work_mem = 16GB, and stuff like
263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the
appeal of large values.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
[...]
Should we try to install some hack around fastupdate for 9.4? I fear
the divergence between reasonable values of work_mem and reasonable
sizes for that list is only going to continue to get bigger. I'm sure
there's somebody out there who has work_mem = 16GB, and stuff like
263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the
appeal of large values.
Controlling the threshold of the pending list size only by a GUC doesn't
seem reasonable. Users may want to increase the threshold only for a
GIN index which is updated heavily, and decrease it otherwise. So
I think it's better to add a new storage parameter for GIN indexes to
control the threshold, or both a storage parameter and a GUC.
Regards,
--
Fujii Masao
Fujii Masao <masao.fujii@gmail.com> writes:
On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
[...]
Controlling the threshold of the pending list size only by a GUC doesn't
seem reasonable. Users may want to increase the threshold only for a
GIN index which is updated heavily, and decrease it otherwise. So
I think it's better to add a new storage parameter for GIN indexes to
control the threshold, or both a storage parameter and a GUC.
Yeah, -1 for a GUC. A GIN-index-specific storage parameter seems more
appropriate. Or we could just hard-wire some maximum limit. Is it
really likely that users would trouble to set such a parameter if it
existed?
regards, tom lane