HEAD seems to generate larger WAL regarding GIN index

Started by Fujii Masao · almost 12 years ago · 17 messages
#1 Fujii Masao
masao.fujii@gmail.com

Hi,

I executed the following statements in HEAD and 9.3, and compared
the size of the WAL generated by data insertion into a GIN index.

---------------------
CREATE EXTENSION pg_trgm;
CREATE TABLE hoge (col1 text);
CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
(FASTUPDATE = off);

CHECKPOINT;
SELECT pg_switch_xlog();
SELECT pg_switch_xlog();

SELECT pg_current_xlog_location();
INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
SELECT pg_current_xlog_location();
---------------------

The results of WAL size are

960 MB (9.3)
2113 MB (HEAD)

The WAL size in HEAD was more than twice that in 9.3.
Recently the source code of the GIN index has changed dramatically.
Is the increase in GIN-related WAL intentional, or a bug?

Regards,

--
Fujii Masao


#2 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Fujii Masao (#1)
Re: HEAD seems to generate larger WAL regarding GIN index

On 03/15/2014 08:40 PM, Fujii Masao wrote:

Hi,

I executed the following statements in HEAD and 9.3, and compared
the size of the WAL generated by data insertion into a GIN index.

---------------------
CREATE EXTENSION pg_trgm;
CREATE TABLE hoge (col1 text);
CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
(FASTUPDATE = off);

CHECKPOINT;
SELECT pg_switch_xlog();
SELECT pg_switch_xlog();

SELECT pg_current_xlog_location();
INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
SELECT pg_current_xlog_location();
---------------------

The results of WAL size are

960 MB (9.3)
2113 MB (HEAD)

The WAL size in HEAD was more than twice that in 9.3.
Recently the source code of the GIN index has changed dramatically.
Is the increase in GIN-related WAL intentional, or a bug?

It was somewhat expected. Updating individual items on the new-format
GIN pages requires decompressing and recompressing the page, and the
recompressed posting lists need to be WAL-logged, which generates much
larger WAL records.

That said, I didn't expect the difference to be quite that big when
you're appending to the end of the table. When the new entries go to the
end of the posting lists, you only need to recompress and WAL-log the
last posting list, which is max 256 bytes long. But I guess that's still
a lot more WAL than in the old format.

That could be optimized, but I figured we can live with it, thanks to
the fastupdate feature. Fastupdate allows amortizing that cost over
several insertions. But of course, you explicitly disabled that...

- Heikki


#3 Alexander Korotkov
aekorotkov@gmail.com
In reply to: Heikki Linnakangas (#2)
Re: HEAD seems to generate larger WAL regarding GIN index

On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas <
hlinnakangas@vmware.com> wrote:

On 03/15/2014 08:40 PM, Fujii Masao wrote:

Hi,

I executed the following statements in HEAD and 9.3, and compared
the size of the WAL generated by data insertion into a GIN index.

---------------------
CREATE EXTENSION pg_trgm;
CREATE TABLE hoge (col1 text);
CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
(FASTUPDATE = off);

CHECKPOINT;
SELECT pg_switch_xlog();
SELECT pg_switch_xlog();

SELECT pg_current_xlog_location();
INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
SELECT pg_current_xlog_location();
---------------------

The results of WAL size are

960 MB (9.3)
2113 MB (HEAD)

The WAL size in HEAD was more than twice that in 9.3.
Recently the source code of the GIN index has changed dramatically.
Is the increase in GIN-related WAL intentional, or a bug?

It was somewhat expected. Updating individual items on the new-format GIN
pages requires decompressing and recompressing the page, and the
recompressed posting lists need to be WAL-logged, which generates much
larger WAL records.

That said, I didn't expect the difference to be quite that big when you're
appending to the end of the table. When the new entries go to the end of
the posting lists, you only need to recompress and WAL-log the last posting
list, which is max 256 bytes long. But I guess that's still a lot more WAL
than in the old format.

That could be optimized, but I figured we can live with it, thanks to the
fastupdate feature. Fastupdate allows amortizing that cost over several
insertions. But of course, you explicitly disabled that...

Let me know if you want me to write a patch addressing this issue.

------
With best regards,
Alexander Korotkov.

#4 Fujii Masao
masao.fujii@gmail.com
In reply to: Alexander Korotkov (#3)
Re: HEAD seems to generate larger WAL regarding GIN index

On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 03/15/2014 08:40 PM, Fujii Masao wrote:

Hi,

I executed the following statements in HEAD and 9.3, and compared
the size of the WAL generated by data insertion into a GIN index.

---------------------
CREATE EXTENSION pg_trgm;
CREATE TABLE hoge (col1 text);
CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
(FASTUPDATE = off);

CHECKPOINT;
SELECT pg_switch_xlog();
SELECT pg_switch_xlog();

SELECT pg_current_xlog_location();
INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
SELECT pg_current_xlog_location();
---------------------

The results of WAL size are

960 MB (9.3)
2113 MB (HEAD)

The WAL size in HEAD was more than twice that in 9.3.
Recently the source code of the GIN index has changed dramatically.
Is the increase in GIN-related WAL intentional, or a bug?

It was somewhat expected. Updating individual items on the new-format GIN
pages requires decompressing and recompressing the page, and the
recompressed posting lists need to be WAL-logged, which generates much
larger WAL records.

That said, I didn't expect the difference to be quite that big when you're
appending to the end of the table. When the new entries go to the end of the
posting lists, you only need to recompress and WAL-log the last posting
list, which is max 256 bytes long. But I guess that's still a lot more WAL
than in the old format.

I ran "pg_xlogdump | grep Gin" and checked the sizes of the GIN-related WAL
records, and found that their maximum seems to be more than 256B. Am I missing something?

What I observed is

[In HEAD]
At first, the size of GIN-related WAL records gradually increases up to about 1400B.

rmgr: Gin len (rec/tot): 48/ 80, tx: 1813,
lsn: 0/020020D8, prev 0/02000070, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: F
rmgr: Gin len (rec/tot): 56/ 88, tx: 1813,
lsn: 0/02002440, prev 0/020023F8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
rmgr: Gin len (rec/tot): 64/ 96, tx: 1813,
lsn: 0/020044D8, prev 0/02004490, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
...
rmgr: Gin len (rec/tot): 1376/ 1408, tx: 1813,
lsn: 0/02A7EE90, prev 0/02A7E910, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 2 isdata: F isleaf: T isdelete: T
rmgr: Gin len (rec/tot): 1392/ 1424, tx: 1813,
lsn: 0/02A7F458, prev 0/02A7F410, bkp: 0000, desc: Create posting
tree, node: 1663/12945/16441 blkno: 4

Then the size decreases to about 100B and gradually increases
again up to 320B.

rmgr: Gin len (rec/tot): 116/ 148, tx: 1813,
lsn: 0/02A7F9E8, prev 0/02A7F458, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
1372 (compressed)
rmgr: Gin len (rec/tot): 40/ 72, tx: 1813,
lsn: 0/02A7FA80, prev 0/02A7F9E8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 3 isdata: F isleaf: T isdelete: T
...
rmgr: Gin len (rec/tot): 118/ 150, tx: 1813,
lsn: 0/02A83BA0, prev 0/02A83B58, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
1374 (compressed)
...
rmgr: Gin len (rec/tot): 288/ 320, tx: 1813,
lsn: 0/02AEDE28, prev 0/02AEDCE8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 14 isdata: T isleaf: T unmodified: 1280
length: 1544 (compressed)

Then the size decreases to 66B and gradually increases again up to 320B.
This pattern of increasing and decreasing WAL record size seems to continue.

[In 9.3]
At first, the size of GIN-related WAL records gradually increases up to about 2700B.

rmgr: Gin len (rec/tot): 52/ 84, tx: 1812,
lsn: 0/02000430, prev 0/020003D8, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 1 offset: 11 nitem: 1 isdata: F isleaf T
isdelete F updateBlkno:4294967295
rmgr: Gin len (rec/tot): 60/ 92, tx: 1812,
lsn: 0/020004D0, prev 0/02000488, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 1 offset: 1 nitem: 1 isdata: F isleaf T
isdelete T updateBlkno:4294967295
...
rmgr: Gin len (rec/tot): 2740/ 2772, tx: 1812,
lsn: 0/026D1670, prev 0/026D0B98, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 5 offset: 2 nitem: 1 isdata: F isleaf T
isdelete T updateBlkno:4294967295
rmgr: Gin len (rec/tot): 2714/ 2746, tx: 1812,
lsn: 0/026D21A8, prev 0/026D2160, bkp: 0000, desc: Create posting
tree, node: 1663/12896/16441 blkno: 6

The size decreases to 66B and then never changes.

rmgr: Gin len (rec/tot): 34/ 66, tx: 1812,
lsn: 0/026D9F00, prev 0/026D9EB8, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 6 offset: 451 nitem: 1 isdata: T isleaf T
isdelete F updateBlkno:4294967295
rmgr: Gin len (rec/tot): 34/ 66, tx: 1812,
lsn: 0/026D9F48, prev 0/026D9F00, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 7 offset: 451 nitem: 1 isdata: T isleaf T
isdelete F updateBlkno:4294967295
rmgr: Gin len (rec/tot): 34/ 66, tx: 1812,
lsn: 0/026D9F90, prev 0/026D9F48, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 8 offset: 451 nitem: 1 isdata: T isleaf T
isdelete F updateBlkno:4294967295
...

This difference in GIN-related WAL record size seems to be why HEAD generates
more than twice as much WAL. Unfortunately the gap in WAL size will keep
growing :(

That could be optimized, but I figured we can live with it, thanks to the
fastupdate feature. Fastupdate allows amortizing that cost over several
insertions. But of course, you explicitly disabled that...

Let me know if you want me to write a patch addressing this issue.

Yeah, I really want you to address this problem! That's definitely useful
for every user who disables the FASTUPDATE option for some reason.

Regards,

--
Fujii Masao


#5 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Fujii Masao (#4)
Re: HEAD seems to generate larger WAL regarding GIN index

Fujii Masao escribió:

On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

That could be optimized, but I figured we can live with it, thanks to the
fastupdate feature. Fastupdate allows amortizing that cost over several
insertions. But of course, you explicitly disabled that...

Let me know if you want me to write a patch addressing this issue.

Yeah, I really want you to address this problem! That's definitely useful
for every user who disables the FASTUPDATE option for some reason.

Users that disable FASTUPDATE, in my experience, do so because their
stock work_mem is way too high and GIN searches become too slow due to
having to scan too large a list. I think it might make sense to invest
a modest amount of time in getting FASTUPDATE to be sized completely
differently from today -- perhaps base it on a hardcoded factor of
BLCKSZ, rather than work_mem. Or, if we really need to make it
configurable, then let it have its own parameter.

--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Fujii Masao (#4)
Re: HEAD seems to generate larger WAL regarding GIN index

On 03/17/2014 03:20 PM, Fujii Masao wrote:

On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

I ran "pg_xlogdump | grep Gin" and checked the sizes of the GIN-related WAL
records, and found that their maximum seems to be more than 256B. Am I missing something?

What I observed is

[In HEAD]
At first, the size of GIN-related WAL records gradually increases up to about 1400B.
rmgr: Gin len (rec/tot): 48/ 80, tx: 1813,
lsn: 0/020020D8, prev 0/02000070, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: F
rmgr: Gin len (rec/tot): 56/ 88, tx: 1813,
lsn: 0/02002440, prev 0/020023F8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
rmgr: Gin len (rec/tot): 64/ 96, tx: 1813,
lsn: 0/020044D8, prev 0/02004490, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
...
rmgr: Gin len (rec/tot): 1376/ 1408, tx: 1813,
lsn: 0/02A7EE90, prev 0/02A7E910, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 2 isdata: F isleaf: T isdelete: T
rmgr: Gin len (rec/tot): 1392/ 1424, tx: 1813,
lsn: 0/02A7F458, prev 0/02A7F410, bkp: 0000, desc: Create posting
tree, node: 1663/12945/16441 blkno: 4

This corresponds to the stage where the items are stored in-line in the
entry-tree. After it reaches a certain size, a posting tree is created.

Then the size decreases to about 100B and gradually increases
again up to 320B.

rmgr: Gin len (rec/tot): 116/ 148, tx: 1813,
lsn: 0/02A7F9E8, prev 0/02A7F458, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
1372 (compressed)
rmgr: Gin len (rec/tot): 40/ 72, tx: 1813,
lsn: 0/02A7FA80, prev 0/02A7F9E8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 3 isdata: F isleaf: T isdelete: T
...
rmgr: Gin len (rec/tot): 118/ 150, tx: 1813,
lsn: 0/02A83BA0, prev 0/02A83B58, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
1374 (compressed)
...
rmgr: Gin len (rec/tot): 288/ 320, tx: 1813,
lsn: 0/02AEDE28, prev 0/02AEDCE8, bkp: 0000, desc: Insert item, node:
1663/12945/16441 blkno: 14 isdata: T isleaf: T unmodified: 1280
length: 1544 (compressed)

Then the size decreases to 66B and gradually increases again up to 320B.
This pattern of increasing and decreasing WAL record size seems to continue.

Here the new items are appended to posting tree pages. This is where the
maximum of 256 bytes I mentioned applies. 256 bytes is the max size of
one compressed posting list; the WAL record containing it includes some
other stuff too, which adds up to that 320 bytes.

[In 9.3]
At first, the size of GIN-related WAL records gradually increases up to about 2700B.

rmgr: Gin len (rec/tot): 52/ 84, tx: 1812,
lsn: 0/02000430, prev 0/020003D8, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 1 offset: 11 nitem: 1 isdata: F isleaf T
isdelete F updateBlkno:4294967295
rmgr: Gin len (rec/tot): 60/ 92, tx: 1812,
lsn: 0/020004D0, prev 0/02000488, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 1 offset: 1 nitem: 1 isdata: F isleaf T
isdelete T updateBlkno:4294967295
...
rmgr: Gin len (rec/tot): 2740/ 2772, tx: 1812,
lsn: 0/026D1670, prev 0/026D0B98, bkp: 0000, desc: Insert item, node:
1663/12896/16441 blkno: 5 offset: 2 nitem: 1 isdata: F isleaf T
isdelete T updateBlkno:4294967295
rmgr: Gin len (rec/tot): 2714/ 2746, tx: 1812,
lsn: 0/026D21A8, prev 0/026D2160, bkp: 0000, desc: Create posting
tree, node: 1663/12896/16441 blkno: 6

The size decreases to 66B and then never changes.

The same mechanism applies on 9.3, but the insertions into the posting tree
pages are of constant size.

That could be optimized, but I figured we can live with it, thanks to the
fastupdate feature. Fastupdate allows amortizing that cost over several
insertions. But of course, you explicitly disabled that...

Let me know if you want me to write a patch addressing this issue.

Yeah, I really want you to address this problem! That's definitely useful
for every user who disables the FASTUPDATE option for some reason.

Ok, let's think about it a little bit. I think there are three fairly
simple ways to address this:

1. The GIN data leaf "recompress" record contains an offset called
"unmodifiedsize", and the data that comes after that offset.
Currently, the record is written so that unmodifiedsize points to the
end of the last compressed posting list stored on the page that was not
modified, followed by all the modified ones. The straightforward way to
cut down the WAL record size would be to be more fine-grained than that,
and for the posting lists that were modified, only store the difference
between the old and new version.

To make this approach work well for random insertions, not just
appending to the end, we would also need to make the logic in
leafRepackItems a bit smarter so that it would not re-encode all the
posting lists, after the first modified one.

2. Instead of storing the new compressed posting list in the WAL record,
store only the new item pointers added to the page. WAL replay would
then have to duplicate the work done in the main insertion code path:
find the right posting lists to insert to, decode them, add the new
items, and re-encode.

The upside of that would be that the WAL format would be very compact.
It would be quite simple to implement - you just need to call the same
functions we use in the main insertion codepath to insert the new items.
It could be more expensive, CPU-wise, to replay the records, however.

This record format would be higher-level, in the sense that we would not
store the physical copy of the compressed posting list as it was formed
originally. The same work would be done at WAL replay. As the code
stands, it will produce exactly the same result, but that's not
guaranteed if we make bugfixes to the code later, and a master and
standby are running different minor versions. There's not necessarily
anything wrong with that, but it's something to keep in mind.

3. Just reduce the GinPostingListSegmentMaxSize constant from 256 to,
say, 128. That would halve the typical size of a WAL record that appends
to the end. However, it would not help with insertions in the middle of
a posting list, only appends to the end, and it would bloat the pages
somewhat, as you would waste more space on the posting list headers.

I'm leaning towards option 2. Alexander, what do you think?
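
For illustration, an "option 2" style record body might carry nothing more
than the newly inserted item pointers, leaving replay to re-run the normal
insertion path. The struct below is only a hypothetical sketch, not an actual
ginxlog definition:

/*
 * Hypothetical sketch of an "option 2" WAL record body: only the new TIDs
 * are logged. Replay would locate the affected posting lists, decode them,
 * merge these items in, and re-encode, just like the main insertion path.
 */
typedef struct
{
	uint16		nitems;			/* number of newly inserted item pointers */
	/* ItemPointerData newitems[nitems] follows */
} ginxlogInsertItemsSketch;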

- Heikki


#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Heikki Linnakangas (#6)
Re: HEAD seems to generate larger WAL regarding GIN index

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

2. Instead of storing the new compressed posting list in the WAL record,
store only the new item pointers added to the page. WAL replay would
then have to duplicate the work done in the main insertion code path:
find the right posting lists to insert to, decode them, add the new
items, and re-encode.

That sounds fairly dangerous ... is any user-defined code involved in
those decisions?

This record format would be higher-level, in the sense that we would not
store the physical copy of the compressed posting list as it was formed
originally. The same work would be done at WAL replay. As the code
stands, it will produce exactly the same result, but that's not
guaranteed if we make bugfixes to the code later, and a master and
standby are running different minor versions. There's not necessarily
anything wrong with that, but it's something to keep in mind.

Version skew would be a hazard too, all right. I think it's important
that WAL replay be a pretty mechanical, predictable process.

regards, tom lane


#8 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#7)
Re: HEAD seems to generate larger WAL regarding GIN index

On 03/17/2014 04:33 PM, Tom Lane wrote:

Heikki Linnakangas <hlinnakangas@vmware.com> writes:

2. Instead of storing the new compressed posting list in the WAL record,
store only the new item pointers added to the page. WAL replay would
then have to duplicate the work done in the main insertion code path:
find the right posting lists to insert to, decode them, add the new
items, and re-encode.

That sounds fairly dangerous ... is any user-defined code involved in
those decisions?

No.

This record format would be higher-level, in the sense that we would not
store the physical copy of the compressed posting list as it was formed
originally. The same work would be done at WAL replay. As the code
stands, it will produce exactly the same result, but that's not
guaranteed if we make bugfixes to the code later, and a master and
standby are running different minor versions. There's not necessarily
anything wrong with that, but it's something to keep in mind.

Version skew would be a hazard too, all right. I think it's important
that WAL replay be a pretty mechanical, predictable process.

Yeah. One particular point to note is that if in one place we do the
more "high level" thing and have WAL replay re-encode the page as it
sees fit, then we can *not* rely on the page being byte-by-byte
identical in other places. Like, in vacuum, where items are deleted.

Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct
the page, instead of making a physical copy of the modified parts. And
_bt_restore_page even inserts the items physically in a different order
than the normal codepath does. So for good or bad, there is some
precedent for this.

The imminent danger I see is if we change the logic on how the items are
divided into posting lists, and end up in a situation where a master
server adds an item to a page, and it just fits, but with the
compression logic the standby version has, it cannot make it fit. As an
escape hatch for that, we could have the WAL replay code try the
compression again, with a larger max. posting list size, if it doesn't
fit at first. And/or always leave something like 10 bytes of free space
on every data page to make up for small differences in the logic.

- Heikki


#9 Robert Haas
robertmhaas@gmail.com
In reply to: Heikki Linnakangas (#8)
Re: HEAD seems to generate larger WAL regarding GIN index

On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct the
page, instead of making a physical copy of the modified parts. And
_bt_restore_page even inserts the items physically in a different order than
the normal codepath does. So for good or bad, there is some precedent for
this.

Yikes.

The imminent danger I see is if we change the logic on how the items are
divided into posting lists, and end up in a situation where a master server
adds an item to a page, and it just fits, but with the compression logic the
standby version has, it cannot make it fit. As an escape hatch for that, we
could have the WAL replay code try the compression again, with a larger max.
posting list size, if it doesn't fit at first. And/or always leave something
like 10 bytes of free space on every data page to make up for small
differences in the logic.

That scares the crap out of me. I don't see any intrinsic problem
with relying on the existing page contents to figure out how to roll
forward, as PageAddItem does; after all, we do FPIs precisely so that
the page is in a known good state when we start. However, I really
think we ought to try hard to make this deterministic in terms of what
the resulting state of the page is; anything else seems like it's
playing with fire, and I bet we'll get burned sooner rather than
later.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#9)
Re: HEAD seems to generate larger WAL regarding GIN index

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct the
page, instead of making a physical copy of the modified parts. And
_bt_restore_page even inserts the items physically in a different order than
the normal codepath does. So for good or bad, there is some precedent for
this.

Yikes.

Yeah. I think it's arguably a bug that _bt_restore_page works like that,
even though we've not been burnt up to now.

The imminent danger I see is if we change the logic on how the items are
divided into posting lists, and end up in a situation where a master server
adds an item to a page, and it just fits, but with the compression logic the
standby version has, it cannot make it fit. As an escape hatch for that, we
could have the WAL replay code try the compression again, with a larger max.
posting list size, if it doesn't fit at first. And/or always leave something
like 10 bytes of free space on every data page to make up for small
differences in the logic.

That scares the crap out of me.

Likewise. Saving some WAL space is *not* worth this kind of risk.

regards, tom lane


#11 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Tom Lane (#10)
Re: HEAD seems to generate larger WAL regarding GIN index

On 03/17/2014 05:35 PM, Tom Lane wrote:

Robert Haas <robertmhaas@gmail.com> writes:

On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

The imminent danger I see is if we change the logic on how the items are
divided into posting lists, and end up in a situation where a master server
adds an item to a page, and it just fits, but with the compression logic the
standby version has, it cannot make it fit. As an escape hatch for that, we
could have the WAL replay code try the compression again, with a larger max.
posting list size, if it doesn't fit at first. And/or always leave something
like 10 bytes of free space on every data page to make up for small
differences in the logic.

That scares the crap out of me.

Likewise. Saving some WAL space is *not* worth this kind of risk.

One fairly good compromise would be to only include the new items, not
the whole modified compressed posting lists, and let the replay logic do the
re-encoding of the posting lists. But also include the cutoff points of
each posting list in the WAL record. That way the replay code would have
no freedom in how it decides to split the items into compressed lists;
that would be fully specified by the WAL record.

Here's a refresher for those who didn't follow the development of the
new page format: The data page basically contains a list of
ItemPointers. The items are compressed, to save disk space. However, to
make random access faster, all the items on the page are not compressed
as one big list. Instead, the big array of items is split into roughly
equal chunks, and each chunk is compressed separately. The chunks are
stored on the page one after another. (The chunks are called "posting
lists" in the code, the struct is called GinPostingListData)

The compression is completely deterministic (each item is stored as a
varbyte-encoded delta from the previous item), but there are no hard
rules on how the items on the page ought to be divided into the posting
lists. Currently, the code tries to maintain a max size of 256 bytes per
list - but it will cope with any size it finds on disk. This is where
the danger lies, where we could end up with a different physical page
after WAL replay, if we just include the new items in the WAL record.
The WAL replay might decide to split the items into posting lists
differently than was originally done. (as the code stands, it would
always make the same decision, completely deterministically, but that
might change in a minor version if we're not careful)
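
To make the encoding concrete, here is a minimal sketch of varbyte delta
encoding, using plain 64-bit keys instead of the real ItemPointerData
representation; this is an illustration only, not the actual
ginCompressPostingList() code:

#include <stdint.h>

/* Encode 'val' as a variable-length byte sequence; returns bytes written. */
static int
encode_varbyte(uint64_t val, unsigned char *buf)
{
	int			n = 0;

	while (val > 0x7F)
	{
		/* low 7 bits of data, high bit set to say "more bytes follow" */
		buf[n++] = (unsigned char) ((val & 0x7F) | 0x80);
		val >>= 7;
	}
	buf[n++] = (unsigned char) val;		/* last byte has the high bit clear */
	return n;
}

/* Encode a sorted array of keys, each as a delta from the previous key. */
static int
encode_deltas(const uint64_t *keys, int nkeys, unsigned char *buf)
{
	uint64_t	prev = 0;
	int			len = 0;
	int			i;

	for (i = 0; i < nkeys; i++)
	{
		len += encode_varbyte(keys[i] - prev, buf + len);
		prev = keys[i];
	}
	return len;
}

Because each delta depends on the previous item, inserting into the middle of
a segment forces that whole segment to be decoded and re-encoded, while
appending items with small deltas keeps the encoding compact.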

We can tie WAL replay's hands about that, if we include a list of items
that form the posting lists in the WAL record. That adds some bloat,
compared to only including the new items, but not too much. (and we
still only need to do that for posting lists following the first modified one.)

Alexander, would you like to give that a shot, or will I?

- Heikki


#12 Fujii Masao
masao.fujii@gmail.com
In reply to: Alvaro Herrera (#5)
Re: HEAD seems to generate larger WAL regarding GIN index

On Mon, Mar 17, 2014 at 10:44 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:

Fujii Masao escribió:

On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:

That could be optimized, but I figured we can live with it, thanks to the
fastupdate feature. Fastupdate allows amortizing that cost over several
insertions. But of course, you explicitly disabled that...

Let me know if you want me to write a patch addressing this issue.

Yeah, I really want you to address this problem! That's definitely useful
for every user who disables the FASTUPDATE option for some reason.

Users that disable FASTUPDATE, in my experience, do so because their
stock work_mem is way too high and GIN searches become too slow due to
having to scan too large a list.

Yes. Another reason that I've heard from users so far is that
the size of a GIN index with FASTUPDATE=off is likely to be smaller
than that with FASTUPDATE=on.

I think it might make sense to invest
a modest amount of time in getting FASTUPDATE to be sized completely
differently from today -- perhaps base it on a hardcoded factor of
BLCKSZ, rather than work_mem. Or, if we really need to make it
configurable, then let it have its own parameter.

I prefer to have the parameter. When users create multiple GIN indexes
for various uses, they might want to use different pending-list thresholds
for each index. So, a GIN index parameter might be better than a GUC.

Regards,

--
Fujii Masao


#13 Jesper Krogh
jesper@krogh.cc
In reply to: Heikki Linnakangas (#2)
Re: HEAD seems to generate larger WAL regarding GIN index

On 15/03/14 20:27, Heikki Linnakangas wrote:

That said, I didn't expect the difference to be quite that big when
you're appending to the end of the table. When the new entries go to
the end of the posting lists, you only need to recompress and WAL-log
the last posting list, which is max 256 bytes long. But I guess that's
still a lot more WAL than in the old format.

That could be optimized, but I figured we can live with it, thanks to
the fastupdate feature. Fastupdate allows amortizing that cost over
several insertions. But of course, you explicitly disabled that...

In a concurrent update environment, fastupdate as it is in 9.2 is not
really useful. It may let you batch up insertions, but you have no
control over who ends up paying the debt. Doubling the amount of WAL
from GIN indexing would be pretty tough for us: in 9.2 we generate
roughly 1TB of WAL per day, and keep it
for some weeks to be able to do PITR. The WAL is mainly due to
GIN index updates as new data is added and needs to be searchable by
users. We do run gzip, which cuts it down to 25-30% before keeping it
around for too long, but doubling this is going to be a migration challenge.

If fastupdate could be made to work in an environment where we both
have users searching the index and manually updating it, and 4+ backend
processes updating the index concurrently, then it would be a good
benefit to gain.

The GIN index currently contains 70+ million records with an average
tsvector of 124 terms.

--
Jesper .. trying to add some real-world info.



#14 Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Heikki Linnakangas (#11)
1 attachment(s)
Re: HEAD seems to generate larger WAL regarding GIN index

I came up with the attached patch, to reduce the WAL volume of GIN
insertions. It became fairly large, but I guess that's not too
surprising, as the old WAL-logging method was basically to dump the whole
page into the WAL record. This is now a lot more fine-grained and smarter. I
separated constructing the WAL record from copying the changes back to
the disk page, which IMHO is a readability improvement even though it's
more code.

There are two parts to this patch:

* leafRepackItems has been rewritten. The previous coding basically
searched for the first modified item, and decoded and re-encoded
everything on the page after that. Now it tries harder to avoid
re-encoding segments that are still reasonably sized (between 128 and
384 bytes, with the target for new segments being 256 bytes). This ought
to make random updates faster as a bonus, but I didn't performance test
that.

* Track more carefully which segments on the page have been modified.
The in-memory structure used to manipulate a page now keeps an action
code for each segment, indicating if the segment is completely new,
deleted, or replaced with new content, or if just some new items have
been added to it. These same actions are WAL-logged, and replayed in the
redo routine.
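
For what it's worth, here is a rough sketch of how an ADDITEMS action might be
replayed, using the existing GIN helper functions. It assumes the usual
PostgreSQL and GIN internal headers, omits page bookkeeping, and is not the
actual redo code:

/* Decode the on-disk segment, merge in the item pointers carried by the
 * WAL record, and re-encode the result as a single posting list. */
static GinPostingList *
replay_additems_sketch(GinPostingList *oldseg,
					   ItemPointer newitems, int nnewitems)
{
	ItemPointer	olditems;
	ItemPointer	merged;
	int			nolditems;
	int			nmerged;
	int			npacked;
	GinPostingList *newseg;

	olditems = ginPostingListDecode(oldseg, &nolditems);
	merged = ginMergeItemPointers(olditems, nolditems,
								  newitems, nnewitems, &nmerged);

	/* Use a generous size limit so the merged list stays in one segment,
	 * mirroring the segment the master produced. */
	newseg = ginCompressPostingList(merged, nmerged, BLCKSZ, &npacked);
	Assert(npacked == nmerged);
	return newseg;
}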

This brings the WAL volume back to the same ballpark as 9.3. Or better,
depending on the operation.

Fujii, Alexander, how does this look to you?

- Heikki

Attachments:

gin-more-compact-wal-1.patch (text/x-diff)
diff --git a/src/backend/access/gin/gindatapage.c b/src/backend/access/gin/gindatapage.c
index 313d53c..56b456e 100644
--- a/src/backend/access/gin/gindatapage.c
+++ b/src/backend/access/gin/gindatapage.c
@@ -22,17 +22,17 @@
 #include "utils/rel.h"
 
 /*
- * Size of the posting lists stored on leaf pages, in bytes. The code can
- * deal with any size, but random access is more efficient when a number of
- * smaller lists are stored, rather than one big list.
- */
-#define GinPostingListSegmentMaxSize 256
-
-/*
- * Existing posting lists smaller than this are recompressed, when inserting
- * new items to page.
+ * Min, Max and Target size of posting lists stored on leaf pages, in bytes.
+ *
+ * The code can deal with any size, but random access is more efficient when
+ * a number of smaller lists are stored, rather than one big list. If a
+ * posting list would become larger than Max size as a result of insertions,
+ * it is split into two. If a posting list would be smaller than minimum
+ * size, it is merged with the next posting list.
  */
-#define GinPostingListSegmentMinSize 192
+#define GinPostingListSegmentMaxSize 384
+#define GinPostingListSegmentTargetSize 256
+#define GinPostingListSegmentMinSize 128
 
 /*
  * At least this many items fit in a GinPostingListSegmentMaxSize-bytes
@@ -55,12 +55,32 @@ typedef struct
 	dlist_node *lastleft;		/* last segment on left page */
 	int			lsize;			/* total size on left page */
 	int			rsize;			/* total size on right page */
+
+	bool		oldformat;		/* page is in pre-9.4 format on disk */
 } disassembledLeaf;
 
 typedef struct
 {
 	dlist_node	node;		/* linked list pointers */
 
+	/*-------------
+	 * 'action' indicates the status of this in-memory segment, compared to
+	 * what's on disk. It is one of the GIN_SEGMENT_* action codes:
+	 *
+	 * UNMODIFIED	no changes
+	 * DELETED		the segment is to be removed. 'seg' and 'items' are
+	 *				ignored
+	 * INSERT		this is a completely new segment
+	 * REPLACE		this replaces an existing segment with new content
+	 * ADDITEMS		like REPLACE, but we track in detail what items have
+	 *				been added to this segment, in 'modifieditems'
+	 *-------------
+	 */
+	char		action;
+
+	ItemPointerData *modifieditems;
+	int			nmodifieditems;
+
 	/*
 	 * The following fields represent the items in this segment.
 	 * If 'items' is not NULL, it contains a palloc'd array of the items
@@ -72,8 +92,6 @@ typedef struct
 	GinPostingList *seg;
 	ItemPointer items;
 	int			nitems;			/* # of items in 'items', if items != NULL */
-
-	bool		modified;		/* is this segment on page already? */
 } leafSegmentInfo;
 
 static ItemPointer dataLeafPageGetUncompressed(Page page, int *nitems);
@@ -87,9 +105,9 @@ static bool leafRepackItems(disassembledLeaf *leaf, ItemPointer remaining);
 static bool addItemsToLeaf(disassembledLeaf *leaf, ItemPointer newItems, int nNewItems);
 
 
-static void dataPlaceToPageLeafRecompress(Buffer buf,
-							  disassembledLeaf *leaf,
-							  XLogRecData **prdata);
+static XLogRecData *constructLeafRecompressWALData(Buffer buf,
+							   disassembledLeaf *leaf);
+static void dataPlaceToPageLeafRecompress(Buffer buf, disassembledLeaf *leaf);
 static void dataPlaceToPageLeafSplit(Buffer buf,
 						 disassembledLeaf *leaf,
 						 ItemPointerData lbound, ItemPointerData rbound,
@@ -563,15 +581,21 @@ dataPlaceToPageLeaf(GinBtree btree, Buffer buf, GinBtreeStack *stack,
 	if (!needsplit)
 	{
 		/*
-		 * Great, all the items fit on a single page. Write the segments to
-		 * the page, and WAL-log appropriately.
+		 * Great, all the items fit on a single page. Construct a WAL record
+		 * describing the changes we made, and write the segments back to
+		 * the page.
 		 *
 		 * Once we start modifying the page, there's no turning back. The
 		 * caller is responsible for calling END_CRIT_SECTION() after writing
 		 * the WAL record.
 		 */
+		MemoryContextSwitchTo(oldCxt);
+		if (RelationNeedsWAL(btree->index))
+			*prdata = constructLeafRecompressWALData(buf, leaf);
+		else
+			*prdata = NULL;
 		START_CRIT_SECTION();
-		dataPlaceToPageLeafRecompress(buf, leaf, prdata);
+		dataPlaceToPageLeafRecompress(buf, leaf);
 
 		if (append)
 			elog(DEBUG2, "appended %d new items to block %u; %d bytes (%d to go)",
@@ -619,8 +643,6 @@ dataPlaceToPageLeaf(GinBtree btree, Buffer buf, GinBtreeStack *stack,
 						break;
 				}
 
-				 /* don't consider segments moved to right as unmodified */
-				lastleftinfo->modified = true;
 				leaf->lsize -= segsize;
 				leaf->rsize += segsize;
 				leaf->lastleft = dlist_prev_node(&leaf->segments, leaf->lastleft);
@@ -716,14 +738,15 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 				/* Removing an item never increases the size of the segment */
 				if (npacked != ncleaned)
 					elog(ERROR, "could not fit vacuumed posting list");
+				seginfo->action = GIN_SEGMENT_REPLACE;
 			}
 			else
 			{
 				seginfo->seg = NULL;
 				seginfo->items = NULL;
+				seginfo->action = GIN_SEGMENT_DELETE;
 			}
 			seginfo->nitems = ncleaned;
-			seginfo->modified = true;
 
 			removedsomething = true;
 		}
@@ -748,10 +771,33 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 	 */
 	if (removedsomething)
 	{
-		XLogRecData *payloadrdata;
+		XLogRecData *payloadrdata = NULL;
+		bool		modified;
 
+		/*
+		 * Make sure we have a palloc'd copy of all segments, after the
+		 * first segment that is modified. (dataPlaceToPageLeafRecompress
+		 * requires this).
+		 */
+		modified = false;
+		dlist_foreach(iter, &leaf->segments)
+		{
+			leafSegmentInfo *seginfo = dlist_container(leafSegmentInfo, node, iter.cur);
+			if (seginfo->action != GIN_SEGMENT_UNMODIFIED)
+				modified = true;
+			if (modified && seginfo->action != GIN_SEGMENT_DELETE)
+			{
+				int			segsize = SizeOfGinPostingList(seginfo->seg);
+				GinPostingList *tmp = (GinPostingList *) palloc(segsize);
+				memcpy(tmp, seginfo->seg, segsize);
+				seginfo->seg = tmp;
+			}
+		}
+
+		if (RelationNeedsWAL(indexrel))
+			payloadrdata = constructLeafRecompressWALData(buffer, leaf);
 		START_CRIT_SECTION();
-		dataPlaceToPageLeafRecompress(buffer, leaf, &payloadrdata);
+		dataPlaceToPageLeafRecompress(buffer, leaf);
 
 		MarkBufferDirty(buffer);
 
@@ -778,96 +824,168 @@ ginVacuumPostingTreeLeaf(Relation indexrel, Buffer buffer, GinVacuumState *gvs)
 }
 
 /*
+ * Construct a ginxlogRecompressDataLeaf record representing the changes
+ * in *leaf.
+ */
+static XLogRecData *
+constructLeafRecompressWALData(Buffer buf, disassembledLeaf *leaf)
+{
+	int			nmodified = 0;
+	char	   *walbufbegin;
+	char	   *walbufend;
+	XLogRecData *rdata;
+	dlist_iter	iter;
+	int			segno;
+	ginxlogRecompressDataLeaf *recompress_xlog;
+
+	/* Count the modified segments */
+	dlist_foreach(iter, &leaf->segments)
+	{
+		leafSegmentInfo *seginfo = dlist_container(leafSegmentInfo, node,
+												   iter.cur);
+
+		if (seginfo->action != GIN_SEGMENT_UNMODIFIED)
+			nmodified++;
+	}
+
+	walbufbegin = palloc(
+		sizeof(ginxlogRecompressDataLeaf) +
+		BLCKSZ + 			/* max size needed to hold the segment data */
+		nmodified * 2 +		/* (segno + action) per action */
+		sizeof(XLogRecData));
+	walbufend = walbufbegin;
+
+	recompress_xlog = (ginxlogRecompressDataLeaf *) walbufend;
+	walbufend += sizeof(ginxlogRecompressDataLeaf);
+
+	recompress_xlog->nactions = nmodified;
+
+	segno = 0;
+	dlist_foreach(iter, &leaf->segments)
+	{
+		leafSegmentInfo *seginfo = dlist_container(leafSegmentInfo, node, iter.cur);
+		int			segsize = 0;
+		int			datalen;
+		uint8		action = seginfo->action;
+
+		if (action == GIN_SEGMENT_UNMODIFIED)
+		{
+			segno++;
+			continue;
+		}
+
+		if (action != GIN_SEGMENT_DELETE)
+			segsize = SizeOfGinPostingList(seginfo->seg);
+
+		/*
+		 * If storing the uncompressed list of added item pointers would take
+		 * more space than storing the compressed segment as is, do that
+		 * instead.
+		 */
+		if (action == GIN_SEGMENT_ADDITEMS &&
+			seginfo->nmodifieditems * sizeof(ItemPointerData) > segsize)
+		{
+			action = GIN_SEGMENT_REPLACE;
+		}
+
+		*((uint8 *) (walbufend++)) = segno;
+		*(walbufend++) = action;
+
+		switch (action)
+		{
+			case GIN_SEGMENT_DELETE:
+				datalen = 0;
+				break;
+
+			case GIN_SEGMENT_ADDITEMS:
+				datalen = seginfo->nmodifieditems * sizeof(ItemPointerData);
+				*((uint16 *) walbufend) = seginfo->nmodifieditems;
+				memcpy(walbufend + sizeof(uint16), seginfo->modifieditems, datalen);
+				datalen += sizeof(uint16);
+				break;
+
+			case GIN_SEGMENT_INSERT:
+			case GIN_SEGMENT_REPLACE:
+				datalen = SHORTALIGN(segsize);
+				memcpy(walbufend, seginfo->seg, segsize);
+				break;
+
+			default:
+				elog(ERROR, "unexpected GIN leaf action %d", action);
+		}
+		walbufend += datalen;
+
+		if (action != GIN_SEGMENT_INSERT)
+			segno++;
+	}
+
+	rdata = (XLogRecData *) MAXALIGN(walbufend);
+	rdata->buffer = buf;
+	rdata->buffer_std = TRUE;
+	rdata->data = walbufbegin;
+	rdata->len = walbufend - walbufbegin;
+	rdata->next = NULL;
+
+	return rdata;
+}
+
+/*
  * Assemble a disassembled posting tree leaf page back to a buffer.
  *
  * *prdata is filled with WAL information about this operation. The caller
  * is responsible for inserting to the WAL, along with any other information
  * about the operation that triggered this recompression.
  *
- * NOTE: The segment pointers can point directly to the same buffer, with
- * the limitation that any earlier segment must not overlap with an original,
- * later segment. In other words, some segments may point the original buffer
- * as long as you don't make any segments larger. Currently, leafRepackItems
- * satisies this rule because it rewrites all segments after the first
- * modified one, and vacuum can only make segments shorter.
+ * NOTE: The segment pointers must not point directly to the same buffer,
+ * except for segments that have not been modified, and none of the preceding
+ * segments have been modified either.
  */
 static void
-dataPlaceToPageLeafRecompress(Buffer buf, disassembledLeaf *leaf,
-							  XLogRecData **prdata)
+dataPlaceToPageLeafRecompress(Buffer buf, disassembledLeaf *leaf)
 {
 	Page		page = BufferGetPage(buf);
 	char	   *ptr;
 	int			newsize;
-	int			unmodifiedsize;
 	bool		modified = false;
 	dlist_iter	iter;
+	int			segsize;
+
 	/*
-	 * these must be static so they can be returned to caller (no pallocs
-	 * since we're in a critical section!)
+	 * If the page was in pre-9.4 format before, convert the header, and
+	 * force all segments to be copied to the page whether they were modified
+	 * or not.
 	 */
-	static ginxlogRecompressDataLeaf recompress_xlog;
-	static XLogRecData rdata[2];
+	if (!GinPageIsCompressed(page))
+	{
+		Assert(leaf->oldformat);
+		GinPageSetCompressed(page);
+		GinPageGetOpaque(page)->maxoff = InvalidOffsetNumber;
+		modified = true;
+	}
 
 	ptr = (char *) GinDataLeafPageGetPostingList(page);
 	newsize = 0;
-	unmodifiedsize = 0;
-	modified = false;
 	dlist_foreach(iter, &leaf->segments)
 	{
 		leafSegmentInfo *seginfo = dlist_container(leafSegmentInfo, node, iter.cur);
-		int			segsize;
 
-		if (seginfo->modified)
+		if (seginfo->action != GIN_SEGMENT_UNMODIFIED)
 			modified = true;
 
-		/*
-		 * Nothing to do with empty segments, except keep track if they've been
-		 * modified
-		 */
-		if (seginfo->seg == NULL)
-		{
-			Assert(seginfo->items == NULL);
-			continue;
-		}
+		if (seginfo->action != GIN_SEGMENT_DELETE)
+			segsize = SizeOfGinPostingList(seginfo->seg);
+		else
+			segsize = 0;
 
-		segsize = SizeOfGinPostingList(seginfo->seg);
+		if (modified)
+			memcpy(ptr, seginfo->seg, segsize);
 
-		if (!modified)
-			unmodifiedsize += segsize;
-		else
-		{
-			/*
-			 * Use memmove rather than memcpy, in case the segment points
-			 * to the same buffer
-			 */
-			memmove(ptr, seginfo->seg, segsize);
-		}
 		ptr += segsize;
 		newsize += segsize;
 	}
+
 	Assert(newsize <= GinDataLeafMaxContentSize);
 	GinDataLeafPageSetPostingListSize(page, newsize);
-
-	/* Reset these in case the page was in pre-9.4 format before */
-	GinPageSetCompressed(page);
-	GinPageGetOpaque(page)->maxoff = InvalidOffsetNumber;
-
-	/* Put WAL data */
-	recompress_xlog.length = (uint16) newsize;
-	recompress_xlog.unmodifiedsize = unmodifiedsize;
-
-	rdata[0].buffer = InvalidBuffer;
-	rdata[0].data = (char *) &recompress_xlog;
-	rdata[0].len = offsetof(ginxlogRecompressDataLeaf, newdata);
-	rdata[0].next = &rdata[1];
-
-	rdata[1].buffer = buf;
-	rdata[1].buffer_std = TRUE;
-	rdata[1].data = ((char *) GinDataLeafPageGetPostingList(page)) + unmodifiedsize;
-	rdata[1].len = newsize - unmodifiedsize;
-	rdata[1].next = NULL;
-
-	*prdata = rdata;
 }
 
 /*
@@ -914,9 +1032,12 @@ dataPlaceToPageLeafSplit(Buffer buf, disassembledLeaf *leaf,
 		seginfo = dlist_container(leafSegmentInfo, node, node);
 		segsize = SizeOfGinPostingList(seginfo->seg);
 
-		memcpy(ptr, seginfo->seg, segsize);
-		ptr += segsize;
-		lsize += segsize;
+		if (seginfo->action != GIN_SEGMENT_DELETE)
+		{
+			memcpy(ptr, seginfo->seg, segsize);
+			ptr += segsize;
+			lsize += segsize;
+		}
 	}
 	Assert(lsize == leaf->lsize);
 	GinDataLeafPageSetPostingListSize(lpage, lsize);
@@ -932,9 +1053,12 @@ dataPlaceToPageLeafSplit(Buffer buf, disassembledLeaf *leaf,
 		seginfo = dlist_container(leafSegmentInfo, node, node);
 		segsize = SizeOfGinPostingList(seginfo->seg);
 
-		memcpy(ptr, seginfo->seg, segsize);
-		ptr += segsize;
-		rsize += segsize;
+		if (seginfo->action != GIN_SEGMENT_DELETE)
+		{
+			memcpy(ptr, seginfo->seg, segsize);
+			ptr += segsize;
+			rsize += segsize;
+		}
 
 		if (!dlist_has_next(&leaf->segments, node))
 			break;
@@ -1195,14 +1319,15 @@ disassembleLeaf(Page page)
 		{
 			leafSegmentInfo *seginfo = palloc(sizeof(leafSegmentInfo));
 
+			seginfo->action = GIN_SEGMENT_UNMODIFIED;
 			seginfo->seg = seg;
 			seginfo->items = NULL;
 			seginfo->nitems = 0;
-			seginfo->modified = false;
 			dlist_push_tail(&leaf->segments, &seginfo->node);
 
 			seg = GinNextPostingListSegment(seg);
 		}
+		leaf->oldformat = false;
 	}
 	else
 	{
@@ -1218,14 +1343,15 @@ disassembleLeaf(Page page)
 
 		seginfo = palloc(sizeof(leafSegmentInfo));
 
+		seginfo->action = GIN_SEGMENT_REPLACE;
 		seginfo->seg = NULL;
 		seginfo->items = palloc(nuncompressed * sizeof(ItemPointerData));
 		memcpy(seginfo->items, uncompressed, nuncompressed * sizeof(ItemPointerData));
 		seginfo->nitems = nuncompressed;
-		/* make sure we rewrite this to disk */
-		seginfo->modified = true;
 
 		dlist_push_tail(&leaf->segments, &seginfo->node);
+
+		leaf->oldformat = true;
 	}
 
 	return leaf;
@@ -1247,6 +1373,7 @@ addItemsToLeaf(disassembledLeaf *leaf, ItemPointer newItems, int nNewItems)
 	ItemPointer nextnew = newItems;
 	int			newleft = nNewItems;
 	bool		modified = false;
+	leafSegmentInfo *newseg;
 
 	/*
 	 * If the page is completely empty, just construct one new segment to
@@ -1254,14 +1381,12 @@ addItemsToLeaf(disassembledLeaf *leaf, ItemPointer newItems, int nNewItems)
 	 */
 	if (dlist_is_empty(&leaf->segments))
 	{
-		/* create a new segment for the new entries */
-		leafSegmentInfo *seginfo = palloc(sizeof(leafSegmentInfo));
-
-		seginfo->seg = NULL;
-		seginfo->items = newItems;
-		seginfo->nitems = nNewItems;
-		seginfo->modified = true;
-		dlist_push_tail(&leaf->segments, &seginfo->node);
+		newseg = palloc(sizeof(leafSegmentInfo));
+		newseg->seg = NULL;
+		newseg->items = newItems;
+		newseg->nitems = nNewItems;
+		newseg->action = GIN_SEGMENT_INSERT;
+		dlist_push_tail(&leaf->segments, &newseg->node);
 		return true;
 	}
 
@@ -1303,15 +1428,51 @@ addItemsToLeaf(disassembledLeaf *leaf, ItemPointer newItems, int nNewItems)
 		if (!cur->items)
 			cur->items = ginPostingListDecode(cur->seg, &cur->nitems);
 
+		/*
+		 * Fast path for the important special case that we're appending to
+		 * the end of the page: don't let the last segment on the page grow
+		 * larger than the target, create a new segment before that happens.
+		 */
+		if (!dlist_has_next(&leaf->segments, iter.cur) &&
+			ginCompareItemPointers(&cur->items[cur->nitems - 1], &nextnew[0]) > 0 &&
+			cur->seg != NULL &&
+			SizeOfGinPostingList(cur->seg) >= GinPostingListSegmentTargetSize)
+		{
+			newseg = palloc(sizeof(leafSegmentInfo));
+			newseg->seg = NULL;
+			newseg->items = nextnew;
+			newseg->nitems = nthis;
+			newseg->action = GIN_SEGMENT_INSERT;
+			dlist_push_tail(&leaf->segments, &newseg->node);
+			modified = true;
+			break;
+		}
+
 		tmpitems = ginMergeItemPointers(cur->items, cur->nitems,
 										nextnew, nthis,
 										&ntmpitems);
 		if (ntmpitems != cur->nitems)
 		{
+			/*
+			 * If there are no duplicates, track the added items so that we
+			 * can emit a compact ADDITEMS WAL record later on. (it doesn't
+			 * seem worth re-checking which items were duplicates, if there
+			 * were any)
+			 */
+			if (ntmpitems == nthis + cur->nitems &&
+				cur->action == GIN_SEGMENT_UNMODIFIED)
+			{
+				cur->action = GIN_SEGMENT_ADDITEMS;
+				cur->modifieditems = nextnew;
+				cur->nmodifieditems = nthis;
+			}
+			else
+				cur->action = GIN_SEGMENT_REPLACE;
+
 			cur->items = tmpitems;
 			cur->nitems = ntmpitems;
 			cur->seg = NULL;
-			modified = cur->modified = true;
+			modified = true;
 		}
 
 		nextnew += nthis;
@@ -1331,133 +1492,160 @@ addItemsToLeaf(disassembledLeaf *leaf, ItemPointer newItems, int nNewItems)
  * If all items fit, *remaining is set to invalid.
  *
  * Returns true if the page has to be split.
- *
- * XXX: Actually, this re-encodes all segments after the first one that was
- * modified, to make sure the new segments are all more or less of equal
- * length. That's unnecessarily aggressive; if we've only added a single item
- * to one segment, for example, we could re-encode just that single segment,
- * as long as it's still smaller than, say, 2x the normal segment size.
  */
 static bool
 leafRepackItems(disassembledLeaf *leaf, ItemPointer remaining)
 {
-	dlist_iter	iter;
-	ItemPointer allitems;
-	int			nallitems;
 	int			pgused = 0;
-	bool		needsplit;
-	int			totalpacked;
-	dlist_mutable_iter miter;
-	dlist_head	recode_list;
-	int			nrecode;
-	bool		recoding;
+	bool		needsplit = false;
+	dlist_iter	iter;
+	int			segsize;
+	leafSegmentInfo *nextseg;
+	int			npacked;
+	bool		modified;
+	dlist_node *cur_node;
+	dlist_node *next_node;
 
 	ItemPointerSetInvalid(remaining);
 
 	/*
-	 * Find the first segment that needs to be re-coded. Move all segments
-	 * that need recoding to separate list, and count the total number of
-	 * items in them. Also, add up the number of bytes in unmodified segments
-	 * (pgused).
+	 * cannot use dlist_foreach_modify here because we insert adjacent
+	 * items while iterating.
 	 */
-	dlist_init(&recode_list);
-	recoding = false;
-	nrecode = 0;
-	pgused = 0;
-	dlist_foreach_modify(miter, &leaf->segments)
+	for (cur_node = dlist_head_node(&leaf->segments);
+		 cur_node != NULL;
+		 cur_node = next_node)
 	{
-		leafSegmentInfo *seginfo = dlist_container(leafSegmentInfo, node, miter.cur);
+		leafSegmentInfo *seginfo = dlist_container(leafSegmentInfo, node, cur_node);
 
-		/* If the segment was modified, re-encode it */
-		if (seginfo->modified || seginfo->seg == NULL)
-			recoding = true;
-		/*
-		 * Also re-encode abnormally small or large segments. (Vacuum can
-		 * leave behind small segments, and conversion from pre-9.4 format
-		 * can leave behind large segments).
-		 */
-		else if (SizeOfGinPostingList(seginfo->seg) < GinPostingListSegmentMinSize)
-			recoding = true;
-		else if (SizeOfGinPostingList(seginfo->seg) > GinPostingListSegmentMaxSize)
-			recoding = true;
+		if (dlist_has_next(&leaf->segments, cur_node))
+			next_node = dlist_next_node(&leaf->segments, cur_node);
+		else
+			next_node = NULL;
 
-		if (recoding)
+		/* Compress the posting list, if necessary */
+		if (seginfo->action != GIN_SEGMENT_DELETE)
 		{
-			if (!seginfo->items)
-				seginfo->items = ginPostingListDecode(seginfo->seg,
-													  &seginfo->nitems);
-			nrecode += seginfo->nitems;
-
-			dlist_delete(miter.cur);
-			dlist_push_tail(&recode_list, &seginfo->node);
-		}
-		else
-			pgused += SizeOfGinPostingList(seginfo->seg);
-	}
+			if (seginfo->seg == NULL)
+			{
+				if (seginfo->nitems > GinPostingListSegmentMaxSize)
+					npacked = 0; /* no chance that it would fit. */
+				else
+				{
+					seginfo->seg = ginCompressPostingList(seginfo->items,
+														  seginfo->nitems,
+												   GinPostingListSegmentMaxSize,
+														  &npacked);
+				}
+				if (npacked != seginfo->nitems)
+				{
+					/*
+					 * Too large. Compress again to the target size, and create
+					 * a new segment to represent the remaining items. The new
+					 * segment is inserted after this one, so it will be
+					 * processed in the next iteration of this loop.
+					 */
+					if (seginfo->seg)
+						pfree(seginfo->seg);
+					seginfo->seg = ginCompressPostingList(seginfo->items,
+														  seginfo->nitems,
+											   GinPostingListSegmentTargetSize,
+														  &npacked);
+					if (seginfo->action != GIN_SEGMENT_INSERT)
+						seginfo->action = GIN_SEGMENT_REPLACE;
+
+					nextseg = palloc(sizeof(leafSegmentInfo));
+					nextseg->action = GIN_SEGMENT_INSERT;
+					nextseg->seg = NULL;
+					nextseg->items = &seginfo->items[npacked];
+					nextseg->nitems = seginfo->nitems - npacked;
+					next_node = &nextseg->node;
+					dlist_insert_after(cur_node, next_node);
+				}
+			}
 
-	if (nrecode == 0)
-		return false; /* nothing to do */
+			/*
+			 * If the segment is very small, merge it with the next
+			 * segment.
+			 */
+			if (SizeOfGinPostingList(seginfo->seg) < GinPostingListSegmentMinSize && next_node)
+			{
+				int		nmerged;
+
+				nextseg = dlist_container(leafSegmentInfo, node, next_node);
+
+				if (seginfo->items == NULL)
+					seginfo->items = ginPostingListDecode(seginfo->seg,
+														  &seginfo->nitems);
+				if (nextseg->items == NULL)
+					nextseg->items = ginPostingListDecode(nextseg->seg,
+														  &nextseg->nitems);
+				nextseg->items =
+					ginMergeItemPointers(seginfo->items, seginfo->nitems,
+										 nextseg->items, nextseg->nitems,
+										 &nmerged);
+				Assert(nmerged == seginfo->nitems + nextseg->nitems);
+				nextseg->nitems = nmerged;
+				nextseg->seg = NULL;
+
+				nextseg->action = GIN_SEGMENT_REPLACE;
+				nextseg->modifieditems = NULL;
+				nextseg->nmodifieditems = 0;
+
+				if (seginfo->action == GIN_SEGMENT_INSERT)
+				{
+					dlist_delete(cur_node);
+					continue;
+				}
+				else
+				{
+					seginfo->action = GIN_SEGMENT_DELETE;
+					seginfo->seg = NULL;
+				}
+			}
 
-	/*
-	 * Construct one big array of the items that need to be re-encoded.
-	 */
-	allitems = palloc(nrecode * sizeof(ItemPointerData));
-	nallitems = 0;
-	dlist_foreach(iter, &recode_list)
-	{
-		leafSegmentInfo *seginfo = dlist_container(leafSegmentInfo, node, iter.cur);
-		memcpy(&allitems[nallitems], seginfo->items, seginfo->nitems * sizeof(ItemPointerData));
-		nallitems += seginfo->nitems;
-	}
-	Assert(nallitems == nrecode);
+			seginfo->items = NULL;
+			seginfo->nitems = 0;
+		}
 
-	/*
-	 * Start packing the items into segments. Stop when we have consumed
-	 * enough space to fill both pages, or we run out of items.
-	 */
-	totalpacked = 0;
-	needsplit = false;
-	while (totalpacked < nallitems)
-	{
-		leafSegmentInfo *seginfo;
-		int			npacked;
-		GinPostingList *seg;
-		int			segsize;
+		if (seginfo->action == GIN_SEGMENT_DELETE)
+			continue;
 
-		seg = ginCompressPostingList(&allitems[totalpacked],
-									 nallitems - totalpacked,
-									 GinPostingListSegmentMaxSize,
-									 &npacked);
-		segsize = SizeOfGinPostingList(seg);
+		/*
+		 * OK, we now have a compressed version of this segment ready for
+		 * copying to the page. Did we exceed the size that fits on one page?
+		 */
+		segsize = SizeOfGinPostingList(seginfo->seg);
 		if (pgused + segsize > GinDataLeafMaxContentSize)
 		{
 			if (!needsplit)
 			{
 				/* switch to right page */
 				Assert(pgused > 0);
-				leaf->lastleft = dlist_tail_node(&leaf->segments);
+				leaf->lastleft = dlist_prev_node(&leaf->segments, cur_node);
 				needsplit = true;
 				leaf->lsize = pgused;
 				pgused = 0;
 			}
 			else
 			{
-				/* filled both pages */
-				*remaining = allitems[totalpacked];
+				/*
+				 * Filled both pages. The last segment we constructed did not
+				 * fit.
+				 */
+				*remaining = seginfo->seg->first;
+
+				/*
+				 * remove any remaining segments that did not fit from the list.
+				 */
+				while (dlist_has_next(&leaf->segments, cur_node))
+					dlist_delete(dlist_next_node(&leaf->segments, cur_node));
+				dlist_delete(cur_node);
 				break;
 			}
 		}
 
-		seginfo = palloc(sizeof(leafSegmentInfo));
-		seginfo->seg = seg;
-		seginfo->items = &allitems[totalpacked];
-		seginfo->nitems = npacked;
-		seginfo->modified = true;
-
-		dlist_push_tail(&leaf->segments, &seginfo->node);
-
 		pgused += segsize;
-		totalpacked += npacked;
 	}
 
 	if (!needsplit)
@@ -1471,6 +1659,31 @@ leafRepackItems(disassembledLeaf *leaf, ItemPointer remaining)
 	Assert(leaf->lsize <= GinDataLeafMaxContentSize);
 	Assert(leaf->rsize <= GinDataLeafMaxContentSize);
 
+	modified = false;
+	dlist_foreach(iter, &leaf->segments)
+	{
+		leafSegmentInfo *seginfo = dlist_container(leafSegmentInfo, node,  iter.cur);
+
+		if (!modified && seginfo->action != GIN_SEGMENT_UNMODIFIED)
+		{
+			modified = true;
+		}
+		/*
+		 * Need to make a palloc'd copy of every segment after the first
+		 * modified one, because as we start copying items to the original
+		 * page, we might overwrite an existing segment.
+		 */
+		else if (modified && seginfo->action == GIN_SEGMENT_UNMODIFIED)
+		{
+			GinPostingList *tmp;
+
+			segsize = SizeOfGinPostingList(seginfo->seg);
+			tmp = palloc(segsize);
+			memcpy(tmp, seginfo->seg, segsize);
+			seginfo->seg = tmp;
+		}
+	}
+
 	return needsplit;
 }
 
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index c46f7ac..5681e14 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -145,15 +145,159 @@ ginRedoInsertEntry(Buffer buffer, bool isLeaf, BlockNumber rightblkno, void *rda
 static void
 ginRedoRecompress(Page page, ginxlogRecompressDataLeaf *data)
 {
-	Pointer		segment;
-
-	/* Copy the new data to the right place */
-	segment = ((Pointer) GinDataLeafPageGetPostingList(page))
-		+ data->unmodifiedsize;
-	memcpy(segment, data->newdata, data->length - data->unmodifiedsize);
-	GinDataLeafPageSetPostingListSize(page, data->length);
-	GinPageSetCompressed(page);
-	GinPageGetOpaque(page)->maxoff = InvalidOffsetNumber;
+	int			nactions = data->nactions;
+	int			actionno;
+	int			segno;
+	GinPostingList *oldseg;
+	Pointer		segmentend;
+	char	   *walbuf;
+	int			totalsize;
+
+	/*
+	 * If the page is in pre-9.4 format, convert to new format first.
+	 */
+	if (!GinPageIsCompressed(page))
+	{
+		ItemPointer uncompressed = (ItemPointer) GinDataPageGetData(page);
+		int			nuncompressed = GinPageGetOpaque(page)->maxoff;
+		int			npacked;
+		GinPostingList *plist;
+
+		plist = ginCompressPostingList(uncompressed, nuncompressed,
+									   BLCKSZ, &npacked);
+		Assert(npacked == nuncompressed);
+
+		totalsize = SizeOfGinPostingList(plist);
+
+		memcpy(GinDataLeafPageGetPostingList(page), plist, totalsize);
+		GinDataLeafPageSetPostingListSize(page, totalsize);
+		GinPageSetCompressed(page);
+		GinPageGetOpaque(page)->maxoff = InvalidOffsetNumber;
+	}
+
+	oldseg = GinDataLeafPageGetPostingList(page);
+	segmentend = (Pointer) oldseg + GinDataLeafPageGetPostingListSize(page);
+	segno = 0;
+
+	walbuf = ((char *) data) + sizeof(ginxlogRecompressDataLeaf);
+	for (actionno = 0; actionno < nactions; actionno++)
+	{
+		uint8		a_segno = *((uint8 *) (walbuf++));
+		uint8		a_action = *((uint8 *) (walbuf++));
+		GinPostingList *newseg = NULL;
+		int			newsegsize = 0;
+		ItemPointerData *items = NULL;
+		uint16		nitems = 0;
+		ItemPointerData *olditems;
+		int			nolditems;
+		ItemPointerData *newitems;
+		int			nnewitems;
+		int			segsize;
+		Pointer		segptr;
+		int			szleft;
+
+		/* Extract all the information we need from the WAL record */
+		if (a_action == GIN_SEGMENT_INSERT ||
+			a_action == GIN_SEGMENT_REPLACE)
+		{
+			newseg = (GinPostingList *) walbuf;
+			newsegsize = SizeOfGinPostingList(newseg);
+			walbuf += SHORTALIGN(newsegsize);
+		}
+
+		if (a_action == GIN_SEGMENT_ADDITEMS)
+		{
+			memcpy(&nitems, walbuf, sizeof(uint16));
+			walbuf += sizeof(uint16);
+			items = (ItemPointerData *) walbuf;
+			walbuf += nitems * sizeof(ItemPointerData);
+		}
+
+		/* Skip to the segment that this action concerns */
+		Assert(segno <= a_segno);
+		while (segno < a_segno)
+		{
+			oldseg = GinNextPostingListSegment(oldseg);
+			segno++;
+		}
+
+		/*
+		 * ADDITEMS action is handled like REPLACE, but the new segment to
+		 * replace the old one is reconstructed using the old segment from
+		 * disk and the new items from the WAL record.
+		 */
+		if (a_action == GIN_SEGMENT_ADDITEMS)
+		{
+			int			npacked;
+
+			olditems = ginPostingListDecode(oldseg, &nolditems);
+
+			newitems = ginMergeItemPointers(items, nitems,
+											olditems, nolditems,
+											&nnewitems);
+			Assert(nnewitems == nolditems + nitems);
+
+			newseg = ginCompressPostingList(newitems, nnewitems,
+											BLCKSZ, &npacked);
+			Assert(npacked == nnewitems);
+
+			newsegsize = SizeOfGinPostingList(newseg);
+			a_action = GIN_SEGMENT_REPLACE;
+		}
+
+		segptr = (Pointer) oldseg;
+		if (segptr != segmentend)
+			segsize = SizeOfGinPostingList(oldseg);
+		else
+		{
+			/*
+			 * Positioned after the last existing segment. Only INSERTs
+			 * expected here.
+			 */
+			Assert(a_action == GIN_SEGMENT_INSERT);
+			segsize = 0;
+		}
+		szleft = segmentend - segptr;
+
+		switch (a_action)
+		{
+			case GIN_SEGMENT_DELETE:
+				memmove(segptr, segptr + segsize, szleft - segsize);
+				segmentend -= segsize;
+
+				segno++;
+				break;
+
+			case GIN_SEGMENT_INSERT:
+				/* make room for the new segment */
+				memmove(segptr + newsegsize, segptr, szleft);
+				/* copy the new segment in place */
+				memcpy(segptr, newseg, newsegsize);
+				segmentend += newsegsize;
+				segptr += newsegsize;
+				break;
+
+			case GIN_SEGMENT_REPLACE:
+				/* shift the segments that follow */
+				memmove(segptr + newsegsize,
+						segptr + segsize,
+						szleft - segsize);
+				/* copy the replacement segment in place */
+				memcpy(segptr, newseg, newsegsize);
+				segmentend -= segsize;
+				segmentend += newsegsize;
+				segptr += newsegsize;
+				segno++;
+				break;
+
+			default:
+				elog(PANIC, "unexpected GIN leaf action: %u", a_action);
+		}
+		oldseg = (GinPostingList *) segptr;
+	}
+
+	totalsize = segmentend - (Pointer) GinDataLeafPageGetPostingList(page);
+	GinDataLeafPageSetPostingListSize(page, totalsize);
 }
 
 static void
diff --git a/src/backend/access/rmgrdesc/gindesc.c b/src/backend/access/rmgrdesc/gindesc.c
index 1983bcc..aa60c8d 100644
--- a/src/backend/access/rmgrdesc/gindesc.c
+++ b/src/backend/access/rmgrdesc/gindesc.c
@@ -25,6 +25,57 @@ desc_node(StringInfo buf, RelFileNode node, BlockNumber blkno)
 					 node.spcNode, node.dbNode, node.relNode, blkno);
 }
 
+static void
+desc_recompress_leaf(StringInfo buf, ginxlogRecompressDataLeaf *insertData)
+{
+	int			i;
+	char	   *walbuf = ((char *) insertData) + sizeof(ginxlogRecompressDataLeaf);
+
+	appendStringInfo(buf, " %d segments:", (int) insertData->nactions);
+
+	for (i = 0; i < insertData->nactions; i++)
+	{
+		uint8		a_segno = *((uint8 *) (walbuf++));
+		uint8		a_action = *((uint8 *) (walbuf++));
+		uint16		nitems = 0;
+		int			newsegsize = 0;
+
+		if (a_action == GIN_SEGMENT_INSERT ||
+			a_action == GIN_SEGMENT_REPLACE)
+		{
+			newsegsize = SizeOfGinPostingList((GinPostingList *) walbuf);
+			walbuf += SHORTALIGN(newsegsize);
+		}
+
+		if (a_action == GIN_SEGMENT_ADDITEMS)
+		{
+			memcpy(&nitems, walbuf, sizeof(uint16));
+			walbuf += sizeof(uint16);
+			walbuf += nitems * sizeof(ItemPointerData);
+		}
+
+		switch(a_action)
+		{
+			case GIN_SEGMENT_ADDITEMS:
+				appendStringInfo(buf, " %d (add %d items)", a_segno, nitems);
+				break;
+			case GIN_SEGMENT_DELETE:
+				appendStringInfo(buf, " %d (delete)", a_segno);
+				break;
+			case GIN_SEGMENT_INSERT:
+				appendStringInfo(buf, " %d (insert)", a_segno);
+				break;
+			case GIN_SEGMENT_REPLACE:
+				appendStringInfo(buf, " %d (replace)", a_segno);
+				break;
+			default:
+				appendStringInfo(buf, " %d unknown action %d ???", a_segno, a_action);
+				/* cannot decode unrecognized actions further */
+				return;
+		}
+	}
+}
+
 void
 gin_desc(StringInfo buf, uint8 xl_info, char *rec)
 {
@@ -70,9 +121,10 @@ gin_desc(StringInfo buf, uint8 xl_info, char *rec)
 					ginxlogRecompressDataLeaf *insertData =
 						(ginxlogRecompressDataLeaf *) payload;
 
-					appendStringInfo(buf, " unmodified: %u length: %u (compressed)",
-									 insertData->unmodifiedsize,
-									 insertData->length);
+					if (xl_info & XLR_BKP_BLOCK(0))
+						appendStringInfo(buf, " (full page image)");
+					else
+						desc_recompress_leaf(buf, insertData);
 				}
 				else
 				{
@@ -105,9 +157,10 @@ gin_desc(StringInfo buf, uint8 xl_info, char *rec)
 				ginxlogVacuumDataLeafPage *xlrec = (ginxlogVacuumDataLeafPage *) rec;
 				appendStringInfoString(buf, "Vacuum data leaf page, ");
 				desc_node(buf, xlrec->node, xlrec->blkno);
-				appendStringInfo(buf, " unmodified: %u length: %u",
-							 xlrec->data.unmodifiedsize,
-							 xlrec->data.length);
+				if (xl_info & XLR_BKP_BLOCK(0))
+					appendStringInfo(buf, " (full page image)");
+				else
+					desc_recompress_leaf(buf, &xlrec->data);
 			}
 			break;
 		case XLOG_GIN_DELETE_PAGE:
diff --git a/src/include/access/gin_private.h b/src/include/access/gin_private.h
index a7beed1..fd83886 100644
--- a/src/include/access/gin_private.h
+++ b/src/include/access/gin_private.h
@@ -406,7 +406,8 @@ typedef struct
 	 * whose split this insertion finishes. As BlockIdData[2] (beware of adding
 	 * fields before this that would make them not 16-bit aligned)
 	 *
-	 * 2. one of the following structs, depending on tree type.
+	 * 2. a ginxlogInsertEntry or ginxlogRecompressDataLeaf struct, depending
+	 * on tree type.
 	 *
 	 * NB: the below structs are only 16-bit aligned when appended to a
 	 * ginxlogInsert struct! Beware of adding fields to them that require
@@ -421,17 +422,36 @@ typedef struct
 	IndexTupleData tuple;	/* variable length */
 } ginxlogInsertEntry;
 
+
 typedef struct
 {
-	uint16		length;
-	uint16		unmodifiedsize;
+	uint16		nactions;
 
-	/* compressed segments, variable length */
-	char		newdata[1];
+	/* Variable number of 'actions' follow */
 } ginxlogRecompressDataLeaf;
 
 typedef struct
 {
+	uint8		segno;		/* segment this action applies to */
+	char		type;		/* action type (see below) */
+
+	/*
+	 * Action-specific data follows. For INSERT and REPLACE actions that is a
+	 * GinPostingList struct. For ADDITEMS, a uint16 for the number of items
+	 * added, followed by the items themselves as ItemPointers. DELETE actions
+	 * have no further data.
+	 */
+} ginxlogSegmentAction;
+
+/* Action types */
+#define GIN_SEGMENT_UNMODIFIED	0	/* no action (not used in WAL records) */
+#define GIN_SEGMENT_DELETE		1	/* a whole segment is removed */
+#define GIN_SEGMENT_INSERT		2	/* a whole segment is added */
+#define GIN_SEGMENT_REPLACE		3	/* a segment is replaced */
+#define GIN_SEGMENT_ADDITEMS	4	/* items are added to existing segment */
+
+typedef struct
+{
 	OffsetNumber offset;
 	PostingItem newitem;
 } ginxlogInsertDataInternal;
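
To make the new WAL format concrete: each action in a ginxlogRecompressDataLeaf
record costs roughly the number of bytes summed up below. This is only an
illustrative sketch (the helper function is not part of the patch); it follows
the layout described in the ginxlogSegmentAction comment above and the parsing
loop in ginRedoRecompress(), and shows why an ADDITEMS action is much cheaper
to log than a whole replacement segment.

/*
 * Illustrative sketch only, not part of the patch: approximate WAL payload
 * per action, following the ginxlogSegmentAction layout and the parsing
 * done in ginRedoRecompress() and desc_recompress_leaf().
 */
static Size
ginSegmentActionPayloadSize(uint8 action, uint16 nitems, GinPostingList *seg)
{
	Size		sz = 2 * sizeof(uint8);		/* segno byte + action byte */

	switch (action)
	{
		case GIN_SEGMENT_DELETE:
			/* no further data */
			break;
		case GIN_SEGMENT_INSERT:
		case GIN_SEGMENT_REPLACE:
			/* the whole new/replacement segment is logged */
			sz += SHORTALIGN(SizeOfGinPostingList(seg));
			break;
		case GIN_SEGMENT_ADDITEMS:
			/* only the added items are logged; the old segment stays on disk */
			sz += sizeof(uint16) + nitems * sizeof(ItemPointerData);
			break;
	}
	return sz;
}
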
#15Robert Haas
robertmhaas@gmail.com
In reply to: Jesper Krogh (#13)
Re: HEAD seems to generate larger WAL regarding GIN index

On Thu, Mar 20, 2014 at 1:12 PM, Jesper Krogh <jesper@krogh.cc> wrote:

On 15/03/14 20:27, Heikki Linnakangas wrote:

That said, I didn't expect the difference to be quite that big when you're
appending to the end of the table. When the new entries go to the end of the
posting lists, you only need to recompress and WAL-log the last posting
list, which is max 256 bytes long. But I guess that's still a lot more WAL
than in the old format.

That could be optimized, but I figured we can live with it, thanks to the
fastupdate feature. Fastupdate allows amortizing that cost over several
insertions. But of course, you explicitly disabled that...

In a concurrent update environment, fastupdate as it is in 9.2 is not really
useful. It may be that you can batch up insertions, but you have no control
over who ends up paying the debt. Doubling the amount of WAL from
GIN indexing would be pretty tough for us; in 9.2 we generate roughly 1TB of
WAL per day, keeping it for some weeks to be able to do PITR. The WAL is mainly
due to GIN-index updates as new data is added and needs to be searchable by
users. We do run gzip, which cuts it down to 25-30% before keeping it for long,
but doubling this is going to be a migration challenge.

If fastupdate could be made to work in an environment where we both have
users searching the index and manually updating it, and 4+ backend processes
updating the index concurrently, then it would be a real benefit.

The GIN index currently contains 70+ million records with an average
tsvector of 124 terms.

Should we try to install some hack around fastupdate for 9.4? I fear
the divergence between reasonable values of work_mem and reasonable
sizes for that list is only going to continue to get bigger. I'm sure
there's somebody out there who has work_mem = 16GB, and stuff like
263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the
appeal of large values.
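
For concreteness, one possible shape for such a hack, as a sketch only (the
per-index knob below is invented, not existing code): trigger pending-list
cleanup once the list exceeds a dedicated limit, using work_mem only as an
upper bound, so a huge work_mem no longer implies an equally huge pending list.

/*
 * Discussion sketch only, not actual code: decide when the pending list
 * needs to be flushed into the main GIN structure. "pending_list_limit_kb"
 * is an invented knob; when unset (<= 0) the behavior falls back to work_mem.
 */
static bool
ginPendingListNeedsCleanup(GinMetaPageData *metadata, int pending_list_limit_kb)
{
	long		threshold_kb = work_mem;

	if (pending_list_limit_kb > 0)
		threshold_kb = Min(threshold_kb, (long) pending_list_limit_kb);

	return metadata->nPendingPages * GIN_PAGE_FREESIZE > threshold_kb * 1024L;
}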

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#16Fujii Masao
masao.fujii@gmail.com
In reply to: Robert Haas (#15)
Re: HEAD seems to generate larger WAL regarding GIN index

On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:

On Thu, Mar 20, 2014 at 1:12 PM, Jesper Krogh <jesper@krogh.cc> wrote:

On 15/03/14 20:27, Heikki Linnakangas wrote:

That said, I didn't expect the difference to be quite that big when you're
appending to the end of the table. When the new entries go to the end of the
posting lists, you only need to recompress and WAL-log the last posting
list, which is max 256 bytes long. But I guess that's still a lot more WAL
than in the old format.

That could be optimized, but I figured we can live with it, thanks to the
fastupdate feature. Fastupdate allows amortizing that cost over several
insertions. But of course, you explicitly disabled that...

In a concurrent update environment, fastupdate as it is in 9.2 is not really
useful. It may be that you can batch up insertions, but you have no control
over who ends up paying the debt. Doubling the amount of WAL from
GIN indexing would be pretty tough for us; in 9.2 we generate roughly 1TB of
WAL per day, keeping it for some weeks to be able to do PITR. The WAL is mainly
due to GIN-index updates as new data is added and needs to be searchable by
users. We do run gzip, which cuts it down to 25-30% before keeping it for long,
but doubling this is going to be a migration challenge.

If fastupdate could be made to work in an environment where we both have
users searching the index and manually updating it, and 4+ backend processes
updating the index concurrently, then it would be a real benefit.

The GIN index currently contains 70+ million records with an average
tsvector of 124 terms.

Should we try to install some hack around fastupdate for 9.4? I fear
the divergence between reasonable values of work_mem and reasonable
sizes for that list is only going to continue to get bigger. I'm sure
there's somebody out there who has work_mem = 16GB, and stuff like
263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the
appeal of large values.

Controlling the pending-list size threshold only by a GUC doesn't
seem reasonable. Users may want to increase the threshold only for
GIN indexes that are updated heavily, and decrease it otherwise. So
I think it's better to add a new storage parameter for GIN indexes to control
the threshold, or both a storage parameter and a GUC.
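
As an illustration only (the parameter name, default, and bounds below are
invented), registering such a per-index storage parameter could look roughly
like this:

/*
 * Sketch only: a per-GIN-index storage parameter for the pending-list
 * threshold, settable via CREATE/ALTER INDEX ... WITH (...). The name and
 * bounds are invented; -1 would mean "fall back to the GUC (or work_mem)".
 */
static void
ginRegisterPendingListLimitOption(void)
{
	add_int_reloption(RELOPT_KIND_GIN,
					  "pending_list_limit",
					  "Maximum size of the GIN pending list, in kilobytes",
					  -1,			/* default: use GUC / work_mem */
					  64,			/* minimum, in kB */
					  INT_MAX);		/* maximum */
}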

Regards,

--
Fujii Masao

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Tom Lane
tgl@sss.pgh.pa.us
In reply to: Fujii Masao (#16)
Re: HEAD seems to generate larger WAL regarding GIN index

Fujii Masao <masao.fujii@gmail.com> writes:

On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:

Should we try to install some hack around fastupdate for 9.4? I fear
the divergence between reasonable values of work_mem and reasonable
sizes for that list is only going to continue to get bigger. I'm sure
there's somebody out there who has work_mem = 16GB, and stuff like
263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the
appeal of large values.

Controlling the pending-list size threshold only by a GUC doesn't
seem reasonable. Users may want to increase the threshold only for
GIN indexes that are updated heavily, and decrease it otherwise. So
I think it's better to add a new storage parameter for GIN indexes to control
the threshold, or both a storage parameter and a GUC.

Yeah, -1 for a GUC. A GIN-index-specific storage parameter seems more
appropriate. Or we could just hard-wire some maximum limit. Is it
really likely that users would trouble to set such a parameter if it
existed?

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers