adding new pages bulky way

Started by Victor Y. Yegorov, over 20 years ago. 13 messages.

I need your advice.

For the on-disk bitmap index, I maintain a list of TIDs.

TIDs are stored in pages as an array; each page's opaque data holds an array
of bits indicating whether the corresponding TID has been deleted and should
be skipped during the scan.

Pages that contain the TID list are organized in extents; each extent has 2^N
pages, where N is the extent's number (i.e. the 2nd extent occupies 4 pages).
Given the number of TIDs that fit into one page and a TID's sequential
number, I can easily calculate:
- the extent number the TID belongs to;
- the page offset inside that extent, and;
- the TID's slot in the page.

At the moment, I store the BlockNumber of each extent's first page in the
metapage and allocate all pages belonging to an extent sequentially. I need
to do this to minimize the number of page reads when searching for a TID in
the list: I'll need to read at most 1 page to find the TID at a given
position during the scan. I hope the idea is clear.
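
For illustration, the arithmetic is roughly this (a sketch only;
TIDS_PER_PAGE and the metapage array of extent start blocks are hypothetical
names, not actual code):

#include <stdint.h>

typedef uint32_t BlockNumber;            /* as in PostgreSQL */

#define TIDS_PER_PAGE 512                /* assumed per-page TID capacity */

/* Extent k holds 2^k pages, so extents 0..k-1 hold 2^k - 1 pages total. */
static BlockNumber
tid_list_block(uint64_t seqno, const BlockNumber *extent_start)
{
    uint64_t    page = seqno / TIDS_PER_PAGE;   /* global page index */
    unsigned    k = 0;

    /* find k such that 2^k - 1 <= page < 2^(k+1) - 1 */
    while (((UINT64_C(1) << (k + 1)) - 1) <= page)
        k++;

    /* the TID's slot within its page is seqno % TIDS_PER_PAGE */
    return extent_start[k] + (BlockNumber) (page - ((UINT64_C(1) << k) - 1));
}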

This also means that while an extent's pages are being added this way, no
other pages can be added to the index. And the higher the extent's number,
the more time it takes to allocate all of its pages.

The question is: I understand that allocating pages this way is really ugly.
Is there some API that would allow allocating N pages in bulk?
Maybe this is a known problem that has already been solved before?
Any other ideas?

Thanks in advance!

--

Victor Y. Yegorov

#2 Alvaro Herrera
alvherre@surnet.cl
In reply to: Victor Y. Yegorov (#1)
Re: adding new pages bulky way

On Mon, Jun 06, 2005 at 10:59:04PM +0300, Victor Y. Yegorov wrote:

The question is: I understand that allocating pages this way is really ugly.
Is there some API that would allow allocating N pages in bulk?
Maybe this is a known problem that has already been solved before?
Any other ideas?

I don't understand your question. What's the problem with holding the
extend lock for the index relation while you extend it? Certainly you
want only a single process creating a new extent in the index, right?

I guess the question is when are the extents created, and what
concurrency do you expect from that operation.
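
In code terms, I mean something like this (untested sketch; locking page 0 is
the current idiom for the relation extension lock, and alloc_extent /
extent_pages are invented names):

static void
alloc_extent(Relation rel, int extent_pages, Buffer *bufs)
{
    int         i;

    LockPage(rel, 0, ExclusiveLock);
    for (i = 0; i < extent_pages; i++)
        bufs[i] = ReadBuffer(rel, P_NEW);   /* consecutive block numbers */
    UnlockPage(rel, 0, ExclusiveLock);
}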

--
Alvaro Herrera (<alvherre[a]surnet.cl>)
"La naturaleza, tan fr�gil, tan expuesta a la muerte... y tan viva"

#3 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Victor Y. Yegorov (#1)
Re: adding new pages bulky way

"Victor Y. Yegorov" <viy@mits.lv> writes:

[ scheme involving a predetermined layout of index pages ]

The question is: I understand that allocating pages this way is really ugly.
Is there some API that would allow allocating N pages in bulk?

Why bother? Just write each page when you need to --- there's no law
that says you must use P_NEW. The hash index type does something pretty
similar, IIRC.

regards, tom lane

#4 Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Victor Y. Yegorov (#1)
Re: adding new pages bulky way

"Tom Lane" <tgl@sss.pgh.pa.us> writes

Why bother? Just write each page when you need to --- there's no law
that says you must use P_NEW. The hash index type does something pretty
similar, IIRC.

Are there any performance benefits if we have an mdextend_several_pages()
function in md.c? Then the relation could be extended in bulk. In my
understanding, if we write

    write(fd, buffer, BLCKSZ * 10);

instead of

    for (i = 0; i < 10; i++)
        write(fd, buffer, BLCKSZ);

This will reduce some of the file-system journaling cost on journaling file
systems. Of course, the price we pay is holding the relation extension lock
for a longer time.
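
As a standalone illustration of the difference (plain POSIX, nothing
PostgreSQL-specific; the file name and page count are arbitrary):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192
#define NPAGES 10

int
main(void)
{
    char   *buf = calloc(NPAGES, BLCKSZ);   /* zero-filled pages */
    int     fd = open("extend_test.dat", O_CREAT | O_WRONLY | O_TRUNC, 0600);
    int     i;

    if (fd < 0 || buf == NULL)
        return 1;

    /* bulk: one system call, one journal update for the size change */
    write(fd, buf, (size_t) BLCKSZ * NPAGES);

    /* page-at-a-time: NPAGES system calls and journal updates */
    for (i = 0; i < NPAGES; i++)
        write(fd, buf, BLCKSZ);

    close(fd);
    free(buf);
    return 0;
}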

Regards,
Qingqing

#5 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Qingqing Zhou (#4)
Re: adding new pages bulky way

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:

Are there any performance benefits if we have an mdextend_several_pages()
function in md.c?

I very seriously doubt that there would be *any* win, and I doubt even
more that it could possibly be worth the klugery you'd have to do to
make it happen. Bear in mind that index access methods are two API
layers away from md.c --- how will you translate this into something
that makes sense in the context of bufmgr's API?

regards, tom lane

#6 Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Victor Y. Yegorov (#1)
Re: adding new pages bulky way

"Tom Lane" <tgl@sss.pgh.pa.us> writes

I very seriously doubt that there would be *any* win, and I doubt even
more that it could possibly be worth the klugery you'd have to do to
make it happen. Bear in mind that index access methods are two API
layers away from md.c --- how will you translate this into something
that makes sense in the context of bufmgr's API?

Index access or heap access doesn't matter. The imaginary plan is like this:

-- change 1 --
/* md.c */
mdextend()
{
    mdextend_several_pages();
    add_pages_to_FSM();
}

-- change 2 --
/*
 * Any place that holds the relation extension lock
 */

if (needLock)
    LockPage(relation, 0, ExclusiveLock);

/* ADD: check again here; someone may have extended while we waited */
if (GetPageWithFreeSpace() != InvalidBlockNumber)
{
    UnlockPage(relation, 0, ExclusiveLock);
    /* use the page found by the FSM instead of extending */
}
else
{
    /* I have to do the extension */
    buffer = ReadBuffer(relation, P_NEW);
}

The above is quite like how we handle XLogFlush() currently.

#7 Victor Y. Yegorov
viy@mits.lv
In reply to: Tom Lane (#3)
Re: adding new pages bulky way

* Tom Lane <tgl@sss.pgh.pa.us> [07.06.2005 07:59]:

Why bother? Just write each page when you need to --- there's no law
that says you must use P_NEW.

This means two things:
1) I cannot mix P_NEW and exact-number ReadBuffer() calls;
2) thus, I have to track the next block number myself.

Is that so?

BTW, is there any difference in buffer lookup speed when block numbers are
allocated out of order versus sequentially (i.e. with P_NEW)?

--

Victor Y. Yegorov

#8 Alvaro Herrera
alvherre@surnet.cl
In reply to: Victor Y. Yegorov (#7)
Re: adding new pages bulky way

On Tue, Jun 07, 2005 at 07:52:57PM +0300, Victor Y. Yegorov wrote:

* Tom Lane <tgl@sss.pgh.pa.us> [07.06.2005 07:59]:

Why bother? Just write each page when you need to --- there's no law
that says you must use P_NEW.

This means two things:
1) I cannot mix P_NEW and exact-number ReadBuffer() calls;

Huh, why? You need to grab the relation extension lock
(LockRelationForExtension in CVS tip).

--
Alvaro Herrera (<alvherre[a]surnet.cl>)
"[PostgreSQL] is a great group; in my opinion it is THE best open source
development communities in existence anywhere." (Lamar Owen)

#9 Victor Y. Yegorov
viy@mits.lv
In reply to: Alvaro Herrera (#8)
Re: adding new pages bulky way

* Alvaro Herrera <alvherre@surnet.cl> [08.06.2005 00:39]:

Huh, why? You need to grab the relation extension lock
(LockRelationForExtension in CVS tip).

Really? I didn't know that.

Consider:
1) I add 2 pages to a newly-created relation,
using P_NEW as the BlockNumber;
2) then I do LockRelationForExtension; ReadBuffer(135); and
UnlockRelationForExtension.

What BlockNumber will be assigned to the buffer if I call
ReadBuffer(P_NEW) now? 136?

BTW, is it OK to say "BlockNumber is assigned to buffer"?

--

Victor Y. Yegorov

#10 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Victor Y. Yegorov (#9)
Re: adding new pages bulky way

"Victor Y. Yegorov" <viy@mits.lv> writes:

* Alvaro Herrera <alvherre@surnet.cl> [08.06.2005 00:39]:

Huh, why? You need to grab the relation extension lock
(LockRelationForExtension in CVS tip).

Really? I didn't know that.

Consider:
1) I add 2 pages to a newly-created relation,
using P_NEW as the BlockNumber;
2) then I do LockRelationForExtension; ReadBuffer(135); and
UnlockRelationForExtension.

As things are set up at the moment, you really should not use
P_NEW at all unless you hold the relation extension lock.

(At least not for ordinary heap relations. An index access
method could have its own rules about how to add blocks to
the relation --- hash does for instance.)
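
For instance, hash-style preallocation of a whole extent might look roughly
like this (untested sketch, bm_* names invented; the point is that writing
the extent's last block pushes out the logical EOF, and the filesystem makes
the intervening blocks read as zeroes):

static void
bm_alloc_extent(Relation rel, BlockNumber first, BlockNumber npages)
{
    char        zerobuf[BLCKSZ];

    MemSet(zerobuf, 0, sizeof(zerobuf));
    smgrextend(rel->rd_smgr, first + npages - 1, zerobuf, rel->rd_istemp);
}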

This is all pretty ugly in my view, and so I would not stand
opposed to ideas about a cleaner design ...

regards, tom lane

#11 Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Victor Y. Yegorov (#1)
Re: adding new pages bulky way

"Tom Lane" <tgl@sss.pgh.pa.us> writes

I very seriously doubt that there would be *any* win

I did a quick proof-of-concept implementation to test non-concurrent batch
insertion; here are the results:

Environment:
- Pg 8.0.1
- NTFS / IDE

-- batch 16 pages extension --
test=# insert into t select * from t;
INSERT 0 131072
Time: 4167.000 ms
test=# insert into t select * from t;
INSERT 0 262144
Time: 8111.000 ms
test=# insert into t select * from t;
INSERT 0 524288
Time: 16444.000 ms
test=# insert into t select * from t;
INSERT 0 1048576
Time: 41980.000 ms

-- batch 32 pages extension --
test=# insert into t select * from t;
INSERT 0 131072
Time: 4086.000 ms
test=# insert into t select * from t;
INSERT 0 262144
Time: 7861.000 ms
test=# insert into t select * from t;
INSERT 0 524288
Time: 16403.000 ms
test=# insert into t select * from t;
INSERT 0 1048576
Time: 41290.000 ms

-- batch 64 pages extension --
test=# insert into t select * from t;
INSERT 0 131072
Time: 4236.000 ms
test=# insert into t select * from t;
INSERT 0 262144
Time: 8202.000 ms
test=# insert into t select * from t;
INSERT 0 524288
Time: 17265.000 ms
test=# insert into t select * from t;
INSERT 0 1048576
Time: 44063.000 ms

-- batch 128 pages extension --
test=# insert into t select * from t;
INSERT 0 131072
Time: 4256.000 ms
test=# insert into t select * from t;
INSERT 0 262144
Time: 8242.000 ms
test=# insert into t select * from t;
INSERT 0 524288
Time: 17375.000 ms
test=# insert into t select * from t;
INSERT 0 1048576
Time: 43854.000 ms

-- one page extension --
test=# insert into t select * from t;
INSERT 0 131072
Time: 4496.000 ms
test=# insert into t select * from t;
INSERT 0 262144
Time: 9013.000 ms
test=# insert into t select * from t;
INSERT 0 524288
Time: 19508.000 ms
test=# insert into t select * from t;
INSERT 0 1048576
Time: 49962.000 ms

The benefits are there: roughly a 10% improvement if we select a good batch
size. The explanation: if a batch insertion needs 6400 new pages, it
originally pays for the write() plus file-system journaling 6400 times; now
it pays only 6400/64 = 100 times (though each write takes longer). Also,
since writes of different sizes have different costs, it seems 32 pages is
the optimal batch size for my machine.

What I did includes:

(1) md.c
Modify the function mdextend() to:
- extend 64 pages each time;
- after the extension, let the FSM be aware of it (change the FSM a little
bit so it can report free space for an empty page too)

(2) bufmgr.c
Make ReadPage(+empty_page) treat an empty page differently from a non-empty
one, to avoid an unnecessary read for new pages, that is:

if (!empty_page)
    smgrread(reln->rd_smgr, blockNum, (char *) MAKE_PTR(bufHdr->data));
else
    /* only for heap pages, and a race could exist here ... */
    PageInit((char *) MAKE_PTR(bufHdr->data), BLCKSZ, 0);

(3) hio.c
In RelationGetBufferForTuple():
- pass the correct "empty_page" parameter to ReadPage(), according to the
query result from the FSM.
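
Roughly, the md.c side of change (1) is this (a sketch with invented names;
the real code goes through fd.c and handles segment boundaries, errors,
etc.):

#define BATCH_PAGES 64          /* pages added per physical extension */

static void
mdextend_batch(int fd, BlockNumber firstblk, const char *firstpage)
{
    /* static to keep the 512KB batch buffer off the stack */
    static char batch[BATCH_PAGES * BLCKSZ];

    memcpy(batch, firstpage, BLCKSZ);                   /* caller's page */
    memset(batch + BLCKSZ, 0, sizeof(batch) - BLCKSZ);  /* empty pages */

    if (lseek(fd, (off_t) firstblk * BLCKSZ, SEEK_SET) < 0 ||
        write(fd, batch, sizeof(batch)) != (ssize_t) sizeof(batch))
        elog(ERROR, "batched extension failed");

    /* let the FSM know about the BATCH_PAGES - 1 trailing empty pages */
    add_pages_to_FSM(firstblk + 1, BATCH_PAGES - 1);    /* invented helper */
}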

Regards,
Qingqing

#12 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Qingqing Zhou (#11)
Re: adding new pages bulky way

"Qingqing Zhou" <zhouqq@cs.toronto.edu> writes:

What I did include:

Make ReadPage(+empty_page) treat an empty page differently from a non-empty
one, to avoid an unnecessary read for new pages, that is:

In other words, if FSM is wrong you will overwrite valid data? No
thanks ... this is guaranteed to fail under simple concurrent usage,
let alone any more interesting scenarios like FSM being actually out of
date.

regards, tom lane

#13 Qingqing Zhou
zhouqq@cs.toronto.edu
In reply to: Victor Y. Yegorov (#1)
Re: adding new pages bulky way

"Tom Lane" <tgl@sss.pgh.pa.us> writes

In other words, if FSM is wrong you will overwrite valid data? No
thanks ... this is guaranteed to fail under simple concurrent usage,
let alone any more interesting scenarios like FSM being actually out of
date.

You are welcome ;-). The FSM race/corruption is a real problem. Maybe we
could put this on the TODO list so a better solution can be found, since the
performance benefits are clearly there.

Regards,
Qingqing