What I'm working on
I am working on a patch to:
remove oidname, oidint2, and oidint4
allow the bootstrap code to create multi-key indexes
change procname index to procname, nargs, argtypes
remove many sequential scans of system tables and use cache
change the API to low-level heap and cache functions to more clearly
return tuples or copies of tuples
I have completed all but the last two items, and should finish this
week.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
I am working on a patch to:
remove oidname, oidint2, and oidint4
allow the bootstrap code to create multi-key indexes

Good man...always bugged me that the "old" hacked-in multikey
indexes were there after Vadim let the user create them.
Also, pg_procname index really wanted a multi-key index, but did a
sequential scan of the index after the procname match to simulate it.
That is gone too.
But...returning to Insight as of Sept.1st. Once I get settled
in, I should be able to stay late a couple of evenings and get my
old patches up-to-date.
The only thing I am concerned about is that beta is September 1. I
would rather not dump lots of new patches in after the beta starts. Do
we need to start beta after the 1st? Not sure how to handle this.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
On Sun, 16 Aug 1998, Bruce Momjian wrote:
The only thing I am concerned about is that beta is September 1. I
would rather not dump lots of new patches in after the beta starts. Do
we need to start beta after the 1st? Not sure how to handle this.
Sept 1st is already a delay'd beta as a result of the "summer
holidays", so I'd almost have to say no to starting it after the 1st.
What I'd be curious about, at this point, is what "old" patches
are we looking at?
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
I am working on a patch to:
remove oidname, oidint2, and oidint4
allow the bootstrap code to create multi-key indexes

Good man...always bugged me that the "old" hacked-in multikey
indexes were there after Vadim let the user create them.

But...returning to Insight as of Sept.1st. Once I get settled
in, I should be able to stay late a couple of evenings and get my
old patches up-to-date.
I have been thinking about the blocksize patch, and I now think it is
good we never installed it. I think we need to enable rows to span more
than one block. That is what commercial databases do, and I think this
is a much more general solution to the problem than increasing the block
size.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
But...returning to Insight as of Sept.1st. Once I get settled
in, I should be able to stay late a couple of evenings and get my
old patches up-to-date.

The only thing I am concerned about is that beta is September 1. I
would rather not dump lots of new patches in after the beta starts. Do
we need to start beta after the 1st? Not sure how to handle this.

The variable-block size patch would be nice, but I don't think it is
a big enough feature to hold up the release. Many great things are
already there, Tom's type conversions, your indexing OR's, etc...

Seems best to save my stuff for 6.5. That'll give me time to get
familiar with the code again and up to speed.
Again, I think we need rows to span multiple blocks. That would be
preferred.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
On Sun, 23 Aug 1998, Bruce Momjian wrote:
I am working on a patch to:
remove oidname, oidint2, and oidint4
allow the bootstrap code to create multi-key indexes

Good man...always bugged me that the "old" hacked-in multikey
indexes were there after Vadim let the user create them.

But...returning to Insight as of Sept.1st. Once I get settled
in, I should be able to stay late a couple of evenings and get my
old patches up-to-date.

I have been thinking about the blocksize patch, and I now think it is
good we never installed it. I think we need to enable rows to span more
than one block. That is what commercial databases do, and I think this
is a much more general solution to the problem than increasing the block
size.
Hrmmm...what does one gain over the other though? The way I saw
it (sorry Darren, don't mean to oversimplify it), but making the blocksize
changeable was largely a matter of Darren making sure that all the
dependencies were covered through the code. What is making a row span
multiple blocks going to give us? Truly variable length "blocksizes"?
The blocksize patch allows you to stipulate a different blocksize
at database creation time...actually, thinking about it, I kinda see them
as two inter-related, yet different, functions. If, for instance, I create
a table where the majority of tuples are larger than 8k, but smaller than
12k, so that most of the tuples, in your "vision", span two
blocks...wouldn't being able to increase the blocksize to 12k provide a
performance improvement?
I'm just not sure if I see either/or being mutually exclusive.
The 'row spanning' is great from the perspective that we didn't expect the
size of the tuples to be larger than 8k, while the increase of blocksize
is great from an optimizing perspective. Even having vacuum (or
something similar) reporting that >50% of the records are >$currblocksize
might be cool...
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
Hrmmm...what does one gain over the other though? The way I saw
it (sorry Darren, don't mean to oversimplify it), but making the blocksize
changeable was largely a matter of Darren making sure that all the
dependencies were covered through the code. What is making a row span
multiple blocks going to give us? Truly variable length "blocksizes"?

The blocksize patch allows you to stipulate a different blocksize
at database creation time...actually, thinking about it, I kinda see them
as two inter-related, yet different, functions. If, for instance, I create
a table where the majority of tuples are larger than 8k, but smaller than
12k, so that most of the tuples, in your "vision", span two
blocks...wouldn't being able to increase the blocksize to 12k provide a
performance improvement?

I'm just not sure if I see either/or being mutually exclusive.
The 'row spanning' is great from the perspective that we didn't expect the
size of the tuples to be larger than 8k, while the increase of blocksize
is great from an optimizing perspective. Even having vacuum (or
something similar) reporting that >50% of the records are >$currblocksize
might be cool...
Most filesystem base block sizes are 8k. Making anything larger is not
going to gain much. I don't think we can support block sizes like 12k
because the filesystem is going to sync stuff in 8k chunks.
Seems like we should do the most user-transparent thing and just allow
spanning rows.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
On Sun, 23 Aug 1998, Bruce Momjian wrote:
Most filesystem base block sizes are 8k. Making anything larger is not
going to gain much. I don't think we can support block sizes like 12k
because the filesystem is going to sync stuff in 8k chunks.

Seems like we should do the most user-transparent thing and just allow
spanning rows.
The blocksize patch wasn't a "user-land" feature, it's an admin-level
one...no? The admin sets it at the createdb level...no?
Again, I'm curious as to why either/or is mutually exclusive?
Let's put it this way, from a performance perspective, which one
would provide more? Again, I'm thinking of this from the admin angle, not
user. I create a database whose tuples, in general, exceed 8k. vacuum
kindly tells me this, so, to improve performance, I dump my databases, and
because this is a specialized application, it's on its own file system.
So, I reformat that drive with a larger blocksize, to match the blocksize
I'm about to set my database to (yes, I do do similar to this to optimize
file systems for news, so it isn't too hypothetical)...
Bear in mind, I am not arguing for one of them, I'm arguing for
both of them...unless there is some architectural reason why both can't be
implemented at the same time...?
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
On Sun, 23 Aug 1998, Bruce Momjian wrote:
Most filesystem base block sizes are 8k. Making anything larger is not
going to gain much. I don't think we can support block sizes like 12k
because the filesystem is going to sync stuff in 8k chunks.

Seems like we should do the most user-transparent thing and just allow
spanning rows.

The blocksize patch wasn't a "user-land" feature, it's an admin-level
one...no? The admin sets it at the createdb level...no?
Yes, OK, admin, not user.
Again, I'm curious as to why either/or is mutually exclusive?
Let's put it this way, from a performance perspective, which one
would provide more? Again, I'm thinking of this from the admin angle, not
user. I create a database whose tuples, in general, exceed 8k. vacuum
kindly tells me this, so, to improve performance, I dump my databases, and
because this is a specialized application, its on its own file system.
So, I reformat that drive with a larger blocksize, to match the blocksize
I'm about to set my database to (yes, I do do similar to this to optimize
file systems for news, so it isn't too hypothetical)...

Bear in mind, I am not arguing for one of them, I'm arguing for
both of them...unless there is some architectural reason why both can't be
implemented at the same time...?
Yes, I guess you could have both. I just think the normal user is going
to prefer the span stuff better, but you have a good point. If we had
one, we could buy time getting the other.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
I have been thinking about the blocksize patch, and I now think it is
good we never installed it. I think we need to enable rows to span more
than one block. That is what commercial databases do, and I think this
is a much more general solution to the problem than increasing the block
size.

Hrmmm...what does one gain over the other though? The way I saw
it (sorry Darren, don't mean to oversimplify it), but making the blocksize
changeable was largely a matter of Darren making sure that all the
dependencies were covered through the code. What is making a row span
multiple blocks going to give us? Truly variable length "blocksizes"?
Would theoretically remove the postgres maximum size limit on a tuple and
make it limited by the OS file-size limit.
Right now max-tuple-size and blocksize are the same, with the blocksize
being changeable only at compile-time. With the outdated patch that I
have, this would change to run-time. Would be less important if chaining
existed, but might be a decent stop-gap feature until then.
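As a rough sketch of the distinction above (illustrative only, not code from the patch): the page size is currently the compile-time constant BLCKSZ, and the maximum tuple size follows from it, so the patch would turn that constant into a per-database run-time value. DatabaseBlockSize and PageOverhead below are invented names.

/* Illustrative only -- BLCKSZ is the real compile-time constant in the
 * source tree; DatabaseBlockSize and PageOverhead are invented here. */
#include <stdio.h>

#define BLCKSZ 8192              /* today: fixed when the server is built */
#define PageOverhead 128         /* rough stand-in for page/tuple headers */

/* With a run-time blocksize, this would be read from the database at
 * backend startup instead of being baked into the binary. */
static int DatabaseBlockSize = BLCKSZ;

int
main(void)
{
    /* The max tuple size tracks the block size, so a run-time block size
     * also makes the tuple limit a run-time value. */
    printf("block size %d -> max tuple roughly %d bytes\n",
           DatabaseBlockSize, DatabaseBlockSize - PageOverhead);
    return 0;
}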
I know that Oracle has chaining and they warn that it does have an effect
on performance since a second (or more) tuple fetch has to be done. But
if that's what someone needs for big tuples, that's the price then.
I'll have more opinions after next week.
Darren
On Sun, 23 Aug 1998, Bruce Momjian wrote:
Yes, I guess you could have both. I just think the normal user is going
to prefer the span stuff better, but you have a good point. If we had
one, we could buy time getting the other.
For whomever is implementing the row-span stuff, can something be
added that keeps track of the number of rows that are spanned? ie. if most of
the rows end up spanning blocks, then I would personally like to know that
so that I can look at dumping and reloading the data with a database set
to a higher blocksize...
There *has* to be some overhead, performance wise, in the database
having to keep track of row-spanning, and being able to reduce that, IMHO,
is what I see being able to change the blocksize as doing...
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
There *has* to be some overhead, performance wise, in the database
having to keep track of row-spanning, and being able to reduce that, IMHO,
is what I see being able to change the blocksize as doing...
If both features were present, I would say to increase the blocksize of
the db to the max possible. This would reduce the number of tuples that
are spanned. Each span would require another tuple fetch, so that could
get expensive with each successive span or if every tuple spanned.
But if we stick with 8k blocksizes, people with tuples between 8 and 16k
would get absolutely killed performance-wise. Would make sense for them
to go to 16k blocks where the reading of the extra bytes per block would
be minimal, if anything, compared to the fetching/processing of the next
span(s) to assemble the whole tuple.
In summary, the capability to span would be the next resort after someone
has maxed out their blocksize. Each OS would have a different blocksize
max...an AIX driver breaks when going past 16k...don't know about others.
I'd say make the blocksize a run-time variable and then do the spanning.
Darren
On Sun, 23 Aug 1998, Bruce Momjian wrote:
Yes, I guess you could have both. I just think the normal user is going
to prefer the span stuff better, but you have a good point. If we had
one, we could buy time getting the other.

For whomever is implementing the row-span stuff, can something be
added that keeps track of number of rows that are spanned? ie. if most of
the rows are spanning the rows, then I would personally like to know that
so that I can look at dumping and reloading the data with a database set
to a higher blocksize...

There *has* to be some overhead, performance wise, in the database
having to keep track of row-spanning, and being able to reduce that, IMHO,
is what I see being able to change the blocksize as doing...
Makes sense, though vacuum would presumably make all the blocks
contiguous.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
On Sun, 23 Aug 1998, Stupor Genius wrote:
There *has* to be some overhead, performance wise, in the database
having to keep track of row-spanning, and being able to reduce that, IMHO,
is what I see being able to change the blocksize as doing...

If both features were present, I would say to increase the blocksize of
the db to the max possible. This would reduce the number of tuples that
are spanned. Each span would require another tuple fetch, so that could
get expensive with each successive span or if every tuple spanned.

But if we stick with 8k blocksizes, people with tuples between 8 and 16k
would get absolutely killed performance-wise. Would make sense for them
to go to 16k blocks where the reading of the extra bytes per block would
be minimal, if anything, compared to the fetching/processing of the next
span(s) to assemble the whole tuple.

In summary, the capability to span would be the next resort after someone
has maxed out their blocksize. Each OS would have a different blocksize
max...an AIX driver breaks when going past 16k...don't know about others.
Oh...I like this :) that would give us something that the "big
guys" don't also, no? Bruce?
Can someone clarify something for me? If, for example, we have
the blocksize set to 16k, but the file system block size is 8k, would the OS do
both reads at the same time in order to get the full 16k? I hope someone
can follow this through (unless I'm actually clear), but if we left the
tuples size at 8k fixed, and had that 16k tuple span two rows, do we send
a request to the OS for the one block, then, once we get that back,
determine that we need the next and request that?
Damn, not clear at all...if I'm thinking right, by increasing the
blocksize to 16k, postgres does one read request, while the OS does two.
If we don't, postgres does two read requests while the OS still does two.
Does that make sense?
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
There *has* to be some overhead, performance wise, in the database
having to keep track of row-spanning, and being able to reduce that, IMHO,
is what I see being able to change the blocksize as doing...

If both features were present, I would say to increase the blocksize of
the db to the max possible. This would reduce the number of tuples that
are spanned. Each span would require another tuple fetch, so that could
get expensive with each successive span or if every tuple spanned.

But if we stick with 8k blocksizes, people with tuples between 8 and 16k
would get absolutely killed performance-wise. Would make sense for them
to go to 16k blocks where the reading of the extra bytes per block would
be minimal, if anything, compared to the fetching/processing of the next
span(s) to assemble the whole tuple.

In summary, the capability to span would be the next resort after someone
has maxed out their blocksize. Each OS would have a different blocksize
max...an AIX driver breaks when going past 16k...don't know about others.

I'd say make the blocksize a run-time variable and then do the spanning.
If we could query to find the file system block size at runtime in a
portable way, that would help us pick the best block size, no?
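A minimal sketch of the kind of query being asked about here, assuming a POSIX-style system (this is not from any patch in the thread): statvfs() reports the filesystem's preferred I/O block size and fragment size, though the meaning of the fields varies by platform, so a real patch would still need configure checks.

/* Sketch: ask the OS for the filesystem block size at run time. */
#include <stdio.h>
#include <sys/statvfs.h>

int
main(int argc, char **argv)
{
    struct statvfs vfs;
    const char *path = (argc > 1) ? argv[1] : ".";

    if (statvfs(path, &vfs) != 0)
    {
        perror("statvfs");
        return 1;
    }

    /* f_bsize is the preferred I/O block size, f_frsize the fragment size. */
    printf("%s: preferred block size = %lu, fragment size = %lu\n",
           path, (unsigned long) vfs.f_bsize, (unsigned long) vfs.f_frsize);
    return 0;
}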
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
Oh...I like this :) that would give us something that the "big
guys" don't also, no? Bruce?Can someone clarify something for me? If, for example, we have
the blocksize set to 16k, but the file system size is 8k, would the OS do
both reads at the same time in order to get the full 16k? I hope someone
can follow this through (unless I'm actually clear), but if we left the
tuples size at 8k fixed, and had that 16k tuple span two rows, do we send
a request to the OS for the one block, then, once we get that back,
determine that we need the next and request that?
The filesystem block size really controls how fine-grained the file block
allocation is. It keeps 8k blocks as one contiguous chunk on the disk
(ignoring trailing file fragments, which are blocksize/8 in size).

How the OS does the disk requests is different. It is related to the
base size of a disk block (usually 512 bytes), and whether multiple requests
can be sent to the drive at the same time (tagged queuing?). These are
really not related to the filesystem block size, except that larger
block sizes are made up of larger contiguous disk block groups.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
On Sun, 23 Aug 1998, Bruce Momjian wrote:
On Sun, 23 Aug 1998, Bruce Momjian wrote:
Yes, I guess you could have both. I just think the normal user is going
to prefer the span stuff better, but you have a good point. If we had
one, we could buy time getting the other.

For whomever is implementing the row-span stuff, can something be
added that keeps track of number of rows that are spanned? ie. if most of
the rows are spanning the rows, then I would personally like to know that
so that I can look at dumping and reloading the data with a database set
to a higher blocksize...

There *has* to be some overhead, performance wise, in the database
having to keep track of row-spanning, and being able to reduce that, IMHO,
is what I see being able to change the blocksize as doing...

Makes sense, though vacuum would presumably make all the blocks
contiguous.
Still going to involve two read requests from the postmaster to
the operating system for those two rows...vs one if the tuple doesn't have
to span two blocks...
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
Still going to involve two read requests from the postmaster to
the operating system for those two rows...vs one if the tuple doesn't have
to span two blocks...
Yes, assuming it is not already in our buffer cache.
--
Bruce Momjian | 830 Blythe Avenue
maillist@candle.pha.pa.us | Drexel Hill, Pennsylvania 19026
+ If your life is a hard drive, | (610) 353-9879(w)
+ Christ can be your backup. | (610) 853-3000(h)
On Sun, 23 Aug 1998, Bruce Momjian wrote:
There *has* to be some overhead, performance wise, in the database
having to keep track of row-spanning, and being able to reduce that, IMHO,
is what I see being able to change the blocksize as doing...

If both features were present, I would say to increase the blocksize of
the db to the max possible. This would reduce the number of tuples that
are spanned. Each span would require another tuple fetch, so that could
get expensive with each successive span or if every tuple spanned.

But if we stick with 8k blocksizes, people with tuples between 8 and 16k
would get absolutely killed performance-wise. Would make sense for them
to go to 16k blocks where the reading of the extra bytes per block would
be minimal, if anything, compared to the fetching/processing of the next
span(s) to assemble the whole tuple.

In summary, the capability to span would be the next resort after someone
has maxed out their blocksize. Each OS would have a different blocksize
max...an AIX driver breaks when going past 16k...don't know about others.

I'd say make the blocksize a run-time variable and then do the spanning.
If we could query to find the file system block size at runtime in a
portable way, that would help us pick the best block size, no?
That doesn't sound too safe to me...what if I run out of disk
space on file system A (16k blocksize) and move one of the databases to
file system B (8k blocksize)? If it auto-detects at run time, how is that
going to affect the tables? Now my tuple size just dropp'd to 8k, but the
tables were using 16k tuples...
Setting this should, I think, be a conscious decision on the
admin's part, unless, of course, there is nothing in the tables themselves
that is "hard coded" at 8k tuples, and it's purely in the server? If it
is just in the server, then this would be cool, cause then I wouldn't have
to dump/reload if I moved to a better tuned file system...just move the
files :)
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
On Sun, 23 Aug 1998, Bruce Momjian wrote:
Oh...I like this :) that would give us something that the "big
guys" don't also, no? Bruce?Can someone clarify something for me? If, for example, we have
the blocksize set to 16k, but the file system size is 8k, would the OS do
both reads at the same time in order to get the full 16k? I hope someone
can follow this through (unless I'm actually clear), but if we left the
tuples size at 8k fixed, and had that 16k tuple span two rows, do we send
a request to the OS for the one block, then, once we get that back,
determine that we need the next and request that?

The filesystem block size really controls how fine-grained the file block
allocation is. It keeps 8k blocks as one contiguous chunk on the disk
(ignoring trailing file fragments, which are blocksize/8 in size).

How the OS does the disk requests is different. It is related to the
base size of a disk block (usually 512 bytes), and whether multiple requests
can be sent to the drive at the same time (tagged queuing?). These are
really not related to the filesystem block size, except that larger
block sizes are made up of larger contiguous disk block groups.
Okay...but, what I was more trying to get at was that, ignoring
the operating system level right now, a 16k tuple that has to span two 8k
'rows' is going to require:
1 read for the first half
processing to determine that a second half is required
1 read for the second half
A 16k tuple stored in a single 16k row will require:
1 read for the whole thing
considering all the streamlining that you've been working at, it
seems illogical to advocate a two read system only, when we can have a two
read system that gives us a base solution, with a one read system for
those that wish to reduce that overhead...no?
Marc G. Fournier
Systems Administrator @ hub.org
primary: scrappy@hub.org secondary: scrappy@{freebsd|postgresql}.org
In summary, the capability to span would be the next resort after someone
has maxed out their blocksize. Each OS would have a different blocksize
max...an AIX driver breaks when going past 16k...don't know about others.

I'd say make the blocksize a run-time variable and then do the spanning.
If we could query to find the file system block size at runtime in a
portable way, that would help us pick the best block size, no?
No, I would really not suggest that. Having one default page size is really the
best thing. If we have chaining, making the default 4k is probably a good thing.
Most commercial DBMS's have a tuneable blocksize with a 2k or 4k default.
You usually have tables so different that specifying one optimal blocksize would not
be possible.
The chained-row performance hit could be evened out by implementing good read-ahead.
The consecutive pages would already be read when they are needed. A good OS
does this anyway.
Doing tests with dd shows that the blocksize does have a performance impact
when doing IO on a file system file. Even if the OS does the read ahead in an
appropriate blocksize, there are a lot more system calls with small block sizes.
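As an illustration of that point (a sketch in the spirit of the dd tests, not something from this thread): the little program below reads a file with a caller-chosen buffer size and counts the read() calls, so smaller block sizes show up directly as more system calls for the same amount of data.

/* Sketch: read a file with a given buffer size and count the read()
 * system calls -- the effect the dd tests above are measuring. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
    if (argc != 3)
    {
        fprintf(stderr, "usage: %s file bufsize\n", argv[0]);
        return 1;
    }

    size_t bufsize = (size_t) atol(argv[2]);
    char *buf = malloc(bufsize);
    int fd = open(argv[1], O_RDONLY);
    long reads = 0;
    long long total = 0;
    ssize_t n;

    if (fd < 0 || buf == NULL)
    {
        perror("setup");
        return 1;
    }

    while ((n = read(fd, buf, bufsize)) > 0)
    {
        reads++;
        total += n;
    }

    /* e.g. an 80MB file: about 10240 read() calls at 8k, 1280 at 64k. */
    printf("%lld bytes in %ld read() calls of %zu bytes each\n",
           total, reads, bufsize);
    close(fd);
    free(buf);
    return 0;
}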
Where the blocksize really shows up dramatically is on raw devices.
So for IO reasons the blocksize should be large (on my box 64k or 256k).
But, such large blocks have a negative effect on buffer cache hit ratio.
Usually you say:
OLTP system --> small block size to maximize buffer usage and simultaneous access
DSS/OLAP systems --> large block size to maximize sequential scan performance
Conclusion:
The size of a block is always a trade-off between IO bandwidth and memory usage.
Therefore having it as a tuneable parameter per instance or database is best.
Andreas