On improving OO support in posgresql and relaxing oid bottleneck at the same time
Hi,
I'm currently under the impression that the following change in the
postgresql system would benefict the overall performance and quality
of the system.
Tuples for a class and all it's derived classes are stored in one file.
Advantages:
- Since all tuples for a given class hierarchy are stored in the same
physical file,
oids now need only be unique to a single inheritance hierarchy
(instead of unique in each posgresql installation).
So no longer is there any _necessity_ for a systemwide unique oid.
(This necessity existed because all objects by definition (in OO
sematics)
must posses the identity property ("this" in C++/Java sometimes also
called "self"),
and because instances of the same hierachy were stored in different files
it was necesary to provide the identity property in a "file independant"
way.
The bottleneck formed by the systemwide unique oid is replaced by a
bottleneck for each inheritance hierarchy within an installation.
If one doesn't use inheritance then it translates to a per table
bottleneck.
(Which is what we have now anyway isn't it?).
- Indices, triggers, contraints, etc. are automatically inherited.
so that we can showcase classic OO semantics (including polymorphism).
- Makes easy implementation of referential integrity for oids possible.
- It becomes possible to store more than 4Giga tuples
on 32 bit systems
- given an instance of a class identified by an oid it is easy to determine
the most derived class it belongs to.
(This feature has been requested by a number of poeple on the
questions list.)
- It is the first step to support tables with no oids at all (not that this
is particularly interesting to me though). I'd suggest that system
catalogues
keep their oids though our we would be in for a major rewrite I think.
Disadvantages
- sequential heapscans for tables _with_ derived classes will be less
efficient
in general, because now some tuples may have to be skipped since they
may
belong to the wrong class. This is easily solved using indices.
- slight space overhead for tuple when not using inheritance.
The space is used to tag each tuple with the most derived class it
belongs to.
To improve OO support the implementation plan is to:
1. Add a system attribute to each heap tuple which identifies the most
derived
class the instance belongs to. (easy; I think)
2. Store instances of derived classes in the same physical file as the top
most base class. I hope that hacking heapopen() to tell it in which file
it should look for tuples of a particular relation will be enough.
Maybe this might have implications for caching etc. which I don't
understand.
(difficult?)
3. modify the heap_scanning functions to support the new sceem. (easy; I
think)
Now for my questions.
- Is implementing the above major surgery?
- Am I missing something important?
- What do you guys think of this?
With regards from Maurice.
I'm currently under the impression that the following change in the
postgresql system would benefict the overall performance and quality
of the system.Tuples for a class and all it's derived classes stored in one file.
I hate to sound like a "small thinker" here, but I'd be concerned about
some issues:
1) true OO semantics are difficult/impossible to accomplish with SQL.
This is one reason why Postgres is probably in the OR realm rather than
true OO.
2) Supporting inheritance using one-file storage probably leads to
larger overhead in _all_ file accesses, not just ones containing
inherited tables. Tuples would now contain a variable number of fields,
with variable definitions, with ... Ack! :)
3) Indices are fundamentally present to speed up access, though we use
them for other purposes too (such as enforcing uniqueness). Perhaps the
topic of inheritance, uniqueness, and referential integrity (foreign
keys, etc) should be solved (or at least discussed) independent of
indices, though indices or index-like structures may be involved in the
solution.
4) imho, the roughest areas of existing (or missing) capability in
Postgres involve array types and types which require additional support
information (such as exact numerics). Focusing on fixing/improving these
areas may lead to cleaning up semantics, mechanisms, and capabilities in
the backend, and make other (more derived?) features such as constraint
inheritance and enforcement easier to implement. Well, it will help
something anyway, even if not constraints :)
- Tom
-----Original Message-----
From: Thomas G. Lockhart <lockhart@alumni.caltech.edu>
To: Maurice Gittens <mgittens@gits.nl>
Cc: hackers@postgreSQL.org <hackers@postgreSQL.org>
Date: zondag 5 april 1998 23:56
Subject: Re: [HACKERS] On improving OO support in posgresql and relaxing oid
bottleneck at the same time
I'm currently under the impression that the following change in the
postgresql system would benefict the overall performance and quality
of the system.Tuples for a class and all it's derived classes stored in one file.
I hate to sound like a "small thinker" here, but I'd be concerned about
some issues:1) true OO semantics are difficult/impossible to accomplish with SQL.
This is one reason why Postgres is probably in the OR realm rather than
true OO.
Ok, let's be more specific. We'll define OO semantics as
support for:
1. identity
(We already have partial support); we just have to get the details right
2. inheritance
We allready have support. I'm suggesting an implementation which
is a better overall choice IMO. Because we avoid the system oid
unique oid requirement while it also provides for an improvent
in the support for polymorphism.
3. polymorphism
partially supported, but some necesary properties are not yet
inherited automatically. I believe overriding triggers
is likely to work automatically. There are however some
choices which we'll have to make.
As far as I see, these concepts can be implemented without
any changes to the current definition of the query language in
postgresql.
Encapsulation would seem to require new syntax. It also
seems not to fully fit in to the the relation model, so
we leave it out.
2) Supporting inheritance using one-file storage probably leads to
larger overhead in _all_ file accesses, not just ones containing
inherited tables. Tuples would now contain a variable number of fields,
with variable definitions, with ... Ack! :)
Yes but this overhead is very small for tables without inheritance.
An extra statement like:
heap_getnext(ScanDesc scanDesc)
{
...
while(!done)
{
tuple = readTuple(...)
...
if (IsInstanceOf(tuple -> reloid, scanDesc -> reloid)
{
return tuple;
}
...
}
The information on inheritance relations between classes can be precomputed
when a heap scandesc is created.
This IMO this overhead is not significant, when there is no inheritance.
When there is inheritance we simple use indices to speed things up,
if it's deemed necesary.
3) Indices are fundamentally present to speed up access, though we use
them for other purposes too (such as enforcing uniqueness). Perhaps the
topic of inheritance, uniqueness, and referential integrity (foreign
keys, etc) should be solved (or at least discussed) independent of
indices, though indices or index-like structures may be involved in the
solution.
Lets consider the following mail to the questions list by Brett McCormick
<brett@work.chicken.org> (copied from the list archive):
I've got a table that has a primary key with a default of
nextval('seq'). I've got another table which inherits this one, but
it fails to inherit the unique btree index. It does inherit the
default value. So, I'm assuming that if I create a unique index for
that field on the child table, it won't keep you from inserting values
that exist in that field in the parent table (and since they both
share the same sequence, that's what I want).So primary keys do not work in this situation. Are there plans to
enhance the inheritance? I have no idea how it works, is it
intelligent? Seems more klunky than not, but I haven't really looked
at the code. Should I stop using inheritance altogether, considering
its drawbacks (no idea what child class it is in when selecting from
parent and all children, no shared indices/pkeys) when I don't select
from them all at once?
This person identifies a number of problems with the current system.
- no idea what child class it is when selecting from parent and all children
- no shared indices/primary keys
- no inheritance of unique attribute etc.
I can also add similar points
- triggers should also be inherited. This gives us polymorphism without
without introducing any new syntax.
- etc.
I agree that conceptually indices are present only for speed. But the
reality is that by inheriting them we give users that which they
expect. (There are more emails like this one to be found on
the questions lists).
I think that what Brett wants to do is legitemate.
Storing the tuples of a same class hierarchy in different files is IMO
an unfortunate design choice of the original implementors
of postgresql.
The suggestion I'm making solves all of Brett's problems.
4) imho, the roughest areas of existing (or missing) capability in
Postgres involve array types and types which require additional support
information (such as exact numerics). Focusing on fixing/improving these
areas may lead to cleaning up semantics, mechanisms, and capabilities in
the backend, and make other (more derived?) features such as constraint
inheritance and enforcement easier to implement. Well, it will help
something anyway, even if not constraints :)
I see that we have similar ideas about where the system should eventually
be. I do however believe that we'll get there by means of cleaning up
the semantics and then using these cleaned semantics to
make the system as a whole more conceptually pure.
In my experience systems which are conceptually pure can be
made to be very efficient.
I think that removing the oid bottleneck, while also solving
a number of fundamental problems (from an OO perspective)
with one and the same change, is a Good Thing (tm) -:).
Thanks, with regards from Maurice.
Import Notes
Resolved by subject fallback
are there some good, human-readable documents that outline these and
other basic OO concepts? I've done some OO programming, but I'm fuzzy
on a lot of issues. sorry to be so off-topic
--brett
On Mon, 6 April 1998, at 12:49:23, Maurice Gittens wrote:
Show quoted text
-----Original Message-----
From: Thomas G. Lockhart <lockhart@alumni.caltech.edu>
To: Maurice Gittens <mgittens@gits.nl>
Cc: hackers@postgreSQL.org <hackers@postgreSQL.org>
Date: zondag 5 april 1998 23:56
Subject: Re: [HACKERS] On improving OO support in posgresql and relaxing oid
bottleneck at the same timeI'm currently under the impression that the following change in the
postgresql system would benefict the overall performance and quality
of the system.Tuples for a class and all it's derived classes stored in one file.
I hate to sound like a "small thinker" here, but I'd be concerned about
some issues:1) true OO semantics are difficult/impossible to accomplish with SQL.
This is one reason why Postgres is probably in the OR realm rather than
true OO.Ok, let's be more specific. We'll define OO semantics as
support for:
1. identity
(We already have partial support); we just have to get the details right
2. inheritance
We allready have support. I'm suggesting an implementation which
is a better overall choice IMO. Because we avoid the system oid
unique oid requirement while it also provides for an improvent
in the support for polymorphism.
3. polymorphism
partially supported, but some necesary properties are not yet
inherited automatically. I believe overriding triggers
is likely to work automatically. There are however some
choices which we'll have to make.As far as I see, these concepts can be implemented without
any changes to the current definition of the query language in
postgresql.Encapsulation would seem to require new syntax. It also
seems not to fully fit in to the the relation model, so
we leave it out.2) Supporting inheritance using one-file storage probably leads to
larger overhead in _all_ file accesses, not just ones containing
inherited tables. Tuples would now contain a variable number of fields,
with variable definitions, with ... Ack! :)Yes but this overhead is very small for tables without inheritance.
An extra statement like:heap_getnext(ScanDesc scanDesc)
{
...
while(!done)
{
tuple = readTuple(...)
...
if (IsInstanceOf(tuple -> reloid, scanDesc -> reloid)
{
return tuple;
}
...
}The information on inheritance relations between classes can be precomputed
when a heap scandesc is created.This IMO this overhead is not significant, when there is no inheritance.
When there is inheritance we simple use indices to speed things up,
if it's deemed necesary.3) Indices are fundamentally present to speed up access, though we use
them for other purposes too (such as enforcing uniqueness). Perhaps the
topic of inheritance, uniqueness, and referential integrity (foreign
keys, etc) should be solved (or at least discussed) independent of
indices, though indices or index-like structures may be involved in the
solution.Lets consider the following mail to the questions list by Brett McCormick
<brett@work.chicken.org> (copied from the list archive):I've got a table that has a primary key with a default of
nextval('seq'). I've got another table which inherits this one, but
it fails to inherit the unique btree index. It does inherit the
default value. So, I'm assuming that if I create a unique index for
that field on the child table, it won't keep you from inserting values
that exist in that field in the parent table (and since they both
share the same sequence, that's what I want).So primary keys do not work in this situation. Are there plans to
enhance the inheritance? I have no idea how it works, is it
intelligent? Seems more klunky than not, but I haven't really looked
at the code. Should I stop using inheritance altogether, considering
its drawbacks (no idea what child class it is in when selecting from
parent and all children, no shared indices/pkeys) when I don't select
from them all at once?This person identifies a number of problems with the current system.
- no idea what child class it is when selecting from parent and all children
- no shared indices/primary keys
- no inheritance of unique attribute etc.I can also add similar points
- triggers should also be inherited. This gives us polymorphism without
without introducing any new syntax.
- etc.I agree that conceptually indices are present only for speed. But the
reality is that by inheriting them we give users that which they
expect. (There are more emails like this one to be found on
the questions lists).I think that what Brett wants to do is legitemate.
Storing the tuples of a same class hierarchy in different files is IMO
an unfortunate design choice of the original implementors
of postgresql.The suggestion I'm making solves all of Brett's problems.
4) imho, the roughest areas of existing (or missing) capability in
Postgres involve array types and types which require additional support
information (such as exact numerics). Focusing on fixing/improving these
areas may lead to cleaning up semantics, mechanisms, and capabilities in
the backend, and make other (more derived?) features such as constraint
inheritance and enforcement easier to implement. Well, it will help
something anyway, even if not constraints :)I see that we have similar ideas about where the system should eventually
be. I do however believe that we'll get there by means of cleaning up
the semantics and then using these cleaned semantics to
make the system as a whole more conceptually pure.In my experience systems which are conceptually pure can be
made to be very efficient.I think that removing the oid bottleneck, while also solving
a number of fundamental problems (from an OO perspective)
with one and the same change, is a Good Thing (tm) -:).Thanks, with regards from Maurice.
-----Original Message-----
From: Maurice Gittens <mgittens@gits.nl>
To: hackers@postgreSQL.org <hackers@postgreSQL.org>
Date: zondag 5 april 1998 21:47
Subject: [HACKERS] On improving OO support in posgresql and relaxing oid
bottleneck at the same time
Hi,
I'm currently under the impression that the following change in the
postgresql system would benefict the overall performance and quality
of the system.Tuples for a class and all it's derived classes are stored in one file.
Advantages:
-- cut --
Disadvantages
- sequential heapscans for tables _with_ derived classes will be less
efficient
in general, because now some tuples may have to be skipped since they
may
belong to the wrong class. This is easily solved using indices.- slight space overhead for tuple when not using inheritance.
The space is used to tag each tuple with the most derived class it
belongs to.
One extra disadvantage of this is that multiple inheritance is only
easily supported if bases classes being inherited from have a common
top most base class. In which all tuples are stored. Otherwise
we'll storagefile independant oid's will become necesary again.
So loosely speaking it still allows for multiple inheritance but only within
a common hierarchy.
With regards from Maurice.
Import Notes
Resolved by subject fallback
-----Original Message-----
From: Maurice Gittens <mgittens@gits.nl>
To: hackers@postgreSQL.org <hackers@postgreSQL.org>
Date: zondag 5 april 1998 21:47
Subject: [HACKERS] On improving OO support in posgresql and relaxing oid
bottleneck at the same timetop most base class. In which all tuples are stored. Otherwise
we'll storagefile independant oid's will become necesary again.So loosely speaking it still allows for multiple inheritance but only within
a common hierarchy.
Just for everyones information. In Illustra, an oid is 64 bits. The low
order 32 bits are (approximately), the row identifier within a table. The
high order 32bits the table identifier (which then works out to be the
same as the oid of the row in the tables table for the table in question).
Oids are unique for the life of the system. Limits are 4G tables, 4G rows
per table.
I for some reason have never bothered to remember, but I think inheritance
is done via separate tables.
-dg
David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
- Linux. Not because it is free. Because it is better.
Maurice Gittens>
I'm currently under the impression that the following change in the
postgresql system would benefict the overall performance and quality
of the system.Tuples for a class and all it's derived classes stored in one file.
I hate to sound like a "small thinker" here, but I'd be concerned about
some issues:
...
2) Supporting inheritance using one-file storage probably leads to
larger overhead in _all_ file accesses, not just ones containing
inherited tables. Tuples would now contain a variable number of fields,
with variable definitions, with ... Ack! :)Yes but this overhead is very small for tables without inheritance.
An extra statement like:
Anything that gets done for every row is on _the_ critical path. Any extra
code here will have a performance penalty. We are already an order of
magnitude too slow on scans. Think in terms of a few hundred instructions
per row.
I will also say that table inheritance is rarely used in real applications.
Partly no doubt this is because the implementation is not wonderful, but
I also think that it may be one of those ideas like time travel that
sound great but in practice noone can figure out a use for it.
heap_getnext(ScanDesc scanDesc)
{
...
while(!done)
{
tuple = readTuple(...)
...
if (IsInstanceOf(tuple -> reloid, scanDesc -> reloid)
{
return tuple;
}
...
}The information on inheritance relations between classes can be precomputed
when a heap scandesc is created.This IMO this overhead is not significant, when there is no inheritance.
When there is inheritance we simple use indices to speed things up,
if it's deemed necesary.
I disagree, all per row overhead is significant. The primary operation in
the system is sifting rows.
But this is just the start of the extra overhead. What about the expression
evaluator trying to determine if this tuple matchs the where clause. Now it
has to determine column offset and type and "Equal" function etc for
each row.
-dg
David Gould dg@illustra.com 510.628.3783 or 510.305.9468
Informix Software (No, really) 300 Lakeside Drive Oakland, CA 94612
- Linux. Not because it is free. Because it is better.