Dreaming About Redesigning SQL

Started by Noname over 22 years ago, 85 messages
#1 Noname
seunosewa@inaira.com

Hi,

This is for relational database theory experts on one hand and
implementers of real-world applications on the other hand. If there was
a chance to start again and design SQL afresh, for the best
cleanness/power/performance, what changes would you make? What would
_your_ query language (and the underlying database concept) look like?

Seun Osewa
PS: I would like to post my ideas for review too, but more
experienced/qualified people should come first

#2 Christopher Browne
cbbrowne@acm.org
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

After takin a swig o' Arrakan spice grog, seunosewa@inaira.com (Seun Osewa) belched out...:

This is for relational database theory experts on one hand and
implementers of real-world applications on the other hand. If there was
a chance to start again and design SQL afresh, for the best
cleanness/power/performance, what changes would you make? What would
_your_ query language (and the underlying database concept) look
like?

There are two notable 'projects' out there:

1. There's Darwen and Date's "Tutorial D" language, defined as part
of their "Third Manifesto" about relational databases.

2. newSQL <http://newsql.sourceforge.net/>, where they are studying
two syntaxes, one based on Java, and one based on a
simplification (to my mind, oversimplification) of SQL.

The "newSQL" project suffers from their definition being something of
a "chip away everything that doesn't look like an elephant"
definition. They aren't defining, in "mathematical" terms, what their
language is supposed to be able to express; they are instead defining
a big grab-bag of minor syntactic features, and seem to expect that a
database system will emerge from that.

In contrast, "Tutorial D" is _all_ about mathematical definition of
what it is supposed to express, and the text is a tough read,
irrespective of other merits.
--
wm(X,Y):-write(X),write('@'),write(Y). wm('cbbrowne','cbbrowne.com').
http://cbbrowne.com/info/thirdmanifesto.html
DOS: n., A small annoying boot virus that causes random spontaneous
system crashes, usually just before saving a massive project. Easily
cured by Unix. See also MS-DOS, IBM-DOS, DR-DOS.
-- from David Vicker's .plan

#3 Noname
seunosewa@inaira.com
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

Thanks for the links.

Christopher Browne <cbbrowne@acm.org> wrote in message news:<blkq9n$d9puv$4@ID-125932.news.uni-berlin.de>...

There are two notable 'projects' out there:

1. There's Darwen and Date's "Tutorial D" language, defined as part
of their "Third Manifesto" about relational databases.

2. newSQL <http://newsql.sourceforge.net/>, where they are studying
two syntaxes, one based on Java, and one based on a
simplification (to my mind, oversimplification) of SQL.

I was able to get a PDF copy of the "Third Manifesto" article here:
http://citeseer.nj.nec.com/darwen95third.html
but the details of Tutorial D seem not to be a part of that article.
NewSQL *might* be cool if someone found a reason to use it in a DBMS.

Sometimes I wonder why it's so important to model data in the
"relational way", to think of data in the form of sets of tuples rather
than tables or lists or whatever. I mean, though it's elegant and based
on mathematical principles, I would like to know why it's the _right_
model to follow in designing a DBMS (or database). The way my mind
sees it, should we not rather be interested in what works?

Seun Osewa

#4 Mike Mascari
mascarm@mascari.com
In reply to: Christopher Browne (#2)
Re: Dreaming About Redesigning SQL

Christopher Browne wrote:

After takin a swig o' Arrakan spice grog, seunosewa@inaira.com (Seun Osewa) belched out...:

This is for relational database theory experts on one hand and
implementers of real-world applications on the other hand. If there was
a chance to start again and design SQL afresh, for the best
cleanness/power/performance, what changes would you make? What would
_your_ query language (and the underlying database concept) look
like?

There are two notable 'projects' out there:

1. There's Darwen and Date's "Tutorial D" language, defined as part
of their "Third Manifesto" about relational databases.

I read the Third Manifesto. There are many ideas in the TTM that have
strong arguments, although I must confess I haven't read any
critiques. A few (of many) points:

1) Strict adherence to the relational model, where all of SQL's
short-comings are addressed:

A) No attribute ordering
B) No tuple ordering (sets aren't ordered)
C) No duplicate tuples (relations are sets)
D) No nulls (2VL sufficient. Missing information is meta-data)
E) No nullological mistakes (e.g., SUM of an empty relation is zero, AVG
is an error; SQL's contrasting behavior is sketched after this list)
F) Generalized transitive closure
G) Declared attribute, relation variable, and database constraints,
including transition constraints
H) Candidate keys required (this has positive logical consequences for
the DBMS implementor)
I) Tuple and relation-valued attributes
J) No tuple-level operations

a bunch more...
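
To make two of these points concrete, here is a small SQL sketch (a
hypothetical one-column table, purely for illustration) showing the SQL
behavior that points C and E react against:

-- SQL tables are bags: without a declared key, duplicate rows are
-- accepted, which point C (relations are sets) forbids.
CREATE TABLE t (x INTEGER);
INSERT INTO t VALUES (1);
INSERT INTO t VALUES (1);             -- second copy accepted

-- Point E: over an empty input, SQL's SUM returns NULL, not zero.
SELECT SUM(x) FROM t WHERE x > 99;    -- NULL in SQL; 0 under TTM's rule
SELECT AVG(x) FROM t WHERE x > 99;    -- NULL in SQL; an error under TTM's rule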

2) The query language should be computationally complete. The user
should be able to author complete applications in the language, rather
than the language being a sublanguage. This reverses the query
sublanguage approach Codd proposed in "A Relational Model of Data for
Large Shared Data Banks"

http://www.acm.org/classics/nov95/s1p5.html

<sarcasm>
Thanks, ACM, for putting just part of the paper on-line, complete with
broken links and spelling errors!
</sarcasm>

3) The language (a D implementation) would ensure a separation between
the logical design of the application and the physical implementation.
The programmer should think in terms of the evaluation of relational
algebraic expressions, not manipulating physical records in disk
blocks in a file.

4) The type system should separate the actual, internal representation
from the possible representations, of which there might be many. For
example, a POINT may be internally expressed in Cartesian coordinates
but may supply both polar and Cartesian THE_ operators.
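
A rough PostgreSQL sketch of the idea (not D&D's actual syntax; the type
and function names here are invented for illustration): store the
Cartesian representation, and expose a polar possible representation
through THE_-style read operators:

-- Invented names: the stored representation is Cartesian, but callers
-- can read the point through polar accessors just as easily.
CREATE TYPE point_xy AS (x double precision, y double precision);

CREATE FUNCTION the_r(point_xy) RETURNS double precision
    AS 'SELECT sqrt(($1).x * ($1).x + ($1).y * ($1).y)'
    LANGUAGE SQL IMMUTABLE;

CREATE FUNCTION the_theta(point_xy) RETURNS double precision
    AS 'SELECT atan2(($1).y, ($1).x)'
    LANGUAGE SQL IMMUTABLE;

SELECT the_r(ROW(3.0, 4.0)::point_xy);   -- 5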

5) The type system should implement D & D's view of multiple
inheritance, where read-operators are inherited but write-operators
aren't. This eliminates the "Is a Circle an Ellipse?" dilemma imposed
by C++, for example. IOW, in a "D" language, a Circle is an Ellipse.

They reject Stonebraker's ideas of OIDs and relation variable
inheritance, which, of course, are in PostgreSQL.

It's a very provocative read. At a minimum, one can learn what to
avoid with SQL. The language looks neat on paper. Perhaps one day
someone will provide an open source implementation. One could envision
a "D" project along the same lines as the same sort of project that
added SQL to Postgres...

But I'd rather have PITR :-)

Mike Mascari
mascarm@mascari.com

#5 Hannu Krosing
hannu@tm.ee
In reply to: Mike Mascari (#4)
Re: Dreaming About Redesigning SQL

Mike Mascari wrote on Saturday, 04.10.2003 at 06:32:

2) The query language should be computationally complete. The user
should be able to author complete applications in the language, rather
than the language being a sublanguage.

To me it seems like requiring that one should be able to author complete
programs in regex.

Yes, when all you have is a hammer everything looks like a nail ;)

----------------
Hannu

#6 Christopher Browne
cbbrowne@acm.org
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

The world rejoiced as mascarm@mascari.com (Mike Mascari) wrote:

It's a very provocative read. At a minimum, one can learn what to
avoid with SQL. The language looks neat on paper. Perhaps one day
someone will provide an open source implementation. One could envision
a "D" project along the same lines as the same sort of project that
added SQL to Postgres...

I think you summed it up nicely. The "manifesto" is a provocative, if
painful, read. It is very useful at pointing out "pointy edges" of
SQL that it would be wise to avoid.

I'm not thrilled with the language; I think they have made a mistake
in trying to make it too abbreviation-oriented. They keep operator
names short, to a fault.

As you say, the most likely way for a "D" to emerge into popular use
would be by someone adding the language to an existing database
system.

There is a project out on SourceForge for a "D implementation," called
"Duro." It takes the opposite approach; the operators are all defined
as C functions, so you write all your code in C. It uses a data store
built atop Berkeley DB.

I think an implementor would be better off using an SQL database
underneath, and using their code layer in between to accomplish the
"divorce" from the aspects of SQL that they disapprove of. Sort of
like MaVerick, a Pick implementation in Java that uses a DBMS such as
PostgreSQL as the underlying data store.

You do a "proof of concept" by building something that translates D
requests to SQL requests. And _then_ get a project going to put a "D
parser" in as an alternative to the SQL parser. (Yes, that
oversimplifies matters. Tough...)
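
For a taste of what such a translation layer might emit (an invented
example, not something any existing project actually produces), a
D-style project-and-join over suppliers S(sno, sname) and shipments
SP(sno, pno, qty) could map to SQL roughly as:

-- DISTINCT restores set semantics, since SQL query results are bags:
SELECT DISTINCT sno, qty
FROM s NATURAL JOIN sp;
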
--
let name="cbbrowne" and tld="ntlug.org" in name ^ "@" ^ tld;;
http://www3.sympatico.ca/cbbrowne/rdbms.html
Rules of the Evil Overlord #81. "If I am fighting with the hero atop a
moving platform, have disarmed him, and am about to finish him off and
he glances behind me and drops flat, I too will drop flat instead of
quizzically turning around to find out what he saw."
<http://www.eviloverlord.com/>

#7 Lee Fesperman
firstsql@ix.netcom.com
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

Seun Osewa wrote:

Sometimes I wonder why it's so important to model data in the
"relational way", to think of data in the form of sets of tuples rather
than tables or lists or whatever. I mean, though it's elegant and based
on mathematical principles, I would like to know why it's the _right_
model to follow in designing a DBMS (or database). The way my mind
sees it, should we not rather be interested in what works?

Relational is the _right_ model because 'it works'. It's the only truly comprehensive
data model and the subject of decades of research. All other data models have been
found to be flawed and (nearly) discarded.

If you don't care for mathematical principles, there are always ad-hoc database models.
Check out Pick, OO and XML databases. They're interested in what works and ignore
elegance and mathematical principles.

--
Lee Fesperman, FirstSQL, Inc. (http://www.firstsql.com)
==============================================================
* The Ultimate DBMS is here!
* FirstSQL/J Object/Relational DBMS (http://www.firstsql.com)

#8 Noname
lauri.pietarinen@atbusiness.com
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

Christopher Browne <cbbrowne@acm.org> wrote in message news:<m3lls1vzfc.fsf@wolfe.cbbrowne.com>...

I think an implementor would be better off using an SQL database
underneath, and using their code layer in between to accomplish the
"divorce" from the aspects of SQL that they disapprove of.

That is, in fact, the approach taken in a product called Dataphor
(see www.alphora.com). They have implemented a "D" language (called D4)
that translates into SQL and hence uses an underlying SQL Server, Oracle,
or DB2 DBMS as the engine.

It is, however, not a very easy mapping to do and you have to resort
to all sorts of unclean stuff to make it work...

regards,
Lauri Pietarinen

#9 Anthony W. Youngman
thewolery@nospam.demon.co.uk
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

In article <ba87a3cf.0310031759.42dce77c@posting.google.com>, Seun Osewa
<seunosewa@inaira.com> writes

Thanks for the links.

Christopher Browne <cbbrowne@acm.org> wrote in message
news:<blkq9n$d9puv$4@ID-125932.news.uni-berlin.de>...

There are two notable 'projects' out there:

1. There's Darwen and Date's "Tutorial D" language, defined as part
of their "Third Manifesto" about relational databases.

2. newSQL <http://newsql.sourceforge.net/>, where they are studying
two syntaxes, one based on Java, and one based on a
simplification (to my mind, oversimplification) of SQL.

I was able to get a PDF copy of the "Third Manifesto" article here:
http://citeseer.nj.nec.com/darwen95third.html
but the details of Tutorial D seem not to be a part of that article.
NewSQL *might* be cool if someone found a reason to use it in a DBMS.

Is Darwen and Date's stuff where they said SQL was crap? As I
understand it, within about a year of SQL's design, at least one of
Codd and Date said it was rubbish and tried to replace it with something
"better".

Sometimes I wonder why it's so important to model data in the
"relational way", to think of data in the form of sets of tuples rather
than tables or lists or whatever. I mean, though it's elegant and based
on mathematical principles, I would like to know why it's the _right_
model to follow in designing a DBMS (or database). The way my mind
sees it, should we not rather be interested in what works?

I couldn't agree more (of course I would). As I like to put it, surely
Occam's Razor says that stuffing the four-dimensional world into a
flat-earth database can't be the optimal solution!

The trouble with so many SQL advocates is that they are so convinced of
the mathematical rightness of the relational model that they forget it
is a *model* and, as such, needs to be shown to be relevant to the real
world.

That said, I always think relationally when designing databases - it
helps. Look at the multi-value databases. Think relationally and you can
still store your data in normal form, but you're not stuffed by all the
irrelevant restrictions that relational databases tend to impose.

Get a freebie copy of jBASE, UniVerse or UniData, and try them out :-)

Cheers,
Wol
--
Anthony W. Youngman <pixie@thewolery.demon.co.uk>
'Yings, yow graley yin! Suz ae rikt dheu,' said the blue man, taking the
thimble. 'What *is* he?' said Magrat. 'They're gnomes,' said Nanny. The man
lowered the thimble. 'Pictsies!' Carpe Jugulum, Terry Pratchett 1998
Visit the MaVerick web-site - <http://www.maverick-dbms.org&gt; Open Source Pick

#10 Anthony W. Youngman
thewolery@nospam.demon.co.uk
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

In article <3F7F8E1A.474@ix.netcom.com>, Lee Fesperman
<firstsql@ix.netcom.com> writes

If you don't care for mathematical principles, there are always ad-hoc
database models. Check out Pick, OO and XML databases. They're interested
in what works and ignore elegance and mathematical principles.

Mathematical principles? You mean like Euclidean Geometry and Newtonian
Mechanics? They're perfectly solid, good, mathematically correct. Shame
they don't actually WORK all the time in the real world.

That's what I feel about relational, too ...

Cheers,
Wol
--
Anthony W. Youngman - wol at thewolery dot demon dot co dot uk
Witches are curious by definition and inquisitive by nature. She moved in. "Let
me through. I'm a nosey person.", she said, employing both elbows.
Maskerade : (c) 1995 Terry Pratchett

#11 Noname
bbadour@golden.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Anthony W. Youngman" <thewolery@nospam.demon.co.uk> wrote in message news:<xTDLP1CFRIg$Ewjw@thewolery.demon.co.uk>...

In article <3F7F8E1A.474@ix.netcom.com>, Lee Fesperman
<firstsql@ix.netcom.com> writes

If you don't care for mathematical principles, there are always ad-hoc
database models. Check out Pick, OO and XML databases. They're interested
in what works and ignore elegance and mathematical principles.

Mathematical principles? You mean like Euclidean Geometry and Newtonian
Mechanics? They're perfectly solid, good, mathematically correct. Shame
they don't actually WORK all the time in the real world.

That's what I feel about relational, too ...

That explains the generally poor quality of your posts. You substitute
emotion for reason.

#12 Mike Sherrill
MSherrill@compuserve.com
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

On 3 Oct 2003 21:39:03 GMT, Christopher Browne <cbbrowne@acm.org>
wrote:

There are two notable 'projects' out there:

1. There's Darwen and Date's "Tutorial D" language, defined as part
of their "Third Manifesto" about relational databases.

2. newSQL <http://newsql.sourceforge.net/>, where they are studying
two syntaxes, one based on Java, and one based on a
simplification (to my mind, oversimplification) of SQL.

ISTR that Terry Halpin (of ORM fame) designed a language named
"ConQuer". I don't know the details, but I think Date's latest
edition refers to it in a note. Halpin's working on Visio at
Microsoft now, I think.

--
Mike Sherrill
Information Management Systems

#13 Dann Corbit
DCorbit@connx.com
In reply to: Mike Sherrill (#12)
Re: Dreaming About Redesigning SQL

-----Original Message-----
From: Seun Osewa [mailto:seunosewa@inaira.com]
Sent: Friday, October 03, 2003 11:52 AM
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] Dreaming About Redesigning SQL

Hi,

This is for relational database theory experts on one hand
and implementers of real-world applications on the other hand.
If there was a chance to start again and design SQL afresh,
for the best cleanness/power/performance, what changes would you
make? What would _your_ query language (and the underlying
database concept) look like?

Seun Osewa
PS: I would like to post my ideas for review too, but more
experienced/qualified people should come first

I imagine you have read the 3rd database manifesto by Codd. I think
he's gone off the deep end a bit. You don't just throw away a trillion
dollars' worth of effort and tools to make things mathematically
orthogonal.

However, on some things he is clearly right. For instance, null values
are evil. Programmers understand it, but end users will *always* be
surprised that the results of:

SELECT COUNT(shirts) FROM clothing WHERE color = 'blue'
SELECT COUNT(shirts) FROM clothing WHERE NOT color = 'blue'

do not add up to the number of shirts in the inventory if any color
fields are null.
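
A quick demonstration (the table and rows here are invented):

CREATE TABLE clothing (shirts VARCHAR(20), color VARCHAR(10));
INSERT INTO clothing VALUES ('t-shirt', 'blue');
INSERT INTO clothing VALUES ('polo',    'red');
INSERT INTO clothing VALUES ('oxford',  NULL);

SELECT COUNT(shirts) FROM clothing WHERE color = 'blue';      -- 1
SELECT COUNT(shirts) FROM clothing WHERE NOT color = 'blue';  -- 1
-- 1 + 1 = 2, yet the inventory holds 3 shirts: for the NULL row both
-- predicates evaluate to UNKNOWN, so it satisfies neither.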

Therefore, his idea of using default values instead and never using null
is a good one.

If SQL vendors would follow the ANSI/ISO standard to the letter, and
implement the latest iteration, that would solve all of the problems
that SQL tool users have to face.

#14 Noname
seunosewa@inaira.com
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

I have tried, twice, to download the evaluation version of the Alphora
product for testing, and it doesn't work. I guess there would be a lot
to learn from playing with it; the product is more than an RDBMS.

Regards,
Seun Osewa

lauri.pietarinen@atbusiness.com (Lauri Pietarinen) wrote:


That is, in fact, the approach taken in a product called Dataphor
(see www.alphora.com). They have implemented a "D" language (called D4)
that translates into SQL and hence uses an underlying SQL Server, Oracle,
or DB2 DBMS as the engine.

regards,
Lauri Pietarinen

#15 Noname
bbadour@golden.net
In reply to: Dann Corbit (#13)
Re: Dreaming About Redesigning SQL

DCorbit@connx.com ("Dann Corbit") wrote in message news:<D90A5A6C612A39408103E6ECDD77B829408BE9@voyager.corporate.connx.com>...

-----Original Message-----
From: Seun Osewa [mailto:seunosewa@inaira.com]
Sent: Friday, October 03, 2003 11:52 AM
To: pgsql-hackers@postgresql.org
Subject: [HACKERS] Dreaming About Redesigning SQL

Hi,

This is for relational database theory experts on one hand
and implementers of real-world applications on the other hand.
If there was a chance to start again and design SQL afresh,
for the best cleanness/power/performance, what changes would you
make? What would your query language (and the underlying
database concept) look like?

Seun Osewa
PS: I would like to post my ideas for review too, but more
experienced/qualified people should come first

I imagine you have read the 3rd database manifesto by Codd. I think
he's gone off the deep end a bit.

Dann, you are showing your ignorance. While Dr. Codd recently died, if
you think he wrote a third database manifesto, you have definitely
gone off the deep end yourself.

You don't just throw away a trillion
dollars' worth of effort and tools to make things mathematically
orthogonal.

Again, you are showing your ignorance. Nobody has ever suggested
anything even remotely resembling the above.

However, on some things he is clearly right. For instance, null values
are evil.

Dr. Codd believed we need two NULLs. You ascribe correctness to the
one thing I think he clearly got wrong.

Programmers understand it

That's an absurd assertion.

Therefore, his idea of using default values instead and never using null
is a good one.

That is not his idea.

If SQL vendors would follow the ANSI/ISO standard to the letter, and
implement the latest iteration, that would solve all of the problems
that SQL tool users have to face.

Upon what do you base this ridiculous opinion?

#16 Noname
lauri.pietarinen@atbusiness.com
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

seunosewa@inaira.com (Seun Osewa) wrote in message news:<ba87a3cf.0310080256.11846ef3@posting.google.com>...

I have tried, twice, to download the evaluation version of the Alphora
product for testing, and it doesn't work. I guess there would be a lot
to learn from playing with it; the product is more than an RDBMS.

Aw, that's unfortunate. It took me a while to get it working.
It is in fact an integrated application development environment where
you can define a great part of your application in a declarative
fashion.

regards,
Lauri Pietarinen

#17 Noname
dwolt@iserv.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

Good question. Although I would want to move away from relational
databases too, if there is an RDBMS and one wants to query it, what
would I aim for? If you look at XQuery, you will see an example of
what I would definitely NOT aim for. Although the user of such a
language might very well be a technical person, instead of starting
with mathematics (relational calculus, relational algebra) I would
suggest starting with language. The mathematics of language is more
complex than the mathematics of relations, particularly simple
relations (such as 1NF tables).

If you look at the history of data persistence prior to Codd's 1970
ACM paper, you will see several attempts at this. One I have studied
of late is GIRLS (Generalized Information Retrieval Language and
System), specified by Don Nelson and implemented by several folks, with
the most famous being Dick Pick. The GIRLS language was specified a
full 40 years ago and lives today in many IT shops under a variety of
about 10 different names, including IBM's UniQuery and Retrieve (for
UniData and UniVerse respectively). This language is flawed, as are
all, but so very close to what I would think would be a good approach.
It was written at TRW in order to make it possible for the military in
Vietnam to query their data without technical folks in the field.
It went into production in 1969 with the US Army. Prior to the end of
the Cold War, it and the associated database were used by the CIA to
track and query about Russian spies in the US.

I would suggest ditching the entire relational model (as both overly
simplistic in its theory and overly complex in its implementation) and
starting with English (that is one of the other names for the GIRLS
language). Note that language is also the starting point for putting
data in XML documents, but it sure doesn't seem to be the starting
point for XQuery, eh?

--dawn
Dawn M. Wolthuis
www.tincat-group.com

#18 Bob Badour
bbadour@golden.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

With all due respect, Dawn, you are an idiot.

"Dawn M. Wolthuis" <dwolt@iserv.net> wrote in message
news:6db906b2.0310091212.4f967cf5@posting.google.com...


Good question. Although I would want to move away from relational
databases too, if there is an RDBMS and one wants to query it, what
would I aim for? If you look at XQuery, you will see an example of
what I would definitely NOT aim for. Although the user of such a
language might very well be a technical person, instead of starting
with mathematics (relational calculus, relational algebra) I would
suggest starting with language. The mathematics of language is more
complex than the mathematics of relations, particularly simple
relations (such as 1NF tables).

If you look at the history of data persistence prior to Codd's 1970
ACM paper, you will see several attempts at this. One I have studied
of late is GIRLS (Generalized Information Retrieval Language and
System), specified by Don Nelson and implemented by several folks, with
the most famous being Dick Pick. The GIRLS language was specified a
full 40 years ago and lives today in many IT shops under a variety of
about 10 different names, including IBM's UniQuery and Retrieve (for
UniData and UniVerse respectively). This language is flawed, as are
all, but so very close to what I would think would be a good approach.
It was written at TRW in order to make it possible for the military in
Vietnam to query their data without technical folks in the field.
It went into production in 1969 with the US Army. Prior to the end of
the Cold War, it and the associated database were used by the CIA to
track and query about Russian spies in the US.

I would suggest ditching the entire relational model (as both overly
simplistic in its theory and overly complex in its implementation) and
starting with English (that is one of the other names for the GIRLS
language). Note that language is also the starting point for putting
data in XML documents, but it sure doesn't seem to be the starting
point for XQuery, eh?

--dawn
Dawn M. Wolthuis
www.tincat-group.com

#19 Anith Sen
anith@bizdatasolutions.com
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

>> The mathematics of language is more complex than the mathematics of
relations, particularly simple relations (such as 1NF tables). <<

Are you sure, you know what you are talking about?

>> I would suggest ditching the entire relational model (as both overly
simplistic in its theory and overly complex in its implementation.. <<

Incredible! How about reading some books on the subject?

--
-- Anith
( Please reply to newsgroups only )

#20 Christopher Browne
cbbrowne@acm.org
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

A long time ago, in a galaxy far, far away, dwolt@iserv.net (Dawn M. Wolthuis) wrote:

Good question. Although I would want to move away from relational
databases too, if there is an RDBMS and one wants to query it, what
would I aim for? If you look at XQuery, you will see an example of
what I would definitely NOT aim for. Although the user of such a
language might very well be a technical person, instead of starting
with mathematics (relational calculus, relational algebra) I would
suggest starting with language. The mathematics of language is more
complex than the mathematics of relations, particularly simple
relations (such as 1NF tables).

No, that _very much_ gets things backwards.

You need to have a clearly defined model of how the data is to be
manipulated before it makes any sense to try to make up a language.
--
output = reverse("moc.enworbbc" "@" "enworbbc")
http://cbbrowne.com/info/rdbms.html
To iterate is human; to recurse, divine.

#21 Noname
seunosewa@inaira.com
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

dwolt@iserv.net (Dawn M. Wolthuis) wrote in message news:<6db906b2.0310091212.4f967cf5@posting.google.com>...

I would suggest ditching the entire relational model (as both overly
simplistic in its theory and overly complex in its implementation) and
starting with English (that is one of the other names for the GIRLS
language). Note that language is also the starting point for putting
data in XML documents, but it sure doesn't seem to be the starting
point for XQuery, eh?

--dawn
Dawn M. Wolthuis
www.tincat-group.com

Please explain further. What do you really mean? It's natural for
everyone here to think every word in that post was a troll unless you
explain your views more clearly. You could not have expressed a more
unpopular/unsupportable combination of ideas! Exactly how would we go
about using language as a query tool? Is this AI? What would the
underlying model be, knowing how redundant and imprecise language can
be? Tell us what we may have missed.

Seun Osewa

#22 Noname
dwolt@iserv.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

While I definitely agree that the mathematics of the data persistence
mechanism is not as important to me as whether it works or not, as a
former mathematician I have done a little study related to the
mathematics of non-relational approaches, such as PICK (the one both
Wol and I have been known to advocate on behalf of).

These models tend to start with language rather than mathematics. What
started out as my attempt to show such things as the fact that a
PICK file is more like a mathematical RELATION than an RDBMS table
ended with me studying the mathematics of language for a short time - one
can see that the mathematics of language, which is what we are storing
when working with text-based objects, is much more complex than simple
relations.

By the way, in case you are wondering how PICK files are more like
mathematical relations -- they do have a numbered position for each
domain (in other words, there is a location for each column within a
row, as there is a location in a PICK ITEM/RECORD); they do not by
default impose a constraint on the length of values in a given domain
(a quite unnecessary database constraint); and they permit relations
as elements within a relation -- there is no mathematical requirement
that a relation be in first normal form, for example.
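
As a loose modern analogue (an invented schema, with PostgreSQL arrays
standing in for PICK multivalues), a single record can carry a nested
list of values, and the flat 1NF view can still be recovered on demand:

CREATE TABLE customer (
    id     integer PRIMARY KEY,
    name   text,
    phones text[]              -- a multivalued field, as in a PICK ITEM
);
INSERT INTO customer VALUES (1, 'Smith', ARRAY['555-0100', '555-0199']);

-- unnest() flattens the nested values back into 1NF rows when needed:
SELECT id, unnest(phones) AS phone FROM customer;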

I do tire of the thought that a database premised on the relational
model is somehow more mathematically accurate than those premised on a
language model. PICK, like XML, was designed to make it easy to think
about storing and retrieving text. If you set aside the need for
storing other objects for now and focus on text-based data
persistence, it is simply a means to persist propositions. If you were
to normalize your sentences before you said them, you might guess that
people would have a harder time figuring out what you were saying.
Similarly, normalizing data before persisting it tends to make it
difficult to retrieve the original propositions, reconstructing
language from normalized data.

It's time to move on from the relational model -- it was a good
academic exercise, but it has not proven a very agile means for
persisting and retrieving propositions, methinks. smiles. --dawn

"Anthony W. Youngman" <thewolery@nospam.demon.co.uk> wrote in message news:<xTDLP1CFRIg$Ewjw@thewolery.demon.co.uk>...


In article <3F7F8E1A.474@ix.netcom.com>, Lee Fesperman
<firstsql@ix.netcom.com> writes

If you don't care for mathematical principles, there are always ad-hoc
database models. Check out Pick, OO and XML databases. They're interested
in what works and ignore elegance and mathematical principles.

Mathematical principles? You mean like Euclidean Geometry and Newtonian
Mechanics? They're perfectly solid, good, mathematically correct. Shame
they don't actually WORK all the time in the real world.

That's what I feel about relational, too ...

Cheers,
Wol

#23 Noname
dwolt@iserv.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

Thank you, Seun, for asking your question with a bit of logic and not
gut-reaction emotional baggage (and for also asking a question of me
off-list so I could ramble). I'll try to make this more succinct.

First of all, I have read Codd's 1970 & 1974 ACM papers, as well as
his "The Relational Model for Database Management, Version 2" book. I
have also read several books by Chris Date and by Fabian Pascal. I
held a dialog (that reads more like a monologue) with Pascal and it is
reproduced in total if you scroll to the bottom of the
http://store.tincat-group.com page and click on the Dick Pick / Ted
Codd Blue Brothers parody picture. I had not been reading this news
group until lately, but it strikes me that this is a group that might
be very entertained by that dialog.

I have a master's degree in mathematics and my father is a linguist.
I find the language of mathematics and the mathematics of language
both fascinating. My experience, however, is that I have run IT
project teams working with a variety of databases (and languages) and
have never seen any environment that is as agile for developers (both
productive from the start and easy to maintain) as the teams I have
led that worked with the UniData database. I have worked with SQL as
well as both older and newer database languages.

So, nope, I'm not trolling. I've been doing some research the past
couple of years and I'm convinced that it is time to do something new
(and yet old) with data persistence.

I favor using Java for a variety of reasons, but am comfortable with
other languages as well, and think that using Java both for the
software application and for the constraints on the data, rather than
encoding constraints in some other language within a database, makes
for both a more agile development approach AND, surprisingly enough,
tends to make for better data integrity, although a lousy software
developer can certainly mess up either environment. Separating the
DBA from the software developer has definitely had a negative effect
on the speed with which software is developed and maintained, but my
experience (and intuition -- I don't, yet, have scientific evidence)
tells me that the benefits purported for the approach of having a DBA
work on some centralized constraints on the persisted data outside of
the context of the use of that data have not really come to fruition
and/or are not worth the costs of using this approach (more on that
some other time).

So, while some might classify me as an idiot (men can be so emotional
sometimes ;-), I have several graduate classes in logic to my credit
and believe that I am approaching this topic quite logically, even if
my summaries skip some of the logical steps in the process. I have
thought about how to prove my points, and since my point is really
about agility and quality in application software development and
maintenance, a competition to see which tools, techniques, and data
persistence approaches win might be the best proof. The current
industry benchmarks for databases tend to be SQL-based and highly
political, so let's put different approaches to the test.

Thanks for asking your question and not just assuming I'm a nut
because I disagree with the current state of the industry on this
topic. I'm sure there are gaps in my thinking and I know some of my
opinions are based on intuition that arises from my experience, but I
do hope to have more proof in the future. I am also very willing to
adjust my opinions given convincing arguments and evidence, and trust
that there are some on this list who work similarly.
--dawn

seunosewa@inaira.com (Seun Osewa) wrote in message news:<ba87a3cf.0310092217.72098544@posting.google.com>...


dwolt@iserv.net (Dawn M. Wolthuis) wrote in message news:<6db906b2.0310091212.4f967cf5@posting.google.com>...

I would suggest ditching the entire relational model (as both overly
simplistic in its theory and overly complex in its implementation) and
starting with English (that is one of the other names for the GIRLS
language). Note that language is also the starting point for putting
data in XML documents, but it sure doesn't seem to be the starting
point for XQuery, eh?

--dawn
Dawn M. Wolthuis
www.tincat-group.com

Please explain further. What do you really mean? It's natural for
everyone here to think every word in that post was a troll unless you
explain your views more clearly. You could not have expressed a more
unpopular/unsupportable combination of ideas! Exactly how would we go
about using language as a query tool? Is this AI? What would the
underlying model be, knowing how redundant and imprecise language can
be? Tell us what we may have missed.

Seun Osewa

#24 Bob Badour
bbadour@golden.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Dawn M. Wolthuis" <dwolt@iserv.net> wrote in message
news:6db906b2.0310100813.65b872d8@posting.google.com...

Thank you, Seun, for asking your question with a bit of logic and not
gut-reaction emotional baggage (and for also asking a question of me
off-list so I could ramble). I'll try to make this more succinct.

With all due respect, Dawn, you are an idiot. That statement has nothing to
do with my viscera or with my emotions; I write it with complete dispassion.
You simply are an idiot, and Seun simply lacks sufficient education to
recognize that fact from what you have written.

#25 Noname
dwolt@iserv.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

Bob Badour" <bbadour@golden.net> wrote in message news:<O6WdnZ4P-_6I4RWiXTWJlg@golden.net>...

"Dawn M. Wolthuis" <dwolt@iserv.net> wrote in message
news:6db906b2.0310100813.65b872d8@posting.google.com...

Thank you, Seun, for asking your question with a bit of logic and not
gut-reaction emotional baggage (and for also asking a question of me
off-list so I could ramble). I'll try to make this more succinct.

With all due respect, Dawn, you are an idiot. That statement has nothing to
do with my viscera or with my emotions; I write it with complete dispassion.
You simply are an idiot, and Seun simply lacks sufficient education to
recognize that fact from what you have written.

Thanks for the clarification, Bob. What seemed like an emotional
outburst from me was actually a logically-derived statement regarding
my capacity then, right? If I were you, I might be just a tad bit
concerned that I have taught many engineers calculus and you are
flying on planes or driving over bridges they designed. Let's hope
they learned it well in spite of me, eh? smiles.
--dawn

Dawn M. Wolthuis

#26 Noname
michael@preece.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

dwolt@iserv.net (Dawn M. Wolthuis) wrote in message news:<6db906b2.0310130618.6920e737@posting.google.com>...

Bob Badour" <bbadour@golden.net> wrote in message news:<O6WdnZ4P-_6I4RWiXTWJlg@golden.net>...

"Dawn M. Wolthuis" <dwolt@iserv.net> wrote in message
news:6db906b2.0310100813.65b872d8@posting.google.com...

Thank you, Seun, for asking your question with a bit of logic and not
gut-reaction emotional baggage (and for also asking a question of me
off-list so I could ramble). I'll try to make this more succinct.

With all due respect, Dawn, you are an idiot. That statement has nothing to
do with my viscera or with my emotions; I write it with complete dispassion.
You simply are an idiot, and Seun simply lacks sufficient education to
recognize that fact from what you have written.

Thanks for the clarification, Bob. What seemed like an emotional
outburst from me was actually a logically-derived statement regarding
my capacity then, right? If I were you, I might be just a tad bit
concerned that I have taught many engineers calculus and you are
flying on planes or driving over bridges they designed. Let's hope
they learned it well in spite of me, eh? smiles.
--dawn

Dawn M. Wolthuis

Dawn: I'm sure you're not an idiot.

Bob: Do pigs really like people to wrestle in mud with them?

#27 Bob Badour
bbadour@golden.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Mike Preece" <michael@preece.net> wrote in message
news:1b0b566c.0310131551.47ca0d47@posting.google.com...

dwolt@iserv.net (Dawn M. Wolthuis) wrote in message
news:<6db906b2.0310130618.6920e737@posting.google.com>...

"Bob Badour" <bbadour@golden.net> wrote in message
news:<O6WdnZ4P-_6I4RWiXTWJlg@golden.net>...

"Dawn M. Wolthuis" <dwolt@iserv.net> wrote in message
news:6db906b2.0310100813.65b872d8@posting.google.com...

Thank you, Seun, for asking your question with a bit of logic and not
gut-reaction emotional baggage (and for also asking a question of me
off-list so I could ramble). I'll try to make this more succinct.

With all due respect, Dawn, you are an idiot. That statement has nothing
to do with my viscera or with my emotions; I write it with complete
dispassion. You simply are an idiot, and Seun simply lacks sufficient
education to recognize that fact from what you have written.

Thanks for the clarification, Bob. What seemed like an emotional
outburst from me was actually a logically-derived statement regarding
my capacity then, right? If I were you, I might be just a tad bit
concerned that I have taught many engineers calculus and you are
flying on planes or driving over bridges they designed. Let's hope
they learned it well in spite of me, eh? smiles.
--dawn

Dawn M. Wolthuis

Dawn: I'm sure you're not an idiot.

Mike, you have expressed certainty contradicting the patently obvious
before. Your opinion is only a useful gauge insofar as you are wrong so much
more frequently than you are correct.

#28 Noname
michael@preece.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Bob Badour" <bbadour@golden.net> wrote in message news:<XJGdnSwNRq8UxxaiU-KYuA@golden.net>...

"Mike Preece" <michael@preece.net> wrote in message
news:1b0b566c.0310131551.47ca0d47@posting.google.com...

dwolt@iserv.net (Dawn M. Wolthuis) wrote in message
news:<6db906b2.0310130618.6920e737@posting.google.com>...

"Bob Badour" <bbadour@golden.net> wrote in message
news:<O6WdnZ4P-_6I4RWiXTWJlg@golden.net>...

"Dawn M. Wolthuis" <dwolt@iserv.net> wrote in message
news:6db906b2.0310100813.65b872d8@posting.google.com...

Thank you, Seun, for asking your question with a bit of logic and not
gut-reaction emotional baggage (and for also asking a question of me
off-list so I could ramble). I'll try to make this more succinct.

With all due respect, Dawn, you are an idiot. That statement has nothing
to do with my viscera or with my emotions; I write it with complete
dispassion. You simply are an idiot, and Seun simply lacks sufficient
education to recognize that fact from what you have written.

Thanks for the clarification, Bob. What seemed like an emotional
outburst from me was actually a logically-derived statement regarding
my capacity then, right? If I were you, I might be just a tad bit
concerned that I have taught many engineers calculus and you are
flying on planes or driving over bridges they designed. Let's hope
they learned it well in spite of me, eh? smiles.
--dawn

Dawn M. Wolthuis

Dawn: I'm sure you're not an idiot.

Mike, you have expressed certainty contradicting the patently obvious
before. Your opinion is only a useful gauge insofar as you are wrong so much
more frequently than you are correct.

Dawn: Bob professes to respect referential integrity!?

#29 Noname
genew@mail.ocis.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

seunosewa@inaira.com (Seun Osewa) wrote:

[snip]

Sometimes I wonder why it's so important to model data in the
"relational way", to think of data in the form of sets of tuples rather
than tables or lists or whatever. I mean, though it's elegant and based
on mathematical principles, I would like to know why it's the _right_
model to follow in designing a DBMS (or database). The way my mind
sees it, should we not rather be interested in what works?

How do you know it works? Without the theory and model, you
really do not.

Sincerely,

Gene Wirchenko

Computerese Irregular Verb Conjugation:
I have preferences.
You have biases.
He/She has prejudices.

#30 Anthony W. Youngman
thewolery@nospam.demon.co.uk
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

In article <3f8cbee1.1656673@shawnews>, Gene Wirchenko
<genew@mail.ocis.net> writes

seunosewa@inaira.com (Seun Osewa) wrote:

[snip]

Sometimes I wonder why it's so important to model data in the
"relational way", to think of data in the form of sets of tuples rather
than tables or lists or whatever. I mean, though it's elegant and based
on mathematical principles, I would like to know why it's the _right_
model to follow in designing a DBMS (or database). The way my mind
sees it, should we not rather be interested in what works?

How do you know it works? Without the theory and model, you
really do not.

And don't other databases have both theory and model?

It's just that all the academics have been brainwashed into thinking
this is true only for relational, so that's what they teach to everyone
else, and the end result is that all research is ploughed into a model
that may be (I didn't say "is") bankrupt. Just like the academics were
brainwashed into thinking that microkernels were the be-all and end-all
- until Linus showed them by practical example that they were all idiots
:-)

Cheers,
Wol
--
Anthony W. Youngman - wol at thewolery dot demon dot co dot uk
Witches are curious by definition and inquisitive by nature. She moved in. "Let
me through. I'm a nosey person.", she said, employing both elbows.
Maskerade : (c) 1995 Terry Pratchett

#31 Noname
topmind@technologist.com
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

seunosewa@inaira.com (Seun Osewa) wrote in message news:<ba87a3cf.0310031052.77315052@posting.google.com>...

Hi,

This is for relational database theory experts on one hand and
implementers of real-world applications on the other hand. If there was
a chance to start again and design SQL afresh, for the best
cleanness/power/performance, what changes would you make? What would
_your_ query language (and the underlying database concept) look like?

Seun Osewa
PS: I would like to post my ideas for review too, but more
experienced/qualified people should come first

Some ideas, links, and complaints about SQL and remaking
or replacing it:

http://www.c2.com/cgi/wiki?SqlFlaws

-T-

#32 Christopher Browne
cbbrowne@acm.org
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

Quoth "Anthony W. Youngman" <thewolery@nospam.demon.co.uk>:

In article <3f8cbee1.1656673@shawnews>, Gene Wirchenko
<genew@mail.ocis.net> writes

seunosewa@inaira.com (Seun Osewa) wrote:

[snip]

Sometimes I wonder why it's so important to model data in the
"relational way", to think of data in the form of sets of tuples rather
than tables or lists or whatever. I mean, though it's elegant and based
on mathematical principles, I would like to know why it's the _right_
model to follow in designing a DBMS (or database). The way my mind
sees it, should we not rather be interested in what works?

How do you know it works? Without the theory and model, you
really do not.

And don't other databases have both theory and model?

It's just that all the academics have been brainwashed into thinking
this is true only for relational, so that's what they teach to
everyone else, and the end result is that all research is ploughed
into a model that may be (I didn't say "is") bankrupt. Just like the
academics were brainwashed into thinking that microkernels were the
be-all and end-all - until Linus showed them by practical example
that they were all idiots :-)

In mathematics as well as in the analysis of computer algorithms, it
is typical for someone who is trying to explain something new to try
to do so in terms that allow the gentle reader to make as direct a
comparison as possible between the things with which they are familiar
(e.g. - in this case, relational database theory) and the things with
which they are perhaps NOT familiar (e.g. - in this case, MV
databases).

Nobody seems to have been prepared to explain the MV model in adequate
theoretical terms as to allow the gentle readers to compare the theory
behind it with the other theories out there.

I'm afraid that does not reflect very well on either those lauding MV
or those trashing it.

- Those lauding it have not made an attempt to show why the theory
behind it would support it being preferable to the other models
around.

I hear some vague "Oh, it's not about models; it's about language"
which doesn't get to the heart of anything.

- And all we get from Bob Badour are dismissive sound-bites that
_don't_ explain why he should be taken seriously. Indeed, the
sharper and shorter he gets, the less credible that gets.

There are no pointers to "Michael Stonebraker on Why Pick Is Not My
Favorite Database." Brian Kernighan felt the issues with Pascal
were important enough that he wrote a nice, approachable paper that
quite cogently describes the problems with Standard
Pascal. <http://www.lysator.liu.se/c/bwk-on-pascal.html> He nicely
summarizes it with 9 points that fit on a sheet of paper.

If Bob wanted people to take him really seriously about this, and
has done all the research to back up the points that are apparently
so obvious to him, then it should surely be _easy_ to write up "Nine
Reasons Pick Isn't My Favorite Database System."

And just as people have been pointing back to Kernighan's paper on
Pascal for over 20 years, folks could point back to the "Pick"
essay.

But apparently it is much too difficult for anyone to present any
_useful_ discourse on it.
--
(reverse (concatenate 'string "ac.notelrac.teneerf" "@" "454aa"))
http://cbbrowne.com/info/nondbms.html
For a good prime call:
391581 * 2^216193 - 1
-- smr2@cornell.edu (Szymon Rusinkiewicz)

#33 Bob Badour
bbadour@golden.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Christopher Browne" <cbbrowne@acm.org> wrote in message
news:bmpoap$oc47b$1@ID-125932.news.uni-berlin.de...

Quoth "Anthony W. Youngman" <thewolery@nospam.demon.co.uk>:

In article <3f8cbee1.1656673@shawnews>, Gene Wirchenko
<genew@mail.ocis.net> writes

seunosewa@inaira.com (Seun Osewa) wrote:

[snip]

Sometimes I wonder why it's so important to model data in the
"relational way", to think of data in the form of sets of tuples rather
than tables or lists or whatever. I mean, though it's elegant and based
on mathematical principles, I would like to know why it's the _right_
model to follow in designing a DBMS (or database). The way my mind
sees it, should we not rather be interested in what works?

How do you know it works? Without the theory and model, you
really do not.

And don't other databases have both theory and model?

It's just that all the academics have been brainwashed into thinking
this is true only for relational, so that's what they teach to
everyone else, and the end result is that all research is ploughed
into a model that may be (I didn't say "is") bankrupt. Just like the
academics were brainwashed into thinking that microkernels were the
be-all and end-all - until Linus showed them by practical example
that they were all idiots :-)

In mathematics as well as in the analysis of computer algorithms, it
is typical for someone who is trying to explain something new to try
to do so in terms that allow the gentle reader to make as direct a
comparison as possible between the things with which they are familiar
(e.g. - in this case, relational database theory) and the things with
which they are perhaps NOT familiar (e.g. - in this case, MV
databases).

Nobody seems to have been prepared to explain the MV model in adequate
theoretical terms as to allow the gentle readers to compare the theory
behind it with the other theories out there.

I'm afraid that does not reflect very well on either those lauding MV
or those trashing it.

- Those lauding it have not made an attempt to show why the theory
behind it would support it being preferable to the other models
around.

I hear some vague "Oh, it's not about models; it's about language"
which doesn't get to the heart of anything.

- And all we get from Bob Badour are dismissive sound-bites that
_don't_ explain why he should be taken seriously. Indeed, the
sharper and shorter he gets, the less credible that gets.

There are no pointers to "Michael Stonebraker on Why Pick Is Not My
Favorite Database." Brian Kernighan felt the issues with Pascal
were important enough that he wrote a nice, approachable paper that
quite cogently describes the problems with Standard
Pascal. <http://www.lysator.liu.se/c/bwk-on-pascal.html> He nicely
summarizes it with 9 points that fit on a sheet of paper.

If Bob wanted people to take him really seriously about this, and
has done all the research to back up the points that are apparently
so obvious to him, then it should surely be _easy_ to write up "Nine
Reasons Pick Isn't My Favorite Database System."

And just as people have been pointing back to Kernighan's paper on
Pascal for over 20 years, folks could point back to the "Pick"
essay.

But apparently it is much too difficult for anyone to present any
_useful_ discourse on it.

How many times do I have to repeat the same points?

I dislike Pick because it lacks logical identity, confuses the physical and
the logical, lacks a robust query language, lacks physical independence,
lacks logical independence and damages brains.

#34 Oliver Elphick
olly@lfix.co.uk
In reply to: Christopher Browne (#32)
Re: Dreaming About Redesigning SQL

On Fri, 2003-10-17 at 22:52, Christopher Browne wrote:

Nobody seems to have been prepared to explain the MV model in adequate
theoretical terms as to allow the gentle readers to compare the theory
behind it with the other theories out there.

I'm not convinced that there was a great deal of theory behind Dick
Pick's database design. It has always struck me as very pragmatic.

In terms of storage, the substantial difference between MV and
relational databases is that each MV file (=table) holds, or can hold,
what would be the result of a join in a relational database.
Wherever we use arrays in PostgreSQL, we effectively do the same thing.

The advantages of MV are that it is very simple to program and to define
your data. If you want another attribute (=column), you simply define a
new dictionary entry with a new attribute number; data output formatting
can be done simply by defining new dictionary items which present the
same data in a different way. The MV characteristic makes it very fast
to get (for example) a list of invoices for a particular customer, since
the list of invoice numbers can be kept as part of the customer record.
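
For instance (a sketch with an invented schema), the MV-style layout
keeps the invoice numbers inside the customer row, so the lookup is a
single-row fetch rather than a join:

CREATE TABLE customer (
    cust_no     integer PRIMARY KEY,
    name        text,
    invoice_nos integer[]   -- the would-be join result, stored in the row
);

SELECT invoice_nos FROM customer WHERE cust_no = 42;   -- one row fetch

-- versus the normalized layout, which answers the same question with a
-- query against a separate invoice table:
-- SELECT invoice_no FROM invoice WHERE cust_no = 42;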

The disadvantages (at least of original PICK) are: there are no
constraints (not even by typecasting); there can be no relational
enquiries -- everything must be defined in the dictionary; the
environment is utterly undisciplined -- anything can be changed at will
with a simple text editor; even more than in MySQL, all data validation
must be done by programming; there is no requirement for a record in a
file to correspond at all to the structure defined in its dictionary;
finally, the security model was laughable.

The effects of this can be seen in many places whose applications are
based on PICK. There is usually a mass of programs of various ages,
with no certainty that all have the same view of the database
structure. The database design is often very amateurish; frequently it
truly is amateur, because the simplicity of programming makes it easy
for users to roll their own systems, but they usually lack the necessary
experience in data analysis. Security usually depends on user
ignorance; in UniVerse migrations of old PICK databases, I have often
seen entire directories of important data with 777 permissions, and with
everyone using the same login.

Good use of MV requires the imposition of disciplined programming in an
environment which is profoundly hostile to such discipline. It is not
really possible to guarantee data integrity.

There are some advances on this in some implementations. I know
UniVerse: it provides SQL and adds it on top of the existing MV
structure; it also provides transactions. These features give some of
the advantages of a relational database, provided that only SQL
facilities are used, but I doubt if many people have used UniVerse to
build SQL systems from scratch. I feel that SQL was provided more to
satisfy the box tickers who compare tenders than with a serious
intention of providing data integrity.

Having used both SQL and MV, I would not now design in any MV
implementation known to me a system whose data I valued.

--
Oliver Elphick Oliver.Elphick@lfix.co.uk
Isle of Wight, UK http://www.lfix.co.uk/oliver
GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839 932A 614D 4C34 3E1D 0C1C
========================================
"Delight thyself also in the LORD; and he shall give
thee the desires of thine heart." Psalms 37:4

#35 Anthony W. Youngman
thewolery@nospam.demon.co.uk
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

In article <bmpoap$oc47b$1@ID-125932.news.uni-berlin.de>, Christopher
Browne <cbbrowne@acm.org> writes

How do you know it works? Without the theory and model, you
really do not.

And don't other databases have both theory and model?

It's just that all the academics have been brainwashed into thinking
this is true only for relational, so that's what they teach to
everyone else, and the end result is that all research is ploughed
into a model that may be (I didn't say "is") bankrupt. Just like the
academics were brainwashed into thinking that microkernels were the
be-all and end-all - until Linus showed them by practical example
that they were all idiots :-)

In mathematics as well as in the analysis of computer algorithms, it
is typical for someone who is trying to explain something new to try
to do so in terms that allow the gentle reader to do as direct a
comparison as possible between the things with which they are familiar
(e.g. - in this case, relational database theory) and the things with
which they are perhaps NOT familiar (e.g. - in this case, MV
databases).

Nobody seems to have been prepared to explain the MV model in adequate
theoretical terms as to allow the gentle readers to compare the theory
behind it with the other theories out there.

I'm afraid that does not reflect very well on either those lauding MV
or those trashing it.

I think one MAJOR problem is that most (if not all) MV practitioners are
not formally qualified in computing ... for example by education I'm a
chemist. And I'm doing postgrad in medical science ...

The trouble is - we MV'ers tend to take an engineering approach - we use
it because we know it works. To quote you from another post ...

When people _don't_ do that "thinking differently," we are certain to
see hideous performance, and that is neither a SQL issue nor a
"relational" issue. The point is that if they are accessing a big
pile of data, they have to think carefully [jumping to that "different
way of thinking"] irrespective of what specific language(s),
libraries, or other tools they are using.

Well, as far as we MV'ers are concerned, performance IS a problem with
the relational approach. The attitude (as far as I can tell) with
relational is to hide the actual DB implementation from the programmers.
So it is a design "flaw" that it is extremely easy for a programmer to
do something stupid. And you need a DBA to try and protect the database
from the programmers!

As soon as a requirement for a database specifies extraction of the
maximum power from the box, it OUGHT to rule out all the current
relational databases. MV flattens it for performance. As an MV
programmer, I *KNOW* that I can find any thing I'm looking for (or find
out it doesn't exist) with just ONE disk seek. A relational programmer
has to ask the db "does this exist" and hope the db is optimised to be
able to return the result quickly. To quote the Pick FAQ "SQL optimises
the easy task of finding stuff in memory. Pick optimises the hard task
of getting it into memory in the first place".

"Relational" is all about theory and proving things mathematically
correct. "MV" is all about engineering and getting the result. And if
that means pinching all the best ideas we can find from relational, then
we're engineers - of course we'll do it :-)

"Think different". Think Engineering, not Maths. And for $DEITY's sake
stop going on about science. Unless you can use set theory to predict
the future, relational has nothing to do with science ...

Cheers,
Wol
--
Anthony W. Youngman - wol at thewolery dot demon dot co dot uk
Witches are curious by definition and inquisitive by nature. She moved in. "Let
me through. I'm a nosey person.", she said, employing both elbows.
Maskerade : (c) 1995 Terry Pratchett

#36Josh Berkus
josh@agliodbs.com
In reply to: Anthony W. Youngman (#35)
Re: Dreaming About Redesigning SQL

Anthony,

And don't other databases have both theory and model?

Actually, no, the "new" databases do not.

The relational model is backed by relational algebra and relational calculus,
plus a series of postulates and laws which have been refined and tested over
20 years.

Neither Object-Oriented databases, nor XML "databases", nor Multi-Value
databases have any body of theory behind them, mathematical or otherwise. I defy you
to post a single paper that has a mathematical theory for MV or OODB, or even
a firm set of laws that govern such a database. Nor is the industry moving
toward developing such a theory; instead the marketeers of commercial OODB
and XMLDB use a lot of ink to denigrate the idea of mathematical theory as
antiquated and stuffy, and in one case even used their advertising clout to
drive critical theorists off the pages of IT magazines (see Fabian Pascal's
web page).

Actually, amusingly enough, there is a body of theory backing XML databases,
but it's not one any current developer would want to admit to: the XML
database is functionally identical to the Network Databases of the 1960's.
Of course, Network Databases gave way, historically, to Relational for good
reason.

And MV databases, despite decades of existence, never developed any theory
behind them at all, possibly because one is not possible; MV databases are
entirely an ad-hoc creation designed to work around decades-old limits in
computer processing. Pick is merely TextMagic revived and put on the web.

Now, OODB could certainly *develop* a model and theory, and I think it's high
time it did. The Zope project has amply demonstrated the usefulness of OODB
for certain applications. But as long as there is no OODB calculus, and no
industry-agreed model, and no ANSI standard language or interface, each and
every OODB will be 100% incompatible with every other one ... severely
limiting their utility.

The importance of theory, model, and standards can *not* be overstated in an
industry where every year the industry-favorite commercial databases get more
ad-hoc, drift further from the theory, and grow more callous in their
disregard of international standards.

FWIW, I share your dissatisfaction with SQL, but because it's not relational
enough rather than the other way around.

--
Josh Berkus
Aglio Database Solutions
San Francisco

#37Lauri Pietarinen
lauri.pietarinen@atbusiness.com
In reply to: Anthony W. Youngman (#35)
Re: Dreaming About Redesigning SQL

Anthony W. Youngman wrote:

Well, as far as we MV'ers are concerned, performance IS a problem with
the relational approach. The attitude (as far as I can tell) with
relational is to hide the actual DB implementation from the programmers.
So it is a design "flaw" that it is extremely easy for a programmer to
do something stupid. And you need a DBA to try and protect the database
from the programmers!

As soon as a requirement for a database specifies extraction of the
maximum power from the box, it OUGHT to rule out all the current
relational databases. MV flattens it for performance. As an MV
programmer, I *KNOW* that I can find any thing I'm looking for (or find
out it doesn't exist) with just ONE disk seek. A relational programmer
has to ask the db "does this exist" and hope the db is optimised to be
able to return the result quickly. To quote the Pick FAQ "SQL optimises
the easy task of finding stuff in memory. Pick optimises the hard task
of getting it into memory in the first place".

So in your opinion, is the problem

1) SQL is so hard that the average programmer will not know how to use it
efficiently
or
2) Relational (or SQL-) DBMS'es are just too slow

If 2) then why don't we get a bit more concrete. Could you give
an example of a query that in your experience would be too slow using
a standard SQL database (e.g. Oracle, or MySQL). We could then
actually try it out on some machine and compare. I suggest using
the customer-order-order_detail-product database

If 1) I would like to hear some concrete examples.

best regards,
Lauri Pietarinen

#38Sailesh Krishnamurthy
sailesh@cs.berkeley.edu
In reply to: Josh Berkus (#36)
Re: Dreaming About Redesigning SQL

"Josh" == Josh Berkus <josh@agliodbs.com> writes:

This is an unfair characterization of XML databases, and I can say
this without accusations of bias for I vastly prefer working with the
relational model.

Josh> Actually, amusingly enough, there is a body of theory
Josh> backing XML databases, but it's not one any current developer
Josh> would want to admit to: the XML database is functionally
Josh> identical to the Network Databases of the 1960's. Of
Josh> course, Network Databases gave way, historically, to
Josh> Relational for good reason.

If you look at the academic research work, there have been gazillions
of recent papers on XML database technology. All the major database
vendors (Oracle, IBM and Microsoft) are investing fairly heavily in
core-engine XMLDB technology.

Finally, while it is true that some of XML db technology is evocative
of network databases, XML databases are certainly more than network
databases. For one, they are semi-structured ... in addition they
present query-language access to their data (although I'm not a big
fan of XQuery).

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

#39Christopher Browne
cbbrowne@acm.org
In reply to: Josh Berkus (#36)
Re: Dreaming About Redesigning SQL

In an attempt to throw the authorities off his trail,
sailesh@cs.berkeley.edu (Sailesh Krishnamurthy) transmitted:

"Josh" == Josh Berkus <josh@agliodbs.com> writes:

This is an unfair characterization of XML databases, and I can say
this without accusations of bias for I vastly prefer working with
the relational model.

Josh> Actually, amusingly enough, there is a body of theory
Josh> backing XML databases, but it's not one any current developer
Josh> would want to admit to: the XML database is functionally
Josh> identical to the Network Databases of the 1960's. Of
Josh> course, Network Databases gave way, historically, to
Josh> Relational for good reason.

If you look at the academic research work, there have been
gazillions of recent papers on XML database technology. All the
major database vendors (Oracle, IBM and Microsoft) are investing
fairly heavily in core-engine XMLDB technology.

Ah, but do "papers" honestly indicate the emergence of some underlying
theoretical model for which fidelity could be evaluated?

Or merely that academics are looking to write papers on whatever
topics can attract research funding?

Half the articles in SIGOPS have been about pretend applications of
Java to operating systems; why does it seem likely that the "database
academics" are any less affected by this?

I haven't yet seen a coherent "XML theory" emerge from the chaos.

It's not too dissimilar from the "object chaos;" the only works I am
aware of that try hard to provide theory behind "object orientedness"
are _A Theory of Objects_ by Abadi and Cardelli, and Benjamin Pierce's
book, _Types and Programming Languages_. After twenty-odd years of
object oriented programming, I find it quite appalling that there is
as little theoretical OO literature as there is.

Finally, while it is true that some of XML db technology is
evocative of network databases, XML databases are certainly more
than network databases. For one, they are semi-structured ... in
addition they present query-language access to their data (although
I'm not a big fan of XQuery).

CODASYL had a query system, albeit something that looked more like
assembly language than anything else...
--
output = reverse("gro.gultn" "@" "enworbbc")
http://www.ntlug.org/~cbbrowne/nonrdbms.html
Rules of the Evil Overlord #63. "Bulk trash will be disposed of in
incinerators, not compactors. And they will be kept hot, with none of
that nonsense about flames going through accessible tunnels at
predictable intervals." <http://www.eviloverlord.com/&gt;

#40Josh Berkus
josh@agliodbs.com
In reply to: Sailesh Krishnamurthy (#38)
Re: Dreaming About Redesigning SQL

Sailesh,

Warning: I get carried away in this response. I'm afraid that I'm a fond
reader of Fabian Pascal and CJ Date, so I have far too much to say on the
topic. So if you really care about XML databases, you should probably hold
off on reading the rest until you're well-caffeinated and in a cheerful frame
of mind.

Also, let me clarify that there is a distinction between using XML *as a*
database, and putting XML documents into databases of other types. I find
the latter obvious and sensible, but the former a silly and wrong-headed
idea, and it's the pure-XML-database which I attack below.

If you want to really have this out, I live in San Francisco and I love to
argue. Coffee at Intermezzo? I'll buy.

-------------------------------

If you look at the academic research work, there have been gazillions
of recent papers on XML database technology.

Point me to one which presents an algebra, calculus, or other mathematical
underpinning of XML databases, and I will be happy to eat my words on this
list. I can easily find lots of papers using google, but all of them are
about *technical implementation* and do not provide a theoretical
underpinning for XML databases.

A few (such as Dan Suciu's paper) present some theory to back XQuery but it is
presented entirely as an XML-based data access extension to SQL ... a role
which seems fine to me.

Other authors, even those cited by xmldb.org, have rather lukewarm
things to say on the topic; David Mertz, PhD, for example:
(http://www-106.ibm.com/developerworks/library/x-matters8/index.html)

"XML is an extremely versatile data transport format, but despite high hopes
for it, XML is mediocre to poor as a data storage and access format. ..."
<snip>
" ...XML has no inherent mechanism for enforcing constraints of this sort
(DTDs and schemas are constraints of a different, more limited sort). Without
constraints, you just have data, not a data model (to slightly oversimplify
matters). ..." <snip>
" ... In other words, go ahead and be excited by XML's promise of a universal
data transport mechanism, but keep your backend data on something designed
for it, like DB2 or Oracle (or on Postgres or MySQL for smaller-scale
systems)."

And this guy is cited by XMLDB.org? Perhaps not surprising, as among the 5
goals of XMLDB.org, development of a standard theory of XML databases is not
present.

All the major database
vendors (Oracle, IBM and Microsoft) are investing fairly heavily in
core-engine XMLDB technology.

So? Oracle, IBM and Microsoft also have SQL databases that do a terrible job
of upholding the SQL standard, and their (at least Oracle's and Microsoft's)
adherence is getting worse with successive versions rather than better. I
wouldn't look to them for guidance.

If they're spending millions on XML databases, it's because the idea,
however wrong-headed, is a fad, and fads mean sales, and they don't want to take a
chance on missing out. And these companies have backed plenty of useless
technologies before; remember Microsoft's "Periodicals on CD"?

Not that I'm against XML; as far as I'm concerned, for interchangeable,
searchable, and archival documents, XML is the greatest thing since sliced
Beatles. I love XML-RPC for pushing data through HTTP, and I will happily
be in the cheering squad for anyone who writes a set of OSS tools to extract
data from XML docs stored in a PostgreSQL database, or to automate
some-standard-XML-to-relational-data-and-back conversion. That is a good
application of XML+Database ideas.

XML databases, on the other hand, are an example of taking a good idea too
far. XML is a great data transmission tool; it's a great document
transformation tool; it's a good way to store documents. It is not,
however, a good database.
------------------------------------------------------
--
Josh Berkus
Aglio Database Solutions
San Francisco

#41Sailesh Krishnamurthy
sailesh@cs.berkeley.edu
In reply to: Christopher Browne (#39)
Re: Dreaming About Redesigning SQL

"Christopher" == Christopher Browne <cbbrowne@acm.org> writes:

Christopher> Ah, but do "papers" honestly indicate the emergence
Christopher> of some underlying theoretical model for which
Christopher> fidelity could be evaluated?

Certainly. The model is that of semi-structured data, where oftentimes
the schema isn't very clearly defined. It's a different model from
the relational model - which I'm partial to. There are situations
where the XML model does make sense.

Christopher> Or merely that academics are looking to write papers
Christopher> on whatever topics can attract research funding?

Bash academics if you want. The truth is that industry is also working
on it.

As I said before, I have no axe to grind in this. I might be in
academia now, but frankly I couldn't give a toss about XML.

Christopher> Half the articles in SIGOPS have been about pretend
Christopher> applications of Java to operating systems; why does
Christopher> it seem likely that the "database academics" are any
Christopher> less affected by this?

I think you are looking at the wrong publishing location :-) The
premier venues for the OS community are the SOSP and OSDI
conferences. Please look at the SOSP04 papers - you'll find fairly
good systems work.

BTW, I'm not sure which database papers you read - the premier venues
for database systems work are the SIGMOD and VLDB conferences.

Christopher> CODASYL had a query system, albeit something that
Christopher> looked more like assembly language than anything

Please take a fair look at the XQuery data model and the XQuery
language before comparing it with CODASYL. I will not admit (at least
in public :-) to being a big fan of XQuery but that is because of
certain details, not anything fundamental.

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

#42Paul Vernon
paul.vernon@ukk.ibmm.comm
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Lauri Pietarinen" <lauri.pietarinen@atbusiness.com> wrote in message
news:bmutga$jdk$1@nyytiset.pp.htv.fi...

Anthony W. Youngman wrote:

Well, as far as we MV'ers are concerned, performance IS a problem with
the relational approach. The attitude (as far as I can tell) with
relational is to hide the actual DB implementation from the programmers.
So it is a design "flaw" that it is extremely easy for a programmer to
do something stupid. And you need a DBA to try and protect the database
from the programmers!

As soon as a requirement for a database specifies extraction of the
maximum power from the box, it OUGHT to rule out all the current
relational databases. MV flattens it for performance. As an MV
programmer, I *KNOW* that I can find any thing I'm looking for (or find
out it doesn't exist) with just ONE disk seek. A relational programmer
has to ask the db "does this exist" and hope the db is optimised to be
able to return the result quickly. To quote the Pick FAQ "SQL optimises
the easy task of finding stuff in memory. Pick optimises the hard task
of getting it into memory in the first place".

So in your opinion, is the problem

1) SQL is so hard that the average programmer will not know how to use it
efficiently
or
2) Relational (or SQL-) DBMS'es are just too slow

No, I think Anthony is just saying that he doesn't "believe" in science/the
scientific method. Or maybe he believes that engineering is not based on
scientific knowledge!

"Think different". Think Engineering, not Maths. And for $DEITY's sake
stop going on about science. Unless you can use set theory to predict
the future, relational has nothing to do with science ...

Regards
Paul Vernon
Business Intelligence, IBM Global Services

#43Christopher Browne
cbbrowne@libertyrms.info
In reply to: Josh Berkus (#36)
Re: Dreaming About Redesigning SQL

sailesh@cs.berkeley.edu (Sailesh Krishnamurthy) writes:

"Christopher" == Christopher Browne <cbbrowne@acm.org> writes:

Christopher> Ah, but do "papers" honestly indicate the emergence
Christopher> of some underlying theoretical model for which
Christopher> fidelity could be evaluated?

Certainly. The model is that of semi-structured data, where often
times schema isn't very clearly defined. It's a different model from
the relational model - which I'm partial to. There are situations
where the XML model does make sense.

Christopher> Or merely that academics are looking to write
Christopher> papers on whatever topics can attract research
Christopher> funding?

Bash academics if you want. The truth is that industry is also
working on it.

As I said before, I have no axe to grind in this. I might be in
academia now, but frankly I couldn't give a toss about XML.

Fifteen years ago, there were _vital_ things going on in academia;
Stonebraker's group was assortedly working on Postgres, Ingres, and
Cohera. _Real_ OS research was going on, with active work on such
systems as Sprite, BSD, Mach, with _many_ schools building their own
OS kernels to explore different edges.

The problem is that with the dearth of funding for "basic research,"
academia seems to be getting used as a partly-government-funded
"development" area for the developments that industry hopes might work
but where they would rather evade responsibility for the failures.

Christopher> Half the articles in SIGOPS have been about pretend
Christopher> applications of Java to operating systems; why does
Christopher> it seem likely that the "database academics" are any
Christopher> less affected by this?

I think you are looking at the wrong publishing location :-) The
premier venues for the OS community are the SOSP and OSDI
conferences. Please look at the SOSP04 papers - you'll find fairly
good systems work.

When the first paragraph of the annual report contains the phrase

... despite the "operating systems" in our name ...

I can't see this boding well for the area of "operating systems
research" being particularly vital. They have had to expand it to
some notion of "systems" that is sufficiently wide open that I'm not
sure what it _wouldn't_ admit as "relevant."

BTW, I'm not sure which database papers you read - the premier
venues for database systems work are the SIGMOD and VLDB
conferences.

I haven't been subscribing to anything for a couple years now, TODS
having pretty much fallen into irrelevance. I suppose I need to look
to see if there is something worth following again.

Christopher> CODASYL had a query system, albeit something that
Christopher> looked more like assembly language than anything

Please take a fair look at the XQuery data model and the XQuery
language before comparing it with CODASYL. I will not admit (at
least in public :-) to being a big fan of XQuery but that is because
of certain details, not anything fundamental.

I have used it a bit for extracting data out of SOAP requests; I
wasn't terribly impressed with it, finding that with the way it was
remapping data, I would commonly have to code my own "tree walker" to
do things _right_.

It's not hard to do; when I had to modify my Galeon "bookmark walker"
a few months ago when they changed from a custom XML form to RDF, I
was pleasantly surprised to find that most of the changes involved
changing names of tags in DEFCONSTANT declarations. (Guess the
language :-).) Of course, "web bookmark" systems are likely to be
structured similarly, so that this change would be easy shouldn't be
much of a surprise.

But when you need to write code, that indicates that there is some
weakness to the model...
--
let name="cbbrowne" and tld="libertyrms.info" in String.concat "@" [name;tld];;
<http://dev6.int.libertyrms.com/&gt;
Christopher Browne
(416) 646 3304 x124 (land)

#44Anthony W. Youngman
thewolery@nospam.demon.co.uk
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

In article <bmutga$jdk$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Anthony W. Youngman wrote:

Well, as far as we MV'ers are concerned, performance IS a problem with
the relational approach. The attitude (as far as I can tell) with
relational is to hide the actual DB implementation from the programmers.
So it is a design "flaw" that it is extremely easy for a programmer to
do something stupid. And you need a DBA to try and protect the database
from the programmers!

As soon as a requirement for a database specifies extraction of the
maximum power from the box, it OUGHT to rule out all the current
relational databases. MV flattens it for performance. As an MV
programmer, I *KNOW* that I can find any thing I'm looking for (or find
out it doesn't exist) with just ONE disk seek. A relational programmer
has to ask the db "does this exist" and hope the db is optimised to be
able to return the result quickly. To quote the Pick FAQ "SQL optimises
the easy task of finding stuff in memory. Pick optimises the hard task
of getting it into memory in the first place".

So in your opinion, is the problem

1) SQL is so hard that the average programmer will not know how to use it
efficiently

Nope

or
2) Relational (or SQL-) DBMS'es are just too slow

Yes.

If 2) then why don't we get a bit more concrete. Could you give
an example of a query that in your experience would be too slow using
a standard SQL database (e.g. Oracle, or MySQL). We could then
actually try it out on some machine and compare. I suggest using
the customer-order-order_detail-product database

Okay. Give me a FORMULA that returns a time in seconds for your query.

Let's assume I want to print a statement of how many invoices were sent
to a customer, along with various details of those invoices. My invoice
file is indexed by company/month, and we can reasonably assume that the
time taken to produce the statement is infinitesimal compared to the
time taken to retrieve the invoice data from disk. For MV

T = (2 + N) * ST * 1.05

Where T is the time taken to produce the report, N is the number of
invoices, and ST is the hard disk seek time.

I've assumed I have to access the company details as well, hence the 2
(1 for company, 1 for the index). I've also assumed that the data isn't
cached in RAM, which I think is reasonable if we assume the hardware is
being stressed.
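
(As a worked example of the formula, with figures invented purely for
illustration: N = 50 invoices and a seek time ST of 10 ms give
T = (2 + 50) * 0.010 * 1.05 = 0.546, i.e. about 0.55 seconds, essentially
one seek per invoice plus the 5% overhead.)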

If 1) I would like to hear some concrete examples.

It's 2, so ...

But as I understand relational theory, such a question is completely
outside the scope of the theory. Seeing as it tries to divorce the
database logic from the practical implementation ...

And you know it's been proven that Huffman coding is the most efficient
compression algorithm? (Actually, it isn't - it's been proven it can't
be improved upon, which isn't the same thing...). Can you improve on the
formula I've just given you? Given that if we could change the 1.05 to 1
then we can prove it can't be improved upon ... again - I've taken the
liberty of assuming that a MV FILE is equivalent to an entity if we
assume the relational designer has been thinking in an entity-attribute-
relation sort of way. My maths isn't good enough to prove it, but I
think it would be pretty easy to prove that accessing data as "one and
only one complete entity" at a time is the most efficient way.

best regards,
Lauri Pietarinen

Looking forward to you coming up with maths that can prove relational
can even EQUAL MV :-)

Cheers,
Wol
--
Anthony W. Youngman - wol at thewolery dot demon dot co dot uk
Witches are curious by definition and inquisitive by nature. She moved in. "Let
me through. I'm a nosey person.", she said, employing both elbows.
Maskerade : (c) 1995 Terry Pratchett

#45Anthony W. Youngman
thewolery@nospam.demon.co.uk
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

In article <bn0j82$1gnm$1@gazette.almaden.ibm.com>, Paul Vernon
<paul.vernon@ukk.ibmm.comm> writes

No, I think Anthony is just saying that he doesn't "believe" in science/the
scientific method. Or maybe he believes that engineering is not based on
scientific knowledge!

Actually, I *DO* believe in the Scientific Method.

I just fail to see the connection between Scientific Method and
Relational. The former is Science, the latter is Maths. Please tell me
how I can use relational theory to predict the future. Without that,
relational is unprovable, and hence unscientific.

Note I didn't say relational is *incorrect* - the ideas of
"mathematically correct" and "scientifically provable" are orthogonal,
and have nothing to say about each other.

Cheers,
Wol
--
Anthony W. Youngman - wol at thewolery dot demon dot co dot uk
Witches are curious by definition and inquisitive by nature. She moved in. "Let
me through. I'm a nosey person.", she said, employing both elbows.
Maskerade : (c) 1995 Terry Pratchett

#46Lauri Pietarinen
lauri.pietarinen@atbusiness.com
In reply to: Anthony W. Youngman (#44)
Re: Dreaming About Redesigning SQL

Anthony W. Youngman wrote:

In article <bmutga$jdk$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

So in your opinion, is the problem

1) SQL is so hard that the average programmer will not know how to use it
efficiently

Nope

or
2) Relational (or SQL-) DBMS'es are just too slow

Yes.

If 2) then why don't we get a bit more concrete. Could you give
an example of a query that in your experience would be too slow using
a standard SQL database (e.g. Oracle, or MySQL). We could then
actually try it out on some machine and compare. I suggest using
the customer-order-order_detail-product database

Okay. Give me a FORMULA that returns a time in seconds for your query.

Let's assume I want to print a statement of how many invoices were sent
to a customer, along with various details of those invoices. My invoice
file is indexed by company/month, and we can reasonably assume that the
time taken to produce the statement is infinitesimal compared to the
time taken to retrieve the invoice data from disk. For MV

T = (2 + N) * ST * 1.05

Where T is the time taken to produce the report, N is the number of
invoices, and ST is the hard disk seek time.

First of all, it is important to note that a key component of all modern
SQL-DBMS's is the buffer pool (or cache), meaning that in a reasonably
well-tuned database you get very few disk I/O's, even when *writing*
data into tables.

SQL-DBMS's are also very clever at using indexes, i.e. if they can find
all the necessary data in an index they will not even look at the table,
so to speak.

And, even presuming conservatively that there is no data in the cache,
depending on how the data is clustered you will get more than one row
per disk read (= 8K in most(?) systems).

So, assuming the (simplified) example

Customer(cust_id, .....)
Order(order_id, cust_id,...)
OrderDetail(order_id, prod_id, ...)
Product(prod_id,....)

If you created a clustering index on
Customer(cust_id)
Order(cust_id)
OrderDetail(order_id)

And presumed that the average row lengths are
customer = 1K
order = 500 bytes
orderDetail = 300 bytes

You would get, with 3 I/O's
- 8 customer rows
- 16 order rows
- 24 order detail rows (which would only apply to one order)

so, granted, that would result in one I/O per order which is more than
in your example.
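
As a concrete sketch of that clustering step in current PostgreSQL (table
and index names invented; "orders" rather than "Order" because ORDER is a
reserved word, and note that PostgreSQL's CLUSTER is a one-off physical
reordering rather than a continuously maintained clustering index):

CREATE INDEX orders_cust_idx ON orders (cust_id);
CLUSTER orders USING orders_cust_idx;      -- rewrite the table in cust_id order

CREATE INDEX detail_order_idx ON order_detail (order_id);
CLUSTER order_detail USING detail_order_idx;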

I could now denormalise OrderDetail so that it contains cust_id also
and cluster by cust_id
(might cause you trouble down the road, if you can change the customer
of an order), in which case, with 3 I/O's I would get
- 8 customer rows
- 16 order rows
- 24 order detail rows (which would all apply to one customer)

Now the amount of I/O's would depend on how many detail rows
we have per customer.

And, of course, because we are using sequential prefetch, we would be
getting more than one I/O block (8?, 16?) per disk seek, so it's a hard
comparison to make, but I suspect that it would be about equal to your
example.

Now, that was a *conservative* estimate, and we assumed that we did not
have any rows lying around in the (global!) cache. As the size of the
cache grows in proportion to the size of the total database, we can
assume less and less disk I/O.

Note also that the cache can be configured in many ways: you can put
different tables (or indexes) in different caches, and even change the
size of the cache on the fly (you might want a bigger cache during the
evening and night when your batch programs are running), so you can rig
your system to favour certain types of queries.

I haven't even gone into the topic of using thick (covering) indexes, so
that table access can be avoided entirely (= we read into memory only the
interesting columns).
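
For instance, a "thick" (covering) index might look like this in current
PostgreSQL; the names are invented, and the amount column is hypothetical
since the example schema elides the detail columns:

CREATE INDEX detail_prod_idx ON order_detail (prod_id) INCLUDE (amount);

-- With amount carried in the index leaf pages, this can be satisfied
-- by an index-only scan, with no table access at all:
SELECT sum(amount) FROM order_detail WHERE prod_id = 7;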

Now, in your example, what if the product department comes along and
wants to make a report with sales / product? What would be your formula
in that case?

And: what if I was just reading customer-data. Would the same formula
apply (= (2+N)*ST*1.05)?

But as I understand relational theory, such a question is completely
outside the scope of the theory. Seeing as it tries to divorce the
database logic from the practical implementation ...

The theory, indeed, does not say anything about buffer pools, but by
decoupling logic from implementation we leave the implementor (the DBMS)
free to do as it sees fit. As DBMS technology advances, we get faster
systems without having to change our programs.

When we design databases we can decouple logical planning from performance
considerations, which, you must agree, are two separate issues.

And you know it's been proven that Huffman coding is the most efficient
compression algorithm? (Actually, it isn't - it's been proven it can't
be improved upon, which isn't the same thing...). Can you improve on the
formula I've just given you? Given that if we could change the 1.05 to 1
then we can prove it can't be improved upon ... again - I've taken the
liberty of assuming that a MV FILE is equivalent to an entity if we
assume the relational designer has been thinking in an entity-attribute-
relation sort of way. My maths isn't good enough to prove it, but I
think it would be pretty easy to prove that accessing data as "one and
only one complete entity" at a time is the most efficient way.

I think that in a typical system your cache hit ratio would approach 90%
so that could mean 0.1 disk seeks.

best regards,
Lauri Pietarinen

#47Bob Badour
bbadour@golden.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Lauri Pietarinen" <lauri.pietarinen@atbusiness.com> wrote in message
news:3F94BCBB.7030001@atbusiness.com...

Anthony W. Youngman wrote:

In article <bmutga$jdk$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

So in your opinion, is the problem

1) SQL is so hard that the average programmer will not know how to use it
efficiently

Nope

or
2) Relational (or SQL-) DBMS'es are just too slow

Yes.

If 2) then why don't we get a bit more concrete. Could you give
an example of a query that in your experience would be too slow using
a standard SQL database (e.g. Oracle, or MySQL). We could then
actually try it out on some machine and compare. I suggest using
the customer-order-order_detail-product database

Okay. Give me a FORMULA that returns a time in seconds for your query.

Let's assume I want to print a statement of how many invoices were sent
to a customer, along with various details of those invoices. My invoice
file is indexed by company/month, and we can reasonably assume that the
time taken to produce the statement is infinitesimal compared to the
time taken to retrieve the invoice data from disk. For MV

T = (2 + N) * ST * 1.05

Where T is the time taken to produce the report, N is the number of
invoices, and ST is the hard disk seek time.

First of all, it is important to note that a key component of all modern
SQL-DBMS's is the buffer pool (or cache), meaning that in a reasonably
well-tuned database you get very few disk I/O's, even when *writing*
data into tables.

SQL-DBMS's are also very clever at using indexes, i.e. if they can find
all the necessary data in an index they will not even look at the table,
so to speak.

And, even presuming conservatively that there is no data in the cache,
depending on how the data is clustered you will get more than one row
per disk read (= 8K in most(?) systems).

So, assuming the (simplified) example

Customer(cust_id, .....)
Order(order_id, cust_id,...)
OrderDetail(order_id, prod_id, ...)
Product(prod_id,....)

If you created a clustering index on
Customer(cust_id)
Order(cust_id)
OrderDetail(order_id)

And presumed that the average row lengths are
customer = 1K
order = 500 bytes
orderDetail = 300 bytes

You would get, with 3 I/O's
- 8 customer rows
- 16 order rows
- 24 order detail rows (which would only apply to one order)

so, granted, that would result in one I/O per order which is more than
in your example.

I could now denormalise OrderDetail so that it contains cust_id also
and cluster by cust_id
(might cause you trouble down the road, if you can change the customer
of an order), in which case, with 3 I/O's I would get
- 8 customer rows
- 16 order rows
- 24 order detail rows (which would all apply to one customer)

Depending on block size, by clustering the three tables together, one might
get all of those rows for a single read, potentially improving on Wol's
numbers by a factor of eight or more for this one query. Of course, doing so
would increase the cost of a table scan on the customer table.

#48Lauri Pietarinen
lauri.pietarinen@atbusiness.com
In reply to: Bob Badour (#47)
Re: Dreaming About Redesigning SQL

Bob Badour wrote:

"Lauri Pietarinen" <lauri.pietarinen@atbusiness.com> wrote in message
news:3F94BCBB.7030001@atbusiness.com...

I could now denormalise OrderDetail so that it contains cust_id also
and cluster by cust_id
(might cause you trouble down the road, if you can change the customer
of an order), in which case, with 3 I/O's I would get
- 8 customer rows
- 16 order rows
- 24 order detail rows (which would all apply to one customer)

Depending on block size, by clustering the three tables together, one might
get all of those rows for a single read, potentially improving on Wol's
numbers by a factor of eight or more for this one query. Of course, doing so
would increase the cost of a table scan on the customer table.

Which DBMS'es support clustering of multiple tables except for Oracle?
Is this feature really used any more?
I thought it was more trouble than it's worth.

regards,
Lauri

#49Dann Corbit
DCorbit@connx.com
In reply to: Lauri Pietarinen (#48)
Re: Dreaming About Redesigning SQL

-----Original Message-----
From: Lauri Pietarinen [mailto:lauri.pietarinen@atbusiness.com]
Sent: Sunday, October 19, 2003 1:50 PM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] Dreaming About Redesigning SQL

Anthony W. Youngman wrote:

Well, as far as we MV'ers are concerned, performance IS a problem with
the relational approach. The attitude (as far as I can tell) with
relational is to hide the actual DB implementation from the programmers.
So it is a design "flaw" that it is extremely easy for a programmer to
do something stupid. And you need a DBA to try and protect the database
from the programmers!

As soon as a requirement for a database specifies extraction of the
maximum power from the box, it OUGHT to rule out all the current
relational databases. MV flattens it for performance. As an MV
programmer, I *KNOW* that I can find any thing I'm looking for (or find
out it doesn't exist) with just ONE disk seek. A relational programmer
has to ask the db "does this exist" and hope the db is optimised to be
able to return the result quickly. To quote the Pick FAQ "SQL optimises
the easy task of finding stuff in memory. Pick optimises the hard task
of getting it into memory in the first place".

So in your opinion, is the problem

1) SQL is so hard that the average programmer will not know how to use
it efficiently, or

This same average programmer will have the same difficulty with a pure,
orthogonal and mathematically rigorous language.

2) Relational (or SQL-) DBMS'es are just too slow

If 2) then why don't we get a bit more concrete. Could you
give an example of a query that in your experience would be
too slow using a standard SQL database (e.g. Oracle, or
MySQL). We could then actually try it out on some machine and
compare. I suggest using the
customer-order-order_detail-product database

If 1) I would like to hear some concrete examples.

I once worked on a telephone customer support system that had 5 million
lines of C++ and 5 million lines of SQL. It was too complex for the
average person to comprehend.

There are always going to be some queries that are too slow. If you put
a key on everything, data modifications will go into the toilet. If you
lack a key on a field required for a filtered query, you will have a
table scan. Hence, there is no free lunch, and there will have to be
compromises. However, due to the exponential expansion of compute
power, the efficiency of SQL is the least of our worries. See this
link; it makes for excellent reading:
http://www.kurzweilai.net/articles/art0134.html?printable=1


best regards,
Lauri Pietarinen


#50Hannu Krosing
hannu@tm.ee
In reply to: Anthony W. Youngman (#35)
Re: Dreaming About Redesigning SQL

Anthony W. Youngman wrote on Sunday, 19.10.2003 at 21:24:

As soon as a requirement for a database specifies extraction of the
maximum power from the box, it OUGHT to rule out all the current
relational databases. MV flattens it for performance. As an MV
programmer, I *KNOW* that I can find any thing I'm looking for (or find
out it doesn't exist) with just ONE disk seek.

Relational or not, this requires either an in-memory index or a perfect hash.

BTW, how do you find the oldest red elephant "with just one disk seek"?

as in SQL:

select * from elephants where colour = 'red' order by age desc limit 1;
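
For what it's worth, a single composite index (a sketch in PostgreSQL
syntax, with invented names) lets the planner answer this by walking
straight to the first matching entry, one index descent rather than a scan:

CREATE INDEX elephants_colour_age ON elephants (colour, age DESC);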

A relational programmer
has to ask the db "does this exist" and hope the db is optimised to be
able to return the result quickly. To quote the Pick FAQ "SQL optimises
the easy task of finding stuff in memory. Pick optimises the hard task
of getting it into memory in the first place".

SQL by itself optimises nothing: by definition it evaluates full cross
products and then compares all rows with predicates.

Some SQL implementations do optimise a little ;)

"Relational" is all about theory and proving things mathematically
correct. "MV" is all about engineering and getting the result.

Or perhaps just getting _the_ result ;)

getting some other result will probably need another MV database ;)

Unless you can use set theory to predict the future,

Isn't this what PostgreSQL's optimiser does?

--------------
Hannu

#51Yurgis Baykshtis
ybaykshtis@micropat.com
In reply to: Dann Corbit (#49)
Help!!! FreeSpaceMap hashtable out of memory.

Cygwin, 7.3.4
This thing is really KILLING us and our customers.

In pgerr.log these always go together:

WARNING: ShmemAlloc: out of memory
ERROR: FreeSpaceMap hashtable out of memory

These errors usually take place on INSERT statements like this one:

INSERT INTO params (param_id,map_id,param_key,param_value) VALUES
(0,858,'MAP_NAME','New Map')

params is just a simple table (no triggers or rules involved):
CREATE TABLE map_11.params
(
param_id int4 NOT NULL,
map_id int4 NOT NULL,
param_key varchar(1024),
param_value text,
CONSTRAINT params_pkey PRIMARY KEY (param_id)
) WITH OIDS;

The db server machine has 4GB of physical memory and more than half is
always available.
Number of connections is 64
Number of buffers was increased to 81920 (but this does not help)

Are there any parameters that can be configured to increase the available
shared memory size?

Any help is greatly appreciated.

#52Bob Badour
bbadour@golden.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Lauri Pietarinen" <lauri.pietarinen@atbusiness.com> wrote in message
news:bn3tve$qln$1@nyytiset.pp.htv.fi...

Bob Badour wrote:

"Lauri Pietarinen" <lauri.pietarinen@atbusiness.com> wrote in message
news:3F94BCBB.7030001@atbusiness.com...

I could now denormalise OrderDetail so that it contains cust_id also
and cluster by cust_id
(might cause you trouble down the road, if you can change the customer
of an order), in which case, with 3 I/O's I would get
- 8 customer rows
- 16 order rows
- 24 order detail rows (which would all apply to one customer)

Depending on block size, by clustering the three tables together, one
might get all of those rows for a single read, potentially improving on
Wol's numbers by a factor of eight or more for this one query. Of
course, doing so would increase the cost of a table scan on the
customer table.

Which DBMS'es support clustering of multiple tables except for Oracle?

I don't know. Why would it matter?

Is this feature really used any more?

If one has a hard performance requirement that only clustering can meet, one
will use it.

I thought it was more trouble than it's worth.

All physical structures will bias performance for some operations and
against others. In general, increasing the cost of customer scans will be
sufficiently unpleasant to make clustering customers with orders
undesirable. However, if one chooses to consider only one physical
arrangement and one operation, as Wol is wont to do, I observe we can
outperform his product by a factor of eight.

#53Lauri Pietarinen
lauri.pietarinen@atbusiness.com
In reply to: Bob Badour (#52)
Re: Dreaming About Redesigning SQL

Bob Badour wrote:

"Lauri Pietarinen" <lauri.pietarinen@atbusiness.com> wrote in message
news:bn3tve$qln$1@nyytiset.pp.htv.fi...

Bob Badour wrote:

"Lauri Pietarinen" <lauri.pietarinen@atbusiness.com> wrote in message
news:3F94BCBB.7030001@atbusiness.com...

I could now denormalise OrderDetail so that it contains cust_id also
and cluster by cust_id
(might cause you trouble down the road, if you can change the customer
of an order), in which case, with 3 I/O's I would get
- 8 customer rows
- 16 order rows
- 24 order detail rows (which would all apply to one customer)

Depending on block size, by clustering the three tables together, one
might get all of those rows for a single read, potentially improving on
Wol's numbers by a factor of eight or more for this one query. Of
course, doing so would increase the cost of a table scan on the
customer table.

Which DBMS'es support clustering of multiple tables except for Oracle?

I don't know. Why would it matter?

Just curious...

Is this feature really used any more?

If one has a hard performance requirement that only clustering can meet, one
will use it.

OK

I thought it was more trouble than it's worth.

All physical structures will bias performance for some operations and
against others. In general, increasing the cost of customer scans will be
sufficiently unpleasant to make clustering customers with orders
undesirable. However, if one chooses to consider only one physical
arrangement and one operation, as Wol is wont to do, I observe we can
outperform his product by a factor of eight.

OK, right...

Lauri

#54Anthony W. Youngman
thewolery@nospam.demon.co.uk
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

In article <3F94BCBB.7030001@atbusiness.com>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Okay. Give me a FORMULA that returns a time in seconds for your query.

Let's assume I want to print a statement of how many invoices were sent
to a customer, along with various details of those invoices. My invoice
file is indexed by company/month, and we can reasonably assume that the
time taken to produce the statement is infinitesimal compared to the
time taken to retrieve the invoice data from disk. For MV

T = (2 + N) * ST * 1.05

Where T is the time taken to produce the report, N is the number of
invoices, and ST is the hard disk seek time.

First of all, it is important to note that a key component of all modern
SQL-DBMS's is the buffer pool (or cache), meaning that in a reasonably
well-tuned database you get very few disk I/O's, even when *writing*
data into tables.

Fine. But MV *doesn't* *need* much of a cache. Let's assume both SQL and
MV have the same amount of RAM to cache in - i.e. *not* *much*. I did
say the spec said "extract maximum performance from the hardware
available".

You're assuming that you can throw hardware at the problem - fine, but
that's not always possible. You might have already maxed out the ram,
you might have a "huge" database, you might be sharing your db server
with other programs (BIND really likes to chew up every available drop
of ram, doesn't it :-).

I'm not saying that you shouldn't throw hardware at it, but what if you
can't?

SQL-DBMS's are also very clever at using indexes, i.e. if they can find
all the necessary data in an index they will not even look at the table,
so to speak.

Same with MV

And, even presuming conservatively that there is no data in the cache,
depending on how the data is clustered you will get more than one row
per disk read (= 8K in most(?) systems).

Same with MV

So, assuming the (simplified) example

Customer(cust_id, .....)
Order(order_id, cust_id,...)
OrderDetail(order_id, prod_id, ...)
Product(prod_id,....)

If you created a clustering index on
Customer(cust_id)
Order(cust_id)
OrderDetail(order_id)

And presumed that the average row lengths are
customer = 1K
order = 500 bytes
orderDetail = 300 bytes

You would get, with 3 I/O's
- 8 customer rows
- 16 order rows
- 24 order detail rows (which would only apply to one order)

so, granted, that would result in one I/O per order which is more than
in your example.

I could now denormalise OrderDetail so that it contains cust_id also
and cluster by cust_id
(might cause you trouble down the road, if you can change the customer
of an order), in which case, with 3 I/O's I would get
- 8 customer rows
- 16 order rows
- 24 order detail rows (which would all apply to one customer)

Now the amount of I/O's would depend on how many detail rows
we have per customer.

And, of course, because we are using sequential prefetch, we would be
getting more than one I/O block (8?, 16?) per disk seek, so it's a hard
comparison to make, but I suspect that it would be about equal to your
example.

Except my example was an *average* case, and yours is a *best* case. Oh,
and my data is still normalised - I haven't had to denormalise it! AND I
haven't run an optimiser over it :-)

Now, that was a *conservative* estimate, and we assumed that we did not
have any rows lying around in the (global!) cache. As the size of the
cache grows in proportion to the size of the total database, we can
assume less and less disk I/O.

You're relying on the hardware to bale you out :-) We can do the same!

Note also that the cache can be configured in many ways: you can put
different tables (or indexes) in different caches, and even change the
size of the cache on the fly (you might want a bigger cache during the
evening and night when your batch programs are running), so you can rig
your system to favour certain types of queries.

I haven't even gone into the topic of using thick (covering) indexes, so
that table access can be avoided entirely (= we read into memory only the
interesting columns).

Now, in your example, what if the product department comes along and
wants to make a report with sales / product? What would be your formula
in that case?

I'm not quite sure what you're trying to do. I'll assume you want a
report of all invoices which refer to a given product. Assuming I've got
the relevant indices defined, I can simply read a list of invoices from
the product code index, a second list of invoices from the month index,
and do an intersect of the two lists.

So again, T = (2+N) * ST * 1.05 where N is the number of invoices that
reference that product. And now ALL the invoice data has been retrieved
from disk to ram ...
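
The relational analogue of that two-list intersect, sketched in SQL with
invented table and column names: given separate indexes on prod_id and on
invoice_month, a planner such as PostgreSQL's can combine the two index
scans with a bitmap AND instead of scanning the table:

SELECT *
FROM invoice
WHERE prod_id = 1234
  AND invoice_month = '2003-10';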

And: what if I was just reading customer-data. Would the same formula
apply (= (2+N)*ST*1.05)?

Nope. If I understand you correctly, you want attributes that belong to
the entity "customer", not the entity "invoice". T = ST * 1.05. (By the
way, billing and/or invoice address (for example) are invoice
attributes, not company attributes.)

But as I understand relational theory, such a question is completely
outside the scope of the theory. Seeing as it tries to divorce the
database logic from the practical implementation ...

The theory, indeed, does not say anything about buffer pools, but by
decoupling logic from implementation we leave the implementor (the DBMS)
free to do as it sees fit. As DBMS technology advances, we get faster
systems without having to change our programs.

But with MV, if our database is too large for current technology, we
kick the shit out of relational for speed ...

Don't forget. You've already said that, if nothing is cached, my average
case exceeds your best. And my case is *already* assuming that the
system is seriously stressed and struggling ...

When we design databases we can decouple logical planning from performance
considerations, which, you must agree, are two separate issues.

And you know it's been proven that Huffman coding is the most efficient
compression algorithm? (Actually, it isn't - it's been proven it can't
be improved upon, which isn't the same thing...). Can you improve on the
formula I've just given you? Given that if we could change the 1.05 to 1
then we can prove it can't be improved upon ... again - I've taken the
liberty of assuming that a MV FILE is equivalent to an entity if we
assume the relational designer has been thinking in an entity-attribute-
relation sort of way. My maths isn't good enough to prove it, but I
think it would be pretty easy to prove that accessing data as "one and
only one complete entity" at a time is the most efficient way.

I think that in a typical system your cache hit ratio would approach 90%
so that could mean 0.1 disk seeks.

That improves our performance just as much as it improves yours. What
happens to your response time if you just DON'T HAVE the cache
available, for whatever reason?

I can't find the post now :-( but is Christopher reading this? You know
I compared that relational system on a twin Xeon 800 to an MV system
running on a P90? Christopher made the (reasonable in the circumstances)
assumption that the relational consultants must be crap, and the MV guy
a guru. Actually, I'd come to exactly the OPPOSITE conclusion. My MV
experience tells me that MV query was probably thrown together, by an
average programmer, in 30 seconds. On the other hand, those SQL
consultants had an axe to grind and a point to prove. They couldn't
afford to let this "old fashioned" system beat them. That SQL query
would have been optimised to within an inch of its life over weeks.
Don't forget how proud they were to beat this MV system! Yet with
hardware that was so much more powerful and a query that was heavily
optimised, they had great difficulty beating a query that was thrown
together in seconds by an average MV guy (or even just a luser!).

Don't forget. I said I am a database *engineer*. Engineers believe in
elegance, they believe in beauty. And when I look at relational, all I
see is the theorists pleading "power", "hardware", "brute force", to get
them out of trouble. And then all these people, who believe in maths
over reality, are surprised when I turn round and say I despise their
beliefs.

Note, I did NOT say I despise relational theory. I despise the belief
that it is the answer to life, the database universe, and everything
data related. (By the way, 6 times 9 DOES equal 42 :-)

best regards,
Lauri Pietarinen

Cheers,
Wol
--
Anthony W. Youngman - wol at thewolery dot demon dot co dot uk
Witches are curious by definition and inquisitive by nature. She moved in. "Let
me through. I'm a nosey person.", she said, employing both elbows.
Maskerade : (c) 1995 Terry Pratchett

#55Lauri Pietarinen
lauri.pietarinen@atbusiness.com
In reply to: Anthony W. Youngman (#54)
Re: Dreaming About Redesigning SQL

Anthony W. Youngman wrote:

In article <3F94BCBB.7030001@atbusiness.com>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Okay. Give me a FORMULA that returns a time in seconds for your query.

Let's assume I want to print a statement of how many invoices were sent
to a customer, along with various details of those invoices. My invoice
file is indexed by company/month, and we can reasonably assume that the
time taken to produce the statement is infinitesimal compared to the
time taken to retrieve the invoice data from disk. For MV

T = (2 + N) * ST * 1.05

Where T is the time taken to produce the report, N is the number of
invoices, and ST is the hard disk seek time.

First of all, it is important to note that a key component of all modern
SQL-DBMS's is the buffer pool (or cache), meaning that in a reasonably
well-tuned database you get very few disk I/O's, even when *writing*
data into tables.

Fine. But MV *doesn't* *need* much of a cache. Let's assume both SQL and
MV have the same amount of RAM to cache in - i.e. *not* *much*. I did
say the spec said "extract maximum performance from the hardware
available".

So what's wrong with getting a machine with lots of memory? How much
does 2G of memory for an Intel box cost nowadays? Is this some kind of
new ultimate sport, trying to get along with as little memory as
possible?

You're assuming that you can throw hardware at the problem - fine, but
that's not always possible. You might have already maxed out the ram,
you might have a "huge" database, you might be sharing your db server
with other programs (BIND really likes to chew up every available drop
of ram, doesn't it :-).

I'm not saying that you shouldn't throw hardware at it, but what if you
can't?

SQL-DBMS's also are very clever at using indexes, i.e. if they can
find all necessary data from an index they will not even look at the
table, so to speak.

Same with MV

And, even when presuming conservatively that there is no data in
cache, depending on how the data is clustered, you will get more than
one row per disk read (= 8K in most(?) systems).

Same with MV

I could now denormalise OrderDetail so that it contains cust_id also
and cluster by cust_id (might cause you trouble down the road, if you
can change the customer of an order), in which case, with 3 I/O's I
would get
- 8 customer rows
- 16 order rows
- 24 order detail rows (which would all apply to one customer)

Now the amount of I/O's would depend on how many detail rows we have
per customer.

And, of course, because we are using sequential prefetch, we would be
getting more than one I/O block (8?, 16?) per disk seek, so it's a
hard comparison to make but I suspect that it would be about equal to
your example.
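As a rough counterpart to the formula above, the clustered layout just
described can be costed the same way (the page count, seek and
transfer times here are illustrative guesses, not measurements):

    # With customer, order and order_detail rows clustered on cust_id,
    # one customer's data sits in a few adjacent 8K pages: roughly one
    # seek plus sequential page transfers, instead of one seek per row.
    def clustered_statement_time(n_pages, seek_time_s, xfer_per_page=0.0005):
        return seek_time_s + n_pages * xfer_per_page

    # Lauri's example: 3 I/O's cover 8 customer, 16 order and 24
    # order detail rows.
    print(clustered_statement_time(3, 0.010))  # ~0.012 s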

Except my example was an *average* case, and yours is a *best* case. Oh,
and my data is still normalised - I haven't had to denormalise it! AND I
haven't run an optimiser over it :-)

Are you hiding your optimiser behind the curtain? ;-)

Well, if it is normalised, how easy is it for you to change the
customer_id of an order? Anyway, if we stick to your example and even
if we don't normalise using e.g. clustering features of Oracle, as Bob
pointed out, we are getting at most the same number of I/O's. So,
answer to your question: our formula is at least as good as yours.

Now, that was a *conservative* estimate, and we assumed that we did not have
any rows lying around in the (global!) cache. As the size of the cache grows in
proportion to the size of the total database we can assume less and less disk
I/O.

You're relying on the hardware to bale you out :-) We can do the same!

Well why don't you?

Note also that the cache can be configured many ways, you can put different
tables (or indexes) in different caches, and even change the size of the cache
on the fly (you might want a bigger cache during evening and night when your
batch programs are running) so you can rig your system to favour certain
types of queries.

I haven't even gone into the topic of using thick indexes so table
access can be totally avoided (= we are reading into memory only the
interesting columns).
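A minimal sketch of the "thick" (covering) index idea, using SQLite
purely as an illustration (the table and index names are assumed; the
point is that the plan reads the index alone and never visits the
table):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE order_detail "
                 "(product_id TEXT, qty INT, unit_price INT)")
    # The index carries every column the query touches.
    conn.execute("CREATE INDEX thick "
                 "ON order_detail (product_id, qty, unit_price)")
    plan = conn.execute(
        "EXPLAIN QUERY PLAN SELECT product_id, SUM(qty * unit_price) "
        "FROM order_detail GROUP BY product_id").fetchall()
    print(plan)  # detail should read 'USING COVERING INDEX thick'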

Now, in your example, what if the product department comes along and
wants to make a report with sales / product? What would be your formula
in that case?

I'm not quite sure what you're trying to do. I'll assume you want a
report of all invoices which refer to a given product. Assuming I've got
the relevant indices defined, I can simply read a list of invoices from
the product code index, a second list of invoices from the month index,
and do an intersect of the two lists.
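That two-index plan, sketched with plain sets (the invoice ids and
index contents are invented for the example):

    # Invoice ids as listed under a product-code index and a month
    # index; the report wants the invoices that appear in both.
    by_product = {"HAMMER": {101, 105, 190, 242}}
    by_month = {"2003-10": {105, 150, 190}}

    hits = by_product["HAMMER"] & by_month["2003-10"]
    print(sorted(hits))  # [105, 190]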

I want a list with all products with corresponding total sales, read
from order detail e.g.

Hammer 10000$
Nail 5000$
Screw 1200$

How many disk reads (or head movements)?
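That report is a single aggregate query. A runnable SQLite sketch,
with a schema and sample rows assumed so the totals come out as above:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE order_detail (product_id TEXT, qty INT, unit_price INT);
        INSERT INTO order_detail VALUES
            ('Hammer', 100, 100), ('Nail', 500, 10), ('Screw', 120, 10);
    """)
    query = ("SELECT product_id, SUM(qty * unit_price) AS total "
             "FROM order_detail GROUP BY product_id ORDER BY total DESC")
    for product, total in conn.execute(query):
        print(product, total)  # Hammer 10000, Nail 5000, Screw 1200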

So again, T = (2+N) * ST * 1.05 where N is the number of invoices that
reference that product. And now ALL the invoice data has been retrieved
from disk to ram ...

And: what if I was just reading customer-data. Would the same formula
apply (= (2+N)*ST*1.05)?

Nope. If I understand you correctly, you want attributes that belong to
the entity "customer", not the entity "invoice". T = ST * 1.05. (By the
way, billing and/or invoice address (for example) are invoice
attributes, not company attributes.)

No, I want you to give me a list of all your customers. How many disk
reads?


But as I understand relational theory, such a question is completely
outside the scope of the theory. Seeing as it tries to divorce the
database logic from the practical implementation ...

The theory, indeed, does not say anything about buffer pools, but by
decoupling logic from implementation we leave the implementor (DBMS)
to do as it feels fit to do.

As DBMS technology advances, we get faster systems without having to
change our programs.

But with MV, if our database is too large for current technology, we
kick the shit out of relational for speed ...

Don't forget. You've already said that, if nothing is cached, my average
case exceeds your best. And my case is *already* assuming that the
system is seriously stressed and struggling ...

When we design databases we can decouple logical planning from performance
considerations, which, you must agree, are two separate issues.

And you know it's been proven that Huffman coding is the most efficient
compression algorithm? (Actually, it isn't - it's been proven it can't
be improved upon, which isn't the same thing...). Can you improve on the
formula I've just given you? Given that if we could change the 1.05 to 1
then we can prove it can't be improved upon ... again - I've taken the
liberty of assuming that a MV FILE is equivalent to an entity if we
assume the relational designer has been thinking in an entity-attribute-
relation sort of way. My maths isn't good enough to prove it, but I
think it would be pretty easy to prove that accessing data as "one and
only one complete entity" at a time is the most efficient way.

I think that in a typical system your cache hit ratio would approach 90%
so that could mean 0.1 disk seeks.

That improves our performance just as much as improves yours. What
happens to your response time if you just DON'T HAVE the cache
available, for whatever reason?

I can't find the post now :-( but is Christopher reading this? You know
I compared that relational system on a twin Xeon 800, to an MV system
running on a P90? Christopher made the (reasonable in the circumstances)
assumption that the relational consultants must be crap, and the MV guy
a guru. Actually, I'd come to exactly the OPPOSITE conclusion. My MV
experience tells me that MV query was probably thrown together, by an
average programmer, in 30 seconds. On the other hand, those SQL
consultants had an axe to grind and a point to prove. They couldn't
afford to let this "old fashioned" system beat them. That SQL query
would have been optimised to within an inch of its life over weeks.
Don't forget how proud they were to beat this MV system! Yet with
hardware that was so much more powerful and a query that was heavily
optimised, they had great difficulty beating a query that was thrown
together in seconds by an average MV guy (or even just a luser!).

Don't forget. I said I am a database *engineer*. Engineers believe in
elegance, they believe in beauty. And when I look at relational, all I
see is the theorists pleading "power", "hardware", "brute force", to get
them out of trouble. And then all these people, who believe in maths
over reality, are surprised when I turn round and say I despise their
beliefs.

Note, I did NOT say I despise relational theory. I despise the belief
that it is the answer to life, the database universe, and everything
data related. (By the way, 6 times 9 DOES equal 42 :-)

best regards,
Lauri Pietarinen

Cheers,
Wol

#56Bob Badour
bbadour@golden.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Lauri Pietarinen" <lauri.pietarinen@atbusiness.com> wrote in message
news:bn4cca$dj0$1@nyytiset.pp.htv.fi...

Anthony W. Youngman wrote:

In article <3F94BCBB.7030001@atbusiness.com>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Okay. Give me a FORMULA that returns a time in seconds for your query.

Let's assume I want to print a statement of how many invoices were sent
to a customer, along with various details of those invoices. My invoice
file is indexed by company/month, and we can reasonably assume that the
time taken to produce the statement is infinitesimal compared to the
time taken to retrieve the invoice data from disk. For MV

T = (2 + N) * ST * 1.05

Where T is the time taken to produce the report, N is the number of
invoices, and ST is the hard disk seek time.

First of all it is important to note that an important component of
all modern SQL-DBMS's is the buffer pool (or cache), meaning that in a
reasonably well tuned database you get very few disk I/O's, even when
*writing* data into tables.

Fine. But MV *doesn't* *need* much of a cache. Let's assume both SQL and
MV have the same amount of RAM to cache in - i.e. *not* *much*. I did
say the spec said "extract maximum performance from the hardware
available".

So what's wrong with getting a machine with lots of memory? How much
does 2G of memory for an Intel-box cost nowadays? Is this some kind of
new ultimate sport, trying to get along with as little memory as
possible?

You're assuming that you can throw hardware at the problem - fine, but
that's not always possible. You might have already maxed out the ram,
you might have a "huge" database, you might be sharing your db server
with other programs (BIND really likes to chew up every available drop
of ram, doesn't it :-).

I'm not saying that you shouldn't throw hardware at it, but what if you
can't?

SQL-DBMS's also are very clever at using indexes, i.e. if they can
find all necessary data from an index they will not even look at the
table, so to speak.

Same with MV

And, even when presuming conservatively that there is no data in
cache, depending on how the data is clustered, you will get more than
one row per disk read (= 8K in most(?) systems).

Same with MV

I could now denormalise OrderDetail so that it contains cust_id also
and cluster by cust_id (might cause you trouble down the road, if you
can change the customer of an order), in which case, with 3 I/O's I
would get
- 8 customer rows
- 16 order rows
- 24 order detail rows (which would all apply to one customer)

Now the amount of I/O's would depend on how many detail rows we have
per customer.

And, of course, because we are using sequential prefetch, we would be
getting more than one I/O block (8?, 16?) per disk seek, so it's a
hard comparison to make but I suspect that it would be about equal to
your example.

Except my example was an *average* case, and yours is a *best* case. Oh,
and my data is still normalised - I haven't had to denormalise it! AND I
haven't run an optimiser over it :-)

Are you hiding your optimiser behind the curtain? ;-)

Well, if it is normalised, how easy is it for you to change the
customer_id of an order? Anyway, if we stick to your example and even
if we don't normalise using e.g. clustering features of Oracle, as Bob
pointed out, we are getting at most the same number of I/O's. So,
answer to your question: our formula is at least as good as yours.

Actually, Bob pointed out we are getting at most 12.5% as many disk head
movements or I/O's. I'll take an 87.5% improvement any day.

Now, that was a *conservative* estimate, and we assumed that we did
not have any rows lying around in the (global!) cache. As the size of
the cache grows in proportion to the size of the total database we can
assume less and less disk I/O.

You're relying on the hardware to bale you out :-) We can do the same!

Well why don't you?

We achieved 8 times the performance with exactly the same hardware. What the
hell is this idiot talking about us relying on hardware? He is a moron. You
will do everyone a favour if you just bounce him off the bottom of your
killfile.

Note also that the cache can be configured many ways, you can put
different tables (or indexes) in different caches, and even change the
size of the cache on the fly (you might want a bigger cache during
evening and night when your batch programs are running) so you can rig
your system to favour certain types of queries.

I haven't even gone into the topic of using thick indexes so table
access can be totally avoided (= we are reading into memory only the
interesting columns).

Now, in your example, what if the product department comes along and
wants to make a report with sales / product? What would be your formula
in that case?

I'm not quite sure what you're trying to do. I'll assume you want a
report of all invoices which refer to a given product. Assuming I've got
the relevant indices defined, I can simply read a list of invoices from
the product code index, a second list of invoices from the month index,
and do an intersect of the two lists.

I want a list with all products with corresponding total sales, read
from order detail e.g.

Hammer 10000$
Nail 5000$
Screw 1200$

How many disk reads (or head movements)?

So again, T = (2+N) * ST * 1.05 where N is the number of invoices that
reference that product. And now ALL the invoice data has been retrieved
from disk to ram ...

And: what if I was just reading customer-data. Would the same formula
apply (= (2+N)*ST*1.05)?

Nope. If I understand you correctly, you want attributes that belong to
the entity "customer", not the entity "invoice". T = ST * 1.05. (By the
way, billing and/or invoice address (for example) are invoice
attributes, not company attributes.)

No, I want you to give me a list of all your customers. How many disk
reads?

But as I understand relational theory, such a question is completely
outside the scope of the theory. Seeing as it tries to divorce the
database logic from the practical implementation ...

The theory, indeed, does not say anything about buffer pools, but by
decoupling logic from implementation we leave the implementor (DBMS)
to do as it feels fit to do.

As DBMS technology advances, we get faster systems without having to
change our programs.

But with MV, if our database is too large for current technology, we
kick the shit out of relational for speed ...

This idiot is a fucking fool. He makes untrue assertions and actually
believes them.

Don't forget. You've already said that, if nothing is cached, my average
case exceeds your best.

No, actually, we did not. We already said that, assuming identical hardware
and caching, our average case exceeds his best case by a factor of eight.

If you are going to engage these ridiculously ignorant and stupid pick
zealots, you must do a better job of identifying the horseshit they spout
with just about every word.

And my case is *already* assuming that the system is seriously
stressed and struggling ...

When we design databases we can decouple logical planning from
performance considerations, which, you must agree, are two separate
issues.

And you know it's been proven that Huffman coding is the most
efficient compression algorithm? (Actually, it isn't - it's been
proven it can't be improved upon, which isn't the same thing...). Can
you improve on the formula I've just given you?

Yes, by a factor of eight as already demonstrated.

Given that if we could change the 1.05 to 1
then we can prove it can't be improved upon ...

Okay, let him prove it in spite of the factor of 8 improvement we already
achieved.

[remainder of Wol's unthinking, blind horseshit snipped]

#57Noname
michael@preece.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Bob Badour" <bbadour@golden.net> wrote in message news:<0J-dncfSRf9EJQiiU-KYuA@golden.net>...
[snip]

Actually, Bob pointed out ...

[snip]

Why don't you go and bang your heads together Bob.

#58Josh Berkus
josh@agliodbs.com
In reply to: Noname (#57)
Re: Dreaming About Redesigning SQL

Wol,

I think one MAJOR problem is that most (if not all) MV practitioners are
not formally qualified in computing ...

<snip>

"Relational" is all about theory and proving things mathematically
correct. "MV" is all about engineering and getting the result. And if
that means pinching all the best ideas we can find from relational, then
we're engineers - of course we'll do it :-)

So what you're saying is that you use MV databases because you find them
easier to understand and set up than relational databases. Sure. For that
matter, spreadsheets are easier to understand and set up than relational
databases -- or MV databases, for that matter.

I've no problem with the idea that MV databases are easier for the neophyte to
understand. But you seem to be making this bizarre logical leap that "easier
to understand without training" == "technically superior".

And I noticed that you have completely backed off from your assertion that
there was some kind of MV database theory without ever addressing it.

As soon as a requirement for a database specifies extraction of the
maximum power from the box, it OUGHT to rule out all the current
relational databases. MV flattens it for performance. As an MV
programmer, I *KNOW* that I can find any thing I'm looking for (or find
out it doesn't exist) with just ONE disk seek. A relational programmer
has to ask the db "does this exist" and hope the db is optimised to be
able to return the result quickly.

Well, frankly, no. You're confusing your floundering because you find it
hard to grasp relational database design and/or have a badly designed
database with a performance problem. There are several people in our
community running large genetics databases on PostgreSQL, using good
relational design, and doing well with them.

To quote the Pick FAQ "SQL optimises
the easy task of finding stuff in memory. Pick optimises the hard task
of getting it into memory in the first place".

And hope to the gods that nobody pulls the power cord.

"Relational" is all about theory and proving things mathematically
correct. "MV" is all about engineering and getting the result. And if
that means pinching all the best ideas we can find from relational, then
we're engineers - of course we'll do it :-)

"Relational" is all about preserving the *long-term* integrity, utility, and
accessibility of your data. "MV" is all about getting an expedient result
immediately and to heck with the future.

Read some of E.F. Codd's original papers ... he was not concerned with
expanding some obscure batch of mathematics, he was concerned with the
problems of data that wasn't clean, or wasn't the same when you queried it
twice, or data with rules that changed erratically over time, and wasn't
portable at all. These are real, "engineering" problems that needed solving
and relational theory solved them.

. Unless you can use set theory to predict
the future, relational has nothing to do with science ...

I wasn't aware that clairvoyance was a tenet of the scientific method.
Science is about reproducing reliable results, which is also what relational
theory is about.

To conclude: Your entire advocacy of Pick can be summed up as "I like Pick
and find it easy to program, therefore it is superior." Well, Wol, that
makes it superior for *you* but not for the rest of us. If you want to use
Pick and not PostgreSQL, go for it; but if you barge in here and try to
convince us that Pick is superior based strictly on your say-so, you've
become one of the crazy preachers on Sproul Plaza.

--
Josh Berkus
Aglio Database Solutions
San Francisco

#59Sailesh Krishnamurthy
sailesh@cs.berkeley.edu
In reply to: Josh Berkus (#58)
Re: Dreaming About Redesigning SQL

"Josh" == Josh Berkus <josh@agliodbs.com> writes:

"Relational" is all about theory and proving things
mathematically correct. "MV" is all about engineering and
getting the result. And if that means pinching all the best
ideas we can find from relational, then we're engineers - of
course we'll do it :-)

Josh> "Relational" is all about preserving the *long-term*
Josh> integrity, utility, and accessibility of your data. "MV" is
Josh> all about getting an expedient result immediately and to
Josh> heck with the future.

To emphasize the *long-term* - relational is all about separating
physical database representation from a logical view of the data.

Josh, I'll be happy to meet at Intermezzo - after Nov 24 perhaps. I
have a conference deadline .. the papers that we write "just to secure
funding" - your tax dollars at work. Long live the NSF !

BTW, I'm not necessarily that far from your views. There is, however,
more to an XML database than just storing data - relational databases
do just fine there. The tools that are used to build database systems
have other uses - for instance XML message brokers.

--
Pip-pip
Sailesh
http://www.cs.berkeley.edu/~sailesh

#60Anthony W. Youngman
thewolery@nospam.demon.co.uk
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

In article <bn4cca$dj0$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Anthony W. Youngman wrote:

Fine. But MV *doesn't* *need* much of a cache. Let's assume both SQL and
MV have the same amount of RAM to cache in - i.e. *not* *much*. I did
say the spec said "extract maximum performance from the hardware
available".

So what's wrong with getting a machine with lots of memory? How much
does 2G of memory for an Intel-box cost nowadays? Is this some kind of
new ultimate sport, trying to get along with as little memory as
possible?

I presume you didn't read the bit below ... what if you have SEVERAL
tables, and EACH of them is a gigabyte or two in size?

If an engineer has a problem, throwing brute force at it is rarely the
solution. Let's be topical (near enough) and look at the Titanic (seeing
as there was this film recently). If they'd foreseen the problem, they
could have thrown brute force at it and doubled the thickness of the
steel plate. Except she would have then sunk when they launched her,
before she even had a chance to hit the iceberg. Or look at aviation -
especially in the early years. They had gliders that could fly, and they
had engines that could easily provide the power to get a glider
airborne. The problem was, every time they increased the power of the
engine they got *further* *away* from the possibility of powered flight,
because the increased power came at the price of increased weight.

You're welcome to live in your mathematical world where power can be
gained for no cost, but that doesn't work in the real world. And the
cost isn't necessarily dollars. Like in the aircraft example, the cost
could be a case of "sorry, technology ain't that advanced yet mate!"

You're assuming that you can throw hardware at the problem - fine, but
that's not always possible. You might have already maxed out the ram,
you might have a "huge" database, you might be sharing your db server
with other programs (BIND really likes to chew up every available drop
of ram, doesn't it :-).

I'm not saying that you shouldn't throw hardware at it, but what if you
can't?

Except my example was an *average* case, and yours is a *best* case. Oh,
and my data is still normalised - I haven't had to denormalise it! AND I
haven't run an optimiser over it :-)

Are you hiding your optimiser behind the curtain? ;-)

Well, if you include getting optimisation for free because "that's the
way things work", maybe I am ;-)

Well, if it is normalised, how easy is it for you to change the
customer_id of an order? Anyway,

Incredibly easy. Just update the "customer_id" field of the invoice
record. A single change to a single "row".
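In SQL terms the change described here is one statement; a sketch with
an assumed schema:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE invoice "
                 "(invoice_no INTEGER PRIMARY KEY, customer_id TEXT)")
    conn.execute("INSERT INTO invoice VALUES (10305, 'CUST001')")
    # One row, one column: the logical change is a single UPDATE,
    # whatever the storage engine then does physically.
    conn.execute("UPDATE invoice SET customer_id = 'CUST042' "
                 "WHERE invoice_no = 10305")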

if we stick to your example and even if we don't normalise using e.g.
clustering features of Oracle, as Bob pointed out, we are getting at
most the same number of I/O's. So, answer to your question: our
formula is at least as good as yours.

Except I think Bob said we could "optimise to favour *certain*
transactions". I think actually ANY transaction benefits. You're relying
on stuff that's outwith your theory, we're relying on stuff that's
inherent to our model.

Now, that was a *conservative* estimate, and we assumed that we did
not have any rows lying around in the (global!) cache. As the size of
the cache grows in proportion to the size of the total database we can
assume less and less disk I/O.

You're relying on the hardware to bale you out :-) We can do the same!

Well why don't you?

We let the hardware help us out if it can. There's a big difference.
If you can't get the hardware, you're stuffed. We don't need it, so
while we may have a hard time of it, it's nowhere near as bad for us.

And again, relational separates the physical from the logical. You're
being hypocritical if you call upon the physical representation to help
out with the (speed of the) logical presentation.

Note also that the cache can be configured many ways, you can put different
tables (or indexes) in different caches, and even change the size of the cache
on the fly (you might want a bigger cache during evening and night when your
batch programs are running) so you can rig your system to favour certain
types of queries.

I haven't even gone into the topic of using thick indexes so table
access can be totally avoided (= we are reading into memory only the
interesting columns).

Now, in your example, what if the product department comes along and
wants to make a report with sales / product? What would be your formula
in that case?

I'm not quite sure what you're trying to do. I'll assume you want a
report of all invoices which refer to a given product. Assuming I've got
the relevant indices defined, I can simply read a list of invoices from
the product code index, a second list of invoices from the month index,
and do an intersect of the two lists.

I want a list with all products with corresponding total sales, read
from order detail e.g.

Hammer 10000$
Nail 5000$
Screw 1200$

How many disk reads (or head movements)?

Actually, probably the same as you here. If we're indexed on order
detail. If Hammer appears in N invoices, then T = (1+N) * ST * 1.05 for
hammers, and the same for all the other products.

Theory favours us, in that if a product appears X times in one invoice,
that's one read for us and X for you, but hardware will probably help
you more than us (that is, assuming thrashing cuts in) in that you stand
a marginally higher chance of getting multiple instances of a product in
any given read.

So again, T = (2+N) * ST * 1.05 where N is the number of invoices that
reference that product. And now ALL the invoice data has been retrieved
from disk to ram ...

And: what if I was just reading customer-data. Would the same formula
apply (= (2+N)*ST*1.05)?

Nope. If I understand you correctly, you want attributes that belong to
the entity "customer", not the entity "invoice". T = ST * 1.05. (By the
way, billing and/or invoice address (for example) are invoice
attributes, not company attributes.)

No, I want you to give me a list of all your customers. How many disk
reads?

T = N * 1.05 where N is the number of customers. What do you want to
know about those customers? Address? Phone number*s*? Anything else?
That's *all* at no extra cost.

But as I understand relational theory, such a question is completely
outside the scope of the theory. Seeing as it tries to divorce the
database logic from the practical implementation ...

The theory, indeed, does not say anything about buffer pools, but by
decoupling logic from implementation we leave the implementor (DBMS)
to do as it feels fit to do.

As DBMS technology advances, we get faster systems without having to
change our programs.

But with MV, if our database is too large for current technology, we
kick the shit out of relational for speed ...

Don't forget. You've already said that, if nothing is cached, my average
case exceeds your best. And my case is *already* assuming that the
system is seriously stressed and struggling ...

When we design databases we can decouple logical planning from performance
considerations, which, you must agree, are two separate issues.

Yes. BUT what's the point of having a database that is logically
perfect, and whose performance is slow to the point of being unusable?

Don't forget - in practice MultiValue ends up with a database that is
*inherently* optimised such that it almost invariably outperforms an
equivalent SQL database, AND we don't normally have DBAs to help us
achieve that nirvana ...

I can't find the post now :-( but is Christopher reading this? You know
I compared that relational system on a twin Xeon 800, to an MV system
running on a P90? Christopher made the (reasonable in the circumstances)
assumption that the relational consultants must be crap, and the MV guy
a guru. Actually, I'd come to exactly the OPPOSITE conclusion. My MV
experience tells me that MV query was probably thrown together, by an
average programmer, in 30 seconds. On the other hand, those SQL
consultants had an axe to grind and a point to prove. They couldn't
afford to let this "old fashioned" system beat them. That SQL query
would have been optimised to within an inch of its life over weeks.
Don't forget how proud they were to beat this MV system! Yet with
hardware that was so much more powerful and a query that was heavily
optimised, they had great difficulty beating a query that was thrown
together in seconds by an average MV guy (or even just a luser!).

Don't forget. I said I am a database *engineer*. Engineers believe in
elegance, they believe in beauty. And when I look at relational, all I
see is the theorists pleading "power", "hardware", "brute force", to get
them out of trouble.

You said that logical planning and performance are separate issues. And
I wouldn't expect you to address the above example in a discussion of
relational, because performance is irrelevant to relational.

But surely, the fact that I am SUPREMELY CONFIDENT that I can get
superior performance from inferior hardware should give you pause for
thought that maybe, just maybe, the relational model is flawed from an
engineer's or scientist's viewpoint?

From the mathematician's (or logician's) viewpoint I agree it's
flawless. But that's true of plenty of broken scientific theories...

Cheers,
Wol
--
Anthony W. Youngman - wol at thewolery dot demon dot co dot uk
Witches are curious by definition and inquisitive by nature. She moved in. "Let
me through. I'm a nosey person.", she said, employing both elbows.
Maskerade : (c) 1995 Terry Pratchett

#61Lauri Pietarinen
lauri.pietarinen@atbusiness.com
In reply to: Anthony W. Youngman (#60)
Re: Dreaming About Redesigning SQL

Anthony W. Youngman wrote:

In article <bn4cca$dj0$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Anthony W. Youngman wrote:

Fine. But MV *doesn't* *need* much of a cache. Let's assume both SQL and
MV have the same amount of RAM to cache in - i.e. *not* *much*. I did
say the spec said "extract maximum performance from the hardware
available".

So what's wrong with getting a machine with lots of memory? How much
does 2G of memory for an Intel-box cost nowadays? Is this some kind of
new ultimate sport, trying to get along with as little memory as
possible?

I presume you didn't read the bit below ... what if you have SEVERAL
tables, and EACH of them is a gigabyte or two in size?

OK, I get your point.

Well, if it is normalised, how easy is it for you to change the
customer_id of an order? Anyway,

Incredibly easy. Just update the "customer_id" field of the invoice
record. A single change to a single "row".

And I presume the system will automatically move all related stuff
(order details etc.) into
the same block as the new customer? How long will that take? What if
there is no room for it there?

if we stick to your example and even if we don't normalise using e.g.
clustering features of Oracle, as Bob pointed out, we are getting at
most the same number of I/O's. So, answer to your question: our
formula is at least as good as yours.

Except I think Bob said we could "optimise to favour *certain*
transactions". I think actually ANY transaction benefits. You're relying
on stuff that's outwith your theory, we're relying on stuff that's
inherent to our model.

That certainly is not true. The theory says NOTHING about how data
should be arranged on disk. You are talking about how modern
SQL-databases behave. The DBMS is at liberty to do whatever it pleases
with the data, even save it in a PICK database. Hey, wadda you think?
Would that be a good idea? We get to keep our SQL but with the speed
of PICK ;-)

Now, that was a *conservative* estimate, and we assumed that we did
not have any rows lying around in the (global!) cache. As the size of
the cache grows in proportion to the size of the total database we can
assume less and less disk I/O.

You're relying on the hardware to bale you out :-) We can do the same!

Well why don't you?

We let the hardware help us out if it can. There's a big difference.
If you can't get the hardware, you're stuffed. We don't need it, so
while we may have a hard time of it, it's nowhere near as bad for us.

And again, relational separates the physical from the logical. You're
being hypocritical if you call upon the physical representation to help
out with the (speed of the) logical presentation.

My goodness, no I'm not! It's the same as claiming that if you have a
drawing for a house, you have to make that house out of paper?!?

I want a list with all products with corresponding total sales, read
from order detail e.g.

Hammer 10000$
Nail 5000$
Screw 1200$

How many disk reads (or head movements)?

Actually, probably the same as you here.

If we're indexed on order
detail. If Hammer appears in N invoices, then T = (1+N) * ST * 1.05 for
hammers, and the same for all the other products.

Theory favours us, in that if a product appears X times in one invoice,
that's one read for us and X for you, but hardware will probably help
you more than us (that is, assuming thrashing cuts in) in that you stand
a marginally higher chance of getting multiple instances of a product in
any given read.

So for each product you get T = (1+N) * ST * 1.05.

Now, for our SQL-DBMS, presuming that we build indexes for detail and
product:

order_detail(product_id, qty, unit_price) = 20 bytes/row
product(product_id, product_name) = 50 bytes/row

With 2 disk reads I would get
8K/20 = 400 order detail rows and
8K/50 = 160 product rows

Since all rows are in product_id order, no need for random disk reads so
T = 1 + N/400 + P/160 (N=number of details, P=number of products)
for ALL products and details.

And, because of sequential prefetch, we probably would not have to wait
for I/O's at all.

Really, however you calculate it, it is an order of magnitude less
than your alternative.

And please don't tell me that using indexes is not fair or not in the
spirit of the relational model ;-)
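Putting the two estimates side by side with assumed volumes (every
figure below is illustrative, not a measurement):

    ST = 0.010          # assumed 10 ms per random disk read
    N, P = 10_000, 100  # assumed totals: order detail rows, products

    # Wol's formula summed over all products: roughly one seek per
    # detail row.
    t_mv = (P + N) * ST * 1.05
    # Lauri's estimate: packed 8K pages read mostly sequentially.
    t_sql = (1 + N / 400 + P / 160) * ST

    print(round(t_mv, 2), round(t_sql, 3))  # ~106.05 s vs ~0.266 s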

And: what if I was just reading customer-data. Would the same formula
apply (= (2+N)*ST*1.05)?

Nope. If I understand you correctly, you want attributes that belong to
the entity "customer", not the entity "invoice". T = ST * 1.05. (By the
way, billing and/or invoice address (for example) are invoice
attributes, not company attributes.)

No, I want you to give me a list of all your customers. How many disk
reads?

T = N * 1.05 where N is the number of customers. What do you want to
know about those customers? Address? Phone number*s*? Anything else?
That's *all* at no extra cost.

Well, no thanks. I just wanted their names this time. The relational
alternative, with an index on customer_name, would be again an order
of magnitude less disk reads.
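The same page arithmetic applied to the name listing (row size and
customer count are assumptions):

    CUSTOMERS = 10_000
    NAME_BYTES = 20  # assumed average size of a name index entry
    PAGE = 8192

    print(CUSTOMERS * NAME_BYTES / PAGE)  # ~24 index pages to scan,
                                          # versus one seek per customer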

But as I understand relational theory, such a question is completely
outside the scope of the theory. Seeing as it tries to divorce the
database logic from the practical implementation ...

The theory, indeed, does not say anything about buffer pools, but by
decoupling logic from implementation we leave the implementor (DBMS)
to do as it feels fit to do.

As DBMS technology advances, we get faster systems without having to
change our programs.

But with MV, if our database is too large for current technology, we
kick the shit out of relational for speed ...

What is "too large"?

Don't forget. You've already said that, if nothing is cached, my average
case exceeds your best. And my case is *already* assuming that the
system is seriously stressed and struggling ...

It does?

When we design databases we can decouple logical planning from performance
considerations, which, you must agree, are two separate issues.

Yes. BUT what's the point of having a database that is logically
perfect, and whose performance is slow to the point of being unusable?

Don't forget - in practice MultiValue ends up with a database that is
*inherently* optimised such that it almost invariably outperforms an
equivalent SQL database, AND we don't normally have DBAs to help us
achieve that nirvana ...

Frankly, it may well be that PICK systems run faster and cheaper than
relational ones, but certainly not for the reasons you state.

I can't find the post now :-( but is Christopher reading this? You know
I compared that relational system on a twin Xeon 800, to an MV system
running on a P90? Christopher made the (reasonable in the circumstances)
assumption that the relational consultants must be crap, and the MV guy
a guru. Actually, I'd come to exactly the OPPOSITE conclusion. My MV
experience tells me that MV query was probably thrown together, by an
average programmer, in 30 seconds. On the other hand, those SQL
consultants had an axe to grind and a point to prove. They couldn't
afford to let this "old fashioned" system beat them. That SQL query
would have been optimised to within an inch of its life over weeks.
Don't forget how proud they were to beat this MV system! Yet with
hardware that was so much more powerful and a query that was heavily
optimised, they had great difficulty beating a query that was thrown
together in seconds by an average MV guy (or even just a luser!).

Don't forget. I said I am a database *engineer*. Engineers believe in
elegance, they believe in beauty. And when I look at relational, all I
see is the theorists pleading "power", "hardware", "brute force", to get
them out of trouble.

You said that logical planning and performance are separate issues. And
I wouldn't expect you to address the above example in a discussion of
relational, because performance is irrelevant to relational.

I would have to know a lot more details to address it properly.
Performance is irrelevant to the model. It's like E=mc**2. Nice theory
and it actually works. But to get performance out of it (=exploding
bomb) you have to solve lots of practical details. However, without
the theory you could experiment for a million years without being able
to build an atom bomb.

But surely, the fact that I am SUPREMELY CONFIDENT that I can get
superior performance from inferior hardware should give you pause for
thought that maybe, just maybe, the relational model is flawed from an
engineer's or scientist's viewpoint?

That's OK with me. But the most you can claim is that today's
IMPLEMENTATIONS are flawed, and you would be 100% correct. How would
you go and prove that the model is flawed? You should prove that a
relational DBMS could not POSSIBLY be efficient.

From the mathematician's (or logician's) viewpoint I agree it's
flawless. But that's true of plenty of broken scientific theories...

Could you give me some other examples?

best regards,
Lauri Pietarinen


Cheers,
Wol

#62Bob Badour
bbadour@golden.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Lauri Pietarinen" <lauri.pietarinen@atbusiness.com> wrote in message
news:bn72o3$as$1@nyytiset.pp.htv.fi...

Anthony W. Youngman wrote:

In article <bn4cca$dj0$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Anthony W. Youngman wrote:

Fine. But MV *doesn't* *need* much of a cache. Let's assume both SQL
and MV have the same amount of RAM to cache in - i.e. *not* *much*. I
did say the spec said "extract maximum performance from the hardware
available".

So what's wrong with getting a machine with lots of memory? How much
does 2G of memory for an Intel-box cost nowadays? Is this some kind of
new ultimate sport, trying to get along with as little memory as
possible?

I presume you didn't read the bit below ... what if you have SEVERAL
tables, and EACH of them is a gigabyte or two in size?

OK, I get your point.

Well, if it is normalised, how easy is it for you to change the
customer_id of an order? Anyway,

Incredibly easy. Just update the "customer_id" field of the invoice
record. A single change to a single "row".

And I presume the system will automatically move all related stuff
(order details etc.) into
the same block as the new customer? How long will that take? What if
there is no room for it there?

if we stick to your example and even if we don't normalise using e.g.
clustering features of Oracle, as Bob pointed out, we are getting at
most the same number of I/O's. So, answer to your question: our
formula is at least as good as yours.

Except I think Bob said we could "optimise to favour *certain*
transactions".

Exactly. This is as true for Pick as it is for any other file processor.

I think actually ANY transaction benefits.

Wol thinks a lot of things that are just plain wrong. That's inherent to his
ignorance and his stupidity.

You're relying
on stuff that's outwith your theory, we're relying on stuff that's
inherent to our model.

I am relying on reality, and Wol relies on fantasy. In his mind, he is right
and nothing will ever change his mind.

That certainly is not true. The theory says NOTHING about how data
should be arranged on disk.
You are talking about how modern SQL-databases behave.

No, he isn't. Wol doesn't even know how modern SQL-databases really behave.
He is talking about nothing but his own imagined prejudices.

The DBMS is at liberty to do whatever it pleases with the data, even
save it in a PICK database. Hey, wadda you think? Would that be a good
idea? We get to keep our SQL but with the speed of PICK ;-)

Now, that was a *conservative* estimate, and we assumed that we did
not have any rows lying around in the (global!) cache. As the size of
the cache grows in proportion to the size of the total database we can
assume less and less disk I/O.

You're relying on the hardware to bale you out :-) We can do the same!

Well why don't you?

We let the hardware help us out if it can. There's a big difference.
If you can't get the hardware, you're stuffed. We don't need it, so
while we may have a hard time of it, it's nowhere near as bad for us.

And again, relational separates the physical from the logical. You're
being hypocritical if you call upon the physical representation to help
out with the (speed of the) logical presentation.

My goodness, no I'm not! It's the same as claiming that if you have a
drawing for a house, you have to make that house out of paper?!?

Don't you see? Wol is an ignorant moron. You will save a lot of bandwidth if
you simply ignore the idiot.

I want a list with all products with corresponding total sales, read
from order detail e.g.

Hammer 10000$
Nail 5000$
Screw 1200$

How many disk reads (or head movements)?

Actually, probably the same as you here.

If we're indexed on order
detail. If Hammer appears in N invoices, then T = (1+N) * ST * 1.05 for
hammers, and the same for all the other products.

Theory favours us, in that if a product appears X times in one invoice,
that's one read for us and X for you

No, theory does not favour Wol's product at all. Making ignorant and stupid
assertions about how many reads are required for a relational dbms does not
affect the actual number of reads required. Wol is an ignorant moron. No
matter how many times you explain these points to him, he will remain
convinced of Pick's mythical performance advantage.

, but hardware will probably help
you more than us (that is, assuming thrashing cuts in) in that you stand
a marginally higher chance of getting multiple instances of a product in
any given read.

So for each product you get T = (1+N) * ST * 1.05.

Now, for our SQL-DBMS, presuming that we build indexes for detail and
product:

order_detail(product_id, qty, unit_price) = 20 bytes/row
product(product_id, product_name) = 50 bytes/row

With 2 disk reads I would get
8K/20 = 400 order detail rows and
8K/50 = 160 product rows

Since all rows are in product_id order, no need for random disk reads so
T = 1 + N/400 + P/160 (N=number of details, P=number of products)
for ALL products and details.

And, because of sequential prefetch, we probably would not have to wait
for I/O's at all.

Really, however you calculate it, it is an order of magnitude less
than your alternative.

And please don't tell me that using indexes is not fair or not in the
spirit of the relational model ;-)

And: what if I was just reading customer-data. Would the same formula
apply (= (2+N)*ST*1.05)?

Nope. If I understand you correctly, you want attributes that belong to
the entity "customer", not the entity "invoice". T = ST * 1.05. (By the
way, billing and/or invoice address (for example) are invoice
attributes, not company attributes.)

No, I want you to give me a list of all your customers. How many disk
reads?

T = N * 1.05 where N is the number of customers. What do you want to
know about those customers? Address? Phone number*s*? Anything else?
That's *all* at no extra cost.

Well, no thanks. I just wanted their names this time. The relational
alternative, with an index on customer_name, would be again an order
of magnitude less disk reads.

But as I understand relational theory, such a question is completely
outside the scope of the theory. Seeing as it tries to divorce the
database logic from the practical implementation ...

The theory, indeed, does not say anything about buffer pools, but by
decoupling logic from implementation we leave the implementor (DBMS)
to do as it feels fit to do.

As DBMS technology advances, we get faster systems without having to
change our programs.

But with MV, if our database is too large for current technology, we
kick the shit out of relational for speed ...

What is "too large"?

Don't forget. You've already said that, if nothing is cached, my
average case exceeds your best. And my case is *already* assuming that
the system is seriously stressed and struggling ...

It does?

When we design databases we can decouple logical planning from
performance considerations, which, you must agree, are two separate
issues.

Yes. BUT what's the point of having a database that is logically
perfect, and whose performance is slow to the point of being unusable?

Don't forget - in practice MultiValue ends up with a database that is
*inherently* optimised such that it almost invariably outperforms an
equivalent SQL database, AND we don't normally have DBAs to help us
achieve that nirvana ...

Frankly, it may well be that PICK systems run faster and cheaper than
relational ones, but certainly not for the reasons you state.

How can anyone forget something that was never true?

I can't find the post now :-( but is Christopher reading this? You know
I compared that relational system on a twin Xeon 800, to an MV system
running on a P90? Christopher made the (reasonable in the
circumstances) assumption that the relational consultants must be
crap, and the MV guy
a guru. Actually, I'd come to exactly the OPPOSITE conclusion. My MV
experience tells me that MV query was probably thrown together, by an
average programmer, in 30 seconds. On the other hand, those SQL
consultants had an axe to grind and a point to prove. They couldn't
afford to let this "old fashioned" system beat them. That SQL query
would have been optimised to within an inch of its life over weeks.
Don't forget how proud they were to beat this MV system! Yet with
hardware that was so much more powerful and a query that was heavily
optimised, they had great difficulty beating a query that was thrown
together in seconds by an average MV guy (or even just a luser!).

Don't forget. I said I am a database *engineer*. Engineers believe in
elegance, they believe in beauty. And when I look at relational, all I
see is the theorists pleading "power", "hardware", "brute force", to
get them out of trouble.

You said that logical planning and performance are separate issues. And
I wouldn't expect you to address the above example in a discussion of
relational, because performance is irrelevant to relational.

I would have to know a lot more details to address it properly.
Performance is irrelevant to the model. It's like E=mc**2. Nice theory
and it actually works. But to get performance out of it (=exploding
bomb) you have to solve lots of practical details. However, without
the theory you could experiment for a million years without being able
to build an atom bomb.

But surely, the fact that I am SUPREMELY CONFIDENT that I can get
superior performance from inferior hardware should give you pause for
thought that maybe, just maybe, the relational model is flawed from an
engineer's or scientist's viewpoint?

Perhaps my supreme confidence (as an engineer trained in electrical
engineering) that Wol is both ignorant and stupid should give him pause for
thought. Maybe, just maybe, his confidence is based on incompetence.

That's OK with me. But the most you can claim is that today's
IMPLEMENTATIONS are flawed, and you would be 100% correct. How would
you go and prove that the model is flawed? You should prove that a
relational DBMS could not POSSIBLY be efficient.

Actually, he should address the problem that AQL cannot even express simple
queries using simple data structures. Getting the wrong answer with blinding
speed is just that: blinding.

From the mathematician's (or logician's) viewpoint I agree it's
flawless. But that's true of plenty of broken scientific theories...

Could you give me some other examples?

Flawless?!? What an ignoramus!

#63Marshall Spight
mspight@dnai.com
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Anthony W. Youngman" <thewolery@nospam.demon.co.uk> wrote in message news:NZJktvDi7yj$EwcG@thewolery.demon.co.uk...

Just like the academics were
brainwashed into thinking that microkernels were the be-all and end-all
- until Linus showed them by practical example that they were all idiots

"The academics" (presumably you mean Tannenbaum et al) never
claimed that monolithic kernels could not obtain market acceptance;
they never said anything about market acceptance. Instead, they
had identified a number of weaknesses of monolithic kernels and
pointed out that a microkernel architecture didn't suffer from these
problems. Certainly the monolithic kernel is easier to implement.

Linus set out to build a unix kernel workalike, and he chose
the easiest path, copying architecture from the 1970s, along
with all the weaknesses that those idiot academics had identified
years earlier. Since then, his monolithic kernel has gotten a lot
of marketshare, due to a number of different reasons, none of
them being architectural superiority.

Marshall

#64Marshall Spight
mspight@dnai.com
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Anthony W. Youngman" <thewolery@nospam.demon.co.uk> wrote in message news:UGXSKIAGbtk$EwW3@thewolery.demon.co.uk...

As soon as a requirement for a database specifies extraction of the
maximum power from the box,

I don't for a second believe that this is your only requirement,
or that this is even an actual requirement. If it really is an
actual requirement, then I assume you're writing all of your
code in hand-tuned assembly language, and that the document
you consult most regularly when writing code is the CPU's
instruction timing table.

Another commodity box costs $1000, which is about the
same as the cost to a company of a day's programmer time.

What *really* matters is getting software delivered in a timely
fashion, that is as correct as possible, and that will operate
reliably over time and not cause data corruption.

Marshall

#65Marshall Spight
mspight@dnai.com
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Bob Badour" <bbadour@golden.net> wrote in message news:W46dnf4tbfF1DwiiU-KYgw@golden.net...

All physical structures will bias performance for some operations and
against others.

This strikes me as a succinct statement of the value of
data independence. One has the option (but not the
requirement) to adjust the physical structures the DBMS
uses while keeping the logical model (and therefore all
application code and queries, etc.) unchanged.

Unless one has data independence, one does not have
this option; one will be locked into a particular
performance model. This is why I found the MV
guy's obvious pleasure at being able to precisely
describe the performance model for his DB as odd:
I thought it a deficit to be able to say what it was;
he thought it an asset.
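The point about data independence can be made mechanical: the query
text stays fixed while the physical structure changes underneath it. A
sketch using SQLite purely as an illustration:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (cust_id INT, name TEXT)")
    query = "SELECT name FROM customer WHERE cust_id = 42"

    # Before tuning: a full table scan.
    print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())

    # Physical tuning: add an index. The application's query is
    # untouched, but the plan becomes an index search.
    conn.execute("CREATE INDEX customer_by_id ON customer (cust_id)")
    print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())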

Marshall

PS. This is nothing you don't know, Bob; just a
comment for the group.

#66Bob Badour
bbadour@golden.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Marshall Spight" <mspight@dnai.com> wrote in message
news:mhMlb.2417$9E1.18525@attbi_s52...

"Bob Badour" <bbadour@golden.net> wrote in message

news:W46dnf4tbfF1DwiiU-KYgw@golden.net...

All physical structures will bias performance for some operations and
against others.

This strikes me as a succinct statement of the value of
data independence. One has the option (but not the
requirement) to adjust the physical structures the DBMS
uses while keeping the logical model (and therefore all
application code and queries, etc.) unchanged.

Unless one has data independence, one does not have
this option; one will be locked into a particular
performance model. This is why I found the MV
guy's obvious pleasure at being able to precisely
describe the performance model for his DB as odd:
I thought it a deficit to be able to say what it was;
he thought it an asset.

It becomes an obvious deficit as soon as he needs to improve upon the
performance for some operation and he has no way to do it. Thus, he lacks
the option to gain the factor of eight improvement for the first query
offered by clustering.

Marshall

PS. This is nothing you don't know, Bob; just a
comment for the group.

Of course. Likewise.

#67Hannu Krosing
hannu@tm.ee
In reply to: Marshall Spight (#63)
Re: Dreaming About Redesigning SQL

Marshall Spight kirjutas N, 23.10.2003 kell 11:01:

"Anthony W. Youngman" <thewolery@nospam.demon.co.uk> wrote in message news:NZJktvDi7yj$EwcG@thewolery.demon.co.uk...

Just like the academics were
brainwashed into thinking that microkernels were the be-all and end-all
- until Linus showed them by practical example that they were all idiots

...

Linus set out to build a unix kernel workalike, and he chose
the easiest path, copying architecture from the 1970s, along
with all the weaknesses that those idiot academics had identified
years earlier. Since then, his monolithic kernel has gotten a lot
of marketshare, due to a number of different reasons, none of
them being architectural superiority.

Unless you count as architectural superiority the fact that it can be
actually written and debugged in a reasonable time.

Being able to mathematically define something as not having certain
weaknesses does not guarantee that the thing can be actually implemented
and/or is usable.

--------------
Hannu

#68Anthony W. Youngman
thewolery@nospam.demon.co.uk
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

In article <mhMlb.2417$9E1.18525@attbi_s52>, Marshall Spight
<mspight@dnai.com> writes

"Bob Badour" <bbadour@golden.net> wrote in message news:W46dnf4tbfF1DwiiU-
KYgw@golden.net...

All physical structures will bias performance for some operations and
against others.

This strikes me as a succinct statement of the value of
data independence. One has the option (but not the
requirement) to adjust the physical structures the DBMS
uses while keeping the logical model (and therefore all
application code and queries, etc.) unchanged.

Unless one has data independence, one does not have
this option; one will be locked into a particular
performance model. This is why I found the MV
guy's obvious pleasure at being able to precisely
describe the performance model for his DB as odd:
I thought it a deficit to be able to say what it was;
he thought it an asset.

When you park your car, do you put the chassis on the drive, the engine
in the garage, and the wheels in the front garden?

You may find my approach of keeping data together strange, I just find
it extremely weird that you think it is an IMPROVEMENT to disassemble
what is in the real world a single thing. I'm sure you would not be
happy if I tried to disassemble YOU and store your head in one place,
your legs and arms in another, etc etc.

Can I refer you to something called "emergent complexity"? A scientific
theory of how the whole can be greater than the sum of its parts?

Harking back to something else, I can't remember who said "the tuple is
the fundamental unit of data". Apart from the fact that such a statement
is not worth arguing with, I would compare that to the quark in physics.
A strange beast that is known to exist, but can never be found in
reality. And to me as a chemist it is totally and utterly irrelevant. It
pays to know it's there just in case in some strange circumstance it
should be useful, but for the most part I can ignore it as just not part
of my reality.

Oh - and do you know why I was so pleased to describe the performance
model for my db? For the same reason as I mentioned Huffman compression.
It's impossible to prove that Huffman is the most efficient algorithm,
and indeed I pointed out that it isn't. It is, however, possible to
prove a mathematical limit on efficiency that no algorithm can ever
beat, and to show how close Huffman comes to it.

I'm TOTALLY happy to be locked into a performance model, if I can PROVE
that there are no other models that are more efficient. My ability with
stats isn't good enough, but the figure bandied about is that there is
room for about 5% improvement before we hit that mathematical limit. SQL
has a HELL of a long way to go to catch up :-)

Cheers,
Wol
--
Anthony W. Youngman - wol at thewolery dot demon dot co dot uk
Witches are curious by definition and inquisitive by nature. She moved in. "Let
me through. I'm a nosey person.", she said, employing both elbows.
Maskerade : (c) 1995 Terry Pratchett

#69Darren King
DarrenK@Routescape.com
In reply to: Anthony W. Youngman (#68)
Re: Dreaming About Redesigning SQL

"Bob" <bbadour@golden.net> spewed forth...

We achieved 8 times the performance with exactly the same
hardware. What the hell is this idiot talking about us
relying on hardware? He is a moron. You will do everyone
a favour if you just bounce him off the bottom of your
killfile.
...
[profanity-laced responses deleted]
...

It's about time to take this discussion private, guys. It has long since stopped being on-topic for hacking PostgreSQL internals.

Darren

#70Mike Mascari
mascarm@mascari.com
In reply to: Darren King (#69)
Re: Dreaming About Redesigning SQL

Darren King wrote:

"Bob" <bbadour@golden.net> spewed forth...

We achieved 8 times the performance with exactly the same
hardware. What the hell is this idiot talking about us
relying on hardware? He is a moron. You will do everyone
a favour if you just bounce him off the bottom of your
killfile.
...
[profanity-laced responses deleted]
...

It's about time to take this discussion private, guys.
It has long since stopped being on-topic for hacking PostgreSQL
internals.

Darren

I've learned that a feed into the postgresql-hackers mailing list from
comp.databases.postgresql.hackers can be easily spotted by its
astonishing lack of civility and intelligent discourse... :-(

Mike Mascari
mascarm@mascari.com

#71Marsh Ray
marsh-pg@mysteray.com
In reply to: Lauri Pietarinen (#46)
Re: Dreaming About Redesigning SQL

Lauri Pietarinen wrote:

The theory, indeed, does not say anything about buffer pools, but by
decoupling logic
from implementation we leave the implementor (DBMS) to do as it feels
fit to do.
As DBMS technology advances, we get faster systems without having to
change our
programs.

I think you've identified why relational systems have been the
overwhelming winner in the business environment. They allow vendors to
provide an optimized but fairly general solution to the interesting
problem of efficiently accessing and storing data on rotating magnetic
storage, while at the same time presenting a programming model that's at
just the right level for the business applications programmer.

Relational theory or no, linked tables are typically conceptualized as a
slight formalization of the spreadsheet, or (in earlier times) stacks of
punched cards. As business computers evolved from more specific machines
that could perform some relational operations on punched cards (sort,
select, etc.), I think it's still sort of stuck in the collective
unconscious of business to want to model their data this way.

I think relational theory is useful primarily to database implementers,
students, and those few application developers who are after a deeply
theoretical understanding of their tools. They're probably the ones
reading this list.

I suppose MV and other non-SQL data stores have their place in certain
niches (embedded systems, etc.), but the business world has already
voted with its feet.

- Marsh

#72Kevin Brown
kevin@sysexperts.com
In reply to: Noname (#23)
Re: Dreaming About Redesigning SQL

Dawn M. Wolthuis wrote:

So, nope, I'm not trolling. I've been doing some research the past
couple of years and I'm convinced that it is time to do something new
(and yet old) with data persistence.

Perhaps.

But before you go down that road, you have to answer the following
simple, yet possibly difficult-to-answer, question:

What problem are you trying to solve?

"Data persistence" is far too vague a term to be meaningful in its own
right -- the term needs some context to have any real meaning here.

We store data for a reason. Similarly, we retrieve it for a reason.
The data we're interested in looking for and the way we are interested
in looking for it will have a huge impact on any data retrieval solution
one could craft.

The relational model of data storage and retrieval is just a model,
highly suitable to some things and not suitable at all to others. The
amount of development work that has gone into it and the amount of use
it has gotten shows that the relational model is actually reasonably
good at meeting many of the data storage and retrieval needs that people
have. As with any method, it has tradeoffs and drawbacks.

There is no magic bullet and there never will be (or so experience
says). I have no reason to believe that the problem of data persistence
and retrieval is any exception to that.

If you have a particular set of data retrieval problems in mind that you
wish to solve, by all means feel free to develop the mathematical
foundation to solve them. Feel free to tell us that the relational
model is not suitable for that set of problems -- we might even agree
with you on that.

But don't make the claim that the relational model is lacking as a
result of not being a storage and retrieval method that is suitable to
all problems, and that there is a Better Way that will Solve Everything.
Many have made such claims about many different technologies. They were
wrong, too.

I may be misreading you and responding to arguments you aren't making or
implying, but if so I doubt I'm alone, based on some of the other
responses I've seen here.

By the way, language is only a means of expression, and the only sort
of question (relevant to this discussion, anyway) that a language is the
answer to is "what's the best way for me to express X?". It is neither
an answer to the question of how to retrieve data nor is it a solution to
the problem of storing data in a persistent manner. The answer to the
question of how best to query data is certainly going to be a language,
but the specific language one comes up with in answer to the question
will depend on what the person asking wants. "English" is likely to
be the best answer only under certain circumstances. SQL is likely to
be the best answer (or, at least, a very good answer) only under other
circumstances. It just depends. But as with any solution to any problem,
there is no one-size-fits-all solution. As a mathematician, you should
know this: the English language is horribly inadequate for expressing
mathematical concepts. That's why mathematicians don't use it as their
primary language for doing math. Why, then, should we expect English,
or Java, or any other language to be any better for performing certain
kinds of queries against data than some other, more directed language?
Say what you want about SQL, but at least it was designed with querying
table-based data in mind and is at least somewhat good at its job.

--
Kevin Brown kevin@sysexperts.com

#73James Rogers
jamesr@best.com
In reply to: Anthony W. Youngman (#45)
Re: Dreaming About Redesigning SQL

On Mon, 2003-10-20 at 13:50, Anthony W. Youngman wrote:

Note I didn't say relational is *incorrect* - the ideas of
"mathematically correct" and "scientifically provable" are orthogonal,
and have nothing to say about each other.

Eh?

"Mathematical" and "Scientific" reasoning (more correctly: axiomatic and
non-axiomatic reasoning, respectively) are the same thing. Any apparent
differences such that we can even make a distinction is the result of
differences in relative system sizes (in terms of Kolmogorov complexity)
in practice. If you think they are orthogonal, you don't understand the
nature of this particular beast.

Cheers,

-James Rogers
jamesr@best.com

#74Lauri Pietarinen
lauri.pietarinen@atbusiness.com
In reply to: Marsh Ray (#71)
Re: Dreaming About Redesigning SQL

Marsh Ray wrote:

Lauri Pietarinen wrote:

The theory, indeed, does not say anything about buffer pools, but by
decoupling logic
from implementation we leave the implementor (DBMS) to do as it feels
fit to do.
As DBMS technology advances, we get faster systems without having to
change our
programs.

I think you've identified why relational systems have been the
overwhelming winner in the business environment. They allow vendors to
provide an optimized but fairly general solution to the interesting
problem of efficiently accessing and storing data on rotating magnetic
storage, while at the same time presenting a programming model that's
at just the right level for the business applications programmer.

Relational theory or no, linked tables are typically conceptualized as
a slight formalization of the spreadsheet, or (in earlier times)
stacks of punched cards. As business computers evolved from more
specific machines that could perform some relational operations on
punched cards (sort, select, etc.), I think it's still sort of stuck
in the collective unconscious of business to want to model their data
this way.

I agree with you on that one. The punched-card history is clearly
visible in the fact that on IBM mainframes many files have a width of 80
characters, which just happens to be the number of characters you could
store on a punched card. And, yes, tables are often thought of as a deck
of index cards, something you might have had in the past.

I think relational theory is useful primarily to database
implementers, students, and those few application developers who are
after a deeply theoretical understanding of their tools. They're
probably the ones reading this list.

I suppose MV and other non-SQL data stores have their place in certain
niches (embedded systems, etc.), but the business world has already
voted with its feet.

What I sense is a longing for a unified environment, something that SQL
+ [your app programming environment] does not provide at the moment.
Hence the affection for Pick and other niche environments?

Lauri

#75Anthony W. Youngman
thewolery@nospam.demon.co.uk
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

In article <bn72o3$as$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Anthony W. Youngman wrote:

In article <bn4cca$dj0$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Anthony W. Youngman wrote:

Fine. But MV *doesn't* *need* much of a cache. Let's assume both SQL and
MV have the same amount of RAM to cache in - i.e. *not* *much*. I did
say the spec said "extract maximum performance from the hardware
available".

So what's wrong with getting a machine with lots of memory? How much
does 2G of memory for an Intel box cost nowadays? Is this some kind of
new ultimate sport, trying to get along with as little memory as
possible?

I presume you didn't read the bit below ... what if you have SEVERAL
tables, and EACH of them is a gigabyte or two in size?

OK, I get your point.

Using technology to get you out of a hole is fine. Assuming it will be
there if you need it is not. And actually, this is one of the factors
hammering the MV model :-( Technology is now powerful enough to solve a
lot of problems simply by using brute force.

Well, if it is normalised, how easy is it for you to change the
customer_id of an order? Anyway,

Incredibly easy. Just update the "customer_id" field of the invoice
record. A single change to a single "row"

And I presume the system will automatically move all related stuff
(order details etc.) into
the same block as the new customer? How long will that take? What if
there is no room for it there?

Well, I'd view an order as an entity. As such, I would give it its own
FILE, and your question doesn't make sense. But if the system did move
the stuff, it would be four disk accesses - read/write to delete the old
entry, read/write to save the new. As for "enough room" - well - it'll
fall over if we have a "disk full" (or it might not).

if we stick to your example, and even if we don't denormalise but
instead use e.g. the clustering features of Oracle, as Bob pointed out,
we are getting at most the same number of I/O's. So, the answer to your
question is: our formula is at least as good as yours.

Except I think Bob said we could "optimise to favour *certain*
transactions". I think actually ANY transaction benefits. You're relying
on stuff that's outwith your theory, we're relying on stuff that's
inherent to our model.

That certainly is not true. The theory says NOTHING about how data
should be arranged on disk.
You are talking about how modern SQL-databases behave. The DBMS is at
liberty to do whatever
it pleases with the data, even save it in a PICK database. Hey, wadda
you think? Would that be
a good idea? We get to keep our SQL but with the speed of PICK ;-)

That would be nice ;-) But I think our two paragraphs don't connect. I
was talking about MV ...

We let the hardware help us out if it can. There's a big difference. If
you can't get the hardware, you're stuffed. We don't need it, so while
we may have a hard time of it, it's nowhere near as bad for us.

And again, relational separates the physical from the logical. You're
being hypocritical if you call upon the physical representation to help
out with the (speed of the) logical presentation.

My goodness, no I'm not! It's the same as claiming that if you have a
drawing for a house, you
have to make that house out of paper?!?

I want a list with all products with corresponding total sales, read

from order detail e.g.

Hammer 10000$
Nail 5000$
Screw 1200$

How many disk reads (or head movements)?

Actually, probably the same as you here.

If we're indexed on order
detail. If Hammer appears in N invoices, then T = (1+N) * ST * 1.05 for
hammers, and the same for all the other products.

Theory favours us, in that if a product appears X times in one invoice,
that's one read for us and X for you, but hardware will probably help
you more than us (that is, assuming thrashing cuts in) in that you stand
a marginally higher chance of getting multiple instances of a product in
any given read.

So for each product you get T = (1+N) * ST * 1.05.

Now, for our SQL-DBMS, presuming that we build indexes for detail and
product:

order_detail(product_id, qty, unit_price) = 20 bytes/row
product(product_id, product_name) = 50 bytes/row

With 2 disk reads I would get
8K/20 = 400 order detail rows and
8K/50 = 160 product rows

Since all rows are in product_id order, no need for random disk reads so
T = 1 + N/400 + P/160 (N=number of details, P=number of products)
for ALL products and details.

And, because of sequential prefetch, we probably would not have to wait
for I/O's at all.

Really, however you calculate it, it is an order of magnitude less
than your alternative.

And please don't tell me that using indexes is not fair or not in the
spirit of the
relational model ;-)

Well, it does result in data being stored multiple times ;-)

And while it maybe doesn't affect the result that much, you wanted the
value? Where has that come from? What if the price changed half way
through the period you're calculating? :-) You've failed to answer your
own question, so maybe I could match you ...

And: what if I was just reading customer-data. Would the same formula
apply (= (2+N)*ST*1.05)?

Nope. If I understand you correctly, you want attributes that belong to
the entity "customer", not the entity "invoice". T = ST * 1.05. (By the
way, billing and/or invoice address (for example) are invoice
attributes, not company attributes.)

No, I want you to give me a list of all your customers. How many disk
reads?

T = N * 1.05 where N is the number of customers. What do you want to
know about those customers? Address? Phone number*s*? Anything else?
That's *all* at no extra cost.

Well, no thanks. I just wanted their names this time.
The relational alternative, with an index on customer_name, would be
again an order of magnitude fewer disk reads.

Well, if you let me use an index here, I'm sorry, GAME OVER! The best
you can do would be a photo finish.

Assuming an overhead of, say, 4 bytes per index entry, the entire index
would be

Size = 4 * N + sigma(name_length) + sigma(key_length)

Okay, I've probably got some padding there as well, but so will you. And
note I didn't say N * field_length, I said sigma(name_length). ;-)

I notice that at no point have you asked where this strange 1.05 keeps
coming from. That's why I keep hammering performance ... okay, maybe
I've lost this "order detail" but it's why I keep hammering my
confidence in general. Given that your data is all ordered optimally for
answering this "detail" request, what's it going to cost you in time or
disk space to answer the request "please recreate invoice X for me"?

MV stores data efficently - look at how little space the index took :-)

It accesses data efficiently - that 1.05 is actually the "how many
places do I need to look" value for the database to respond to a
userland request, given a known primary key or index value. Okay - that
means we push back at the programmer some of the management of data
access, but why should that be solely the responsibility of the dbms? If it
makes sense for the app to do it, then it should ... why should the dbms
have to guess at how to optimise a request if the app has all the
necessary information at its fingertips?

We're now getting into the realms of statistics - and my teacher's
attitude to stats was "you don't need it for the exam so I'm not
teaching it!" :-( So my arguments are more gut feel and experience than
proof, but experience tells me the proof wouldn't be difficult.

But surely, your requirement for grabbing data across multiple invoices
is statistically unusual. And I benefit just as much as you from any ram
being available to cache :-) although I wouldn't benefit so much from
prefetch. The probability is that consecutive requests for data are
either "can I know something else about the entity I've just looked at",
or "can I access another entity at random".

In the former case, if you've stored it in another table, it's another
request from the app to the dbms. With MV, it all came in the first
request. In the latter case, this is where my 1.05 factor cuts in - bear
in mind even for a simple btree file, this factor is only 1 for a
1-level root only file - it goes up to 1.5 when the root bucket splits
and keeps rising from there :-)

So as an engineer, here I am appealing to stats :-) But this is the real
world, and no stats? no real world! Because we have no optimiser, it
encourages the programmer to optimise - I've heard various people say
that if you want a SQL-using app to run fast you mustn't use views -
forcing the programmer to interest themselves in the db in a manner that
relational says they shouldn't. We're not interested in being able to
improve the speed at which the db can find data to respond to an app
request - with an access factor of 1.05 (actually, it's nearer 1.02 or
1.03) we consider any effort there to be a waste of time ...

Basically, the only way you can beat us in the real world is to throw
hardware at the problem - and like I said with linux and macro/micro
kernels, we can do the same :-)

But as I understand relational theory, such a question is completely
outside the scope of the theory. Seeing as it tries to divorce the
database logic from the practical implementation ...

The theory, indeed, does not say anything about buffer pools, but by
decoupling logic from implementation we leave the implementor (DBMS) to
do as it feels fit to do. As DBMS technology advances, we get faster
systems without having to change our programs.

Can you improve on what I've just done? Is any improvement POSSIBLE?

But with MV, if our database is too large for current technology, we
kick the shit out of relational for speed ...

What is "too large"?

Don't forget. You've already said that, if nothing is cached, my average
case exceeds your best. And my case is *already* assuming that the
system is seriously stressed and struggling ...

It does?

Yes. I'll only be in trouble if I'm so short of ram that my working set
gets forced into swap ...

When we design databases we can decouple logical planning from performance
considerations, which, you must agree, are two separate issues.

Yes. BUT what's the point of having a database that is logically
perfect, and whose performance is slow to the point of being unusable?

Don't forget - in practice MultiValue ends up with a database that is
*inherently* optimised such that it almost invariably outperforms an
equivalent SQL database, AND we don't normally have DBAs to help us
achieve that nirvana ...

Frankly, it may well be that PICK systems run faster and cheaper than
relational ones, but certainly
not for the reasons you state.

Well, could you optimise that index any more?

I can't find the post now :-( but is Christopher reading this? You know
I compared that relational system on a twin Xeon 800, to an MV system
running on a P90? Christopher made the (reasonable in the circumstances)
assumption that the relational consultants must be crap, and the MV guy
a guru. Actually, I'd come to exactly the OPPOSITE conclusion. My MV
experience tells me that MV query was probably thrown together, by an
average programmer, in 30 seconds. On the other hand, those SQL
consultants had an axe to grind and a point to prove. They couldn't
afford to let this "old fashioned" system beat them. That SQL query
would have been optimised to within an inch of its life over weeks.
Don't forget how proud they were to beat this MV system! Yet with
hardware that was so much more powerful and a query that was heavily
optimised, they had great difficulty beating a query that was thrown
together in seconds by an average MV guy (or even just a luser!).

Don't forget. I said I am a database *engineer*. Engineers believe in
elegance, they believe in beauty. And when I look at relational, all I
see is the theorists pleading "power", "hardware", "brute force", to get
them out of trouble.

You said that logical planning and performance are separate issues. And
I wouldn't expect you to address the above example in a discussion of
relational, because performance is irrelevant to relational.

I would have to know a lot more details to address it properly.
Performance is irrelevant to the model.
It's like E=mc**2. Nice theory and it actually works. But to get
performance out of it
(=exploding bomb) you have to solve lots of practical details. However,
without the theory
you could experiment for a million years without being able to build an
atom bomb.

But surely, the fact that I am SUPREMELY CONFIDENT that I can get
superior performance from inferior hardware should give you pause for
thought that maybe, just maybe, the relational model is flawed from an
engineer's or scientist's viewpoint?

That's OK with me. But the most you can claim is that todays
IMPLEMENTATIONS are flawed,
and you would be 100% correct. How would you go and prove that the model
is flawed?
You should prove that a relational DBMS could not POSSIBLY be efficient.

Well, if the relational people insist on divorcing theory from
implementation, it's hard to see how they can prove it is efficient.
While that is exactly what I'm trying to prove for MV. Whether
relational is efficient or not is irrelevant, if I can prove MV is
efficient and you can't prove the same for relational.

If that results in running SQL over MV then we've won, I think :-) We
can do that already ...

From the mathematician's (or logician's) viewpoint I agree it's
flawless. But that's true of plenty of broken scientific theories...

Could you give me some other examples?

Euclidean Geometry - just look at the equatorial event horizon of a
black hole.
Newtons laws of motion - just look at Mercury's orbit.
Quantum mechanics - just look at a black hole.
Relativity - just look at quantum mechanics :-) or Schrodinger's cat.

Actually, it's probably true of pretty much all of theoretical physics
since the start of last century ... in each case the only thing wrong
with the theory is that reality just doesn't happen to agree ...

best regards,
Lauri Pietarinen

Cheers,
Wol
--
Anthony W. Youngman - wol at thewolery dot demon dot co dot uk
Witches are curious by definition and inquisitive by nature. She moved in. "Let
me through. I'm a nosey person.", she said, employing both elbows.
Maskerade : (c) 1995 Terry Pratchett

#76Anthony W. Youngman
thewolery@nospam.demon.co.uk
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

In article <$xpsVWAvnCn$Ew5r@thewolery.demon.co.uk>, Anthony W. Youngman
<thewolery@nospam.demon.co.uk> writes

Really, however you calculate it, it is an order of magnitude less
than your alternative.

And please don't tell me that using indexes is not fair or not in the
spirit of the
relational model ;-)

Well, it does result in data being stored multiple times ;-)

And while it maybe doesn't affect the result that much, you wanted the
value? Where has that come from? What if the price changed half way
through the period you're calculating? :-) You've failed to answer your
own question, so maybe I could match you ...

Whoops - sorry - I did notice after I wrote this that you included price
in your index. But it does seem strange indexing on a composite field
like that ...

Cheers,
Wol
--
Anthony W. Youngman - wol at thewolery dot demon dot co dot uk
Witches are curious by definition and inquisitive by nature. She moved in. "Let
me through. I'm a nosey person.", she said, employing both elbows.
Maskerade : (c) 1995 Terry Pratchett

#77Lauri Pietarinen
lauri.pietarinen@atbusiness.com
In reply to: Anthony W. Youngman (#75)
Re: Dreaming About Redesigning SQL

Anthony W. Youngman wrote:

In article <bn72o3$as$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Anthony W. Youngman wrote:

In article <bn4cca$dj0$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Anthony W. Youngman wrote:

Well, if it is normalised, how easy is it for you to change the
customer_id of an order? Anyway,

Incredibly easy. Just update the "customer_id" field of the invoice
record. A single change to a single "row"

And I presume the system will automatically move all related stuff
(order details etc.) into
the same block as the new customer? How long will that take? What if
there is no room for it there?

Well, I'd view an order as an entity. As such, I would give it its own
FILE, and your question doesn't make sense.

But then your formula for disk head movements does not make sense either!

But if the system did move
the stuff, it would be four disk accesses - read/write to delete the old
entry, read/write to save the new. As for "enough room" - well - it'll
fall over if we have a "disk full" (or it might not).

"Not enough room" here means not enought room in the block of the
customer (from which you
were supposed to get all data in one read, or disk head movement). That
would mean that your
order information would be moved perhaps to another block and result in
an extra head movement,
or am I right?

If we're indexed on order
detail. If Hammer appears in N invoices, then T = (1+N) * ST * 1.05 for
hammers, and the same for all the other products.

Theory favours us, in that if a product appears X times in one invoice,
that's one read for us and X for you, but hardware will probably help
you more than us (that is, assuming thrashing cuts in) in that you stand
a marginally higher chance of getting multiple instances of a product in
any given read.

So for each product you get T = (1+N) * ST * 1.05.

Now, for our SQL-DBMS, presuming that we build indexes for detail and
product:

order_detail(product_id, qty, unit_price) = 20 bytes/row
product(product_id, product_name) = 50 bytes/row

With 2 disk reads I would get
8K/20 = 400 order detail rows and
8K/50 = 160 product rows

Since all rows are in product_id order, no need for random disk reads so
T = 1 + N/400 + P/160 (N=number of details, P=number of products)
for ALL products and details.

And, because of sequential prefetch, we probably would not have to wait
for I/O's at all.

Really, however you calculate it, it is an order of magnitude less
than your alternative.

And please don't tell me that using indexes is not fair or not in the
spirit of the
relational model ;-)

Well, it does result in data being stored multiple times ;-)

What on earth is wrong with that? Do you know how much 160GB of disk
costs today? I could ask: does your system work in, say, 4KB? That's how
much memory the first computer I used (a Wang 2000) had. Probably it
would not work at all. In the 50's they did amazing things with hardly
any compilers and very little memory. I am referring to Whirlwind. See
http://www.cedmagic.com/history/whirlwind-computer.html.
Could you have done that with MV? My point? Why are we discussing the
memory and CPU restrictions of the 70's and 80's? If an SQL DBMS uses
more memory and disk, and they are available, why complain about
*that*? I'm not implying that you cannot complain about other matters,
e.g. ease of development etc., and you might even be right. Be that as
it may, I am not trying to make you abandon your MV database.

And while it maybe doesn't affect the result that much, you wanted the
value? Where has that come from?

From e.g.
select p.product_id, product_name, sum(qty*unit_price)
from product p, order_detail od
where p.product_id = od.product_id
group by p.product_id, product_name

This is the SQL statement that will result in

1 + N/400 + P/160 disk reads
(if rows not found in cache)
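
A minimal sketch of the covering indexes this calculation assumes (the
column lists follow the row layouts given above; the index names are
hypothetical):

    create index od_product_idx
        on order_detail (product_id, qty, unit_price);
    create index product_name_idx
        on product (product_id, product_name);

    -- both indexes cover the query, so the DBMS can scan the index
    -- leaf pages in product_id order without touching the base tables

For example, with N = 10,000 detail rows and P = 500 products that is
about 1 + 25 + 4 = 30 page reads in total, against roughly (1+N) reads
per product for a record-at-a-time plan.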

What if the price changed half way
through the period you're calculating?

Which price? The price that has already been paid by customer?

:-) You've failed to answer your
own question, so maybe I could match you ...

How have I failed?

And: what if I was just reading customer-data. Would the same formula
apply (= (2+N)*ST*1.05)?

Nope. If I understand you correctly, you want attributes that belong to
the entity "customer", not the entity "invoice". T = ST * 1.05. (By the
way, billing and/or invoice address (for example) are invoice
attributes, not company attributes.)

No, I want you to give me a list of all your customers. How many disk
reads?

T = N * 1.05 where N is the number of customers. What do you want to
know about those customers? Address? Phone number*s*? Anything else?
That's *all* at no extra cost.

Well, no thanks. I just wanted their names this time.
The relational alternative, with an index on customer_name, would be
again an order of magnitude fewer disk reads.

Well, if you let me use an index here, I'm sorry, GAME OVER! The best
you can do would be a photo finish.

Assuming an overhead of, say, 4 bytes per index entry, the entire index
would be

Size = 4 * N + sigma(name_length) + sigma(key_length)

Okay, I've probably got some padding there as well, but so will you. And
note I didn't say N * field_length, I said sigma(name_length). ;-)

What is sigma(key_length)?

I notice that at no point have you asked where this strange 1.05 keeps
coming from. That's why I keep hammering performance ... okay, maybe
I've lost this "order detail" but it's why I keep hammering my
confidence in general.

Sorry, I lost you here?

Given that your data is all ordered optimally for
answering this "detail" request, what's it going to cost you in time or
disk space to answer the request "please recreate invoice X for me"?

Well, first of all, the fact that I optimized the "detail" request does
not cost me anything regarding the other queries. It will impose a
slight cost on updates, however that would be hardly noticeable except
for mass updates (update batch jobs) that change the column value.

MV stores data efficently - look at how little space the index took :-)

It accesses data efficiently - that 1.05 is actually the "how many
places do I need to look" value for the database to respond to a
userland request, given a known primary key or index value. Okay - that
means we push back at the programmer some of the management of data
access, but why should that be solely the responsibility of the dbms? If it
makes sense for the app to do it, then it should ... why should the dbms
have to guess at how to optimise a request if the app has all the
necessary information at its fingertips?

1) Your database might change over time and, say, a table that
originally had only a few rows could suddenly grow considerably. Now an
optimiser would insulate you from these changes, or in the worst case
all that would need to be done would be to create an index (and, yes,
check that the DBMS starts using it).

2) You might have a product that runs in a number of sites: large ones
and small ones. Now you would not have to reoptimise the programs for
each type of site.

3) Complex SQL queries do quite a lot of things, and it might not be
very obvious to the programmer how best to optimise.

4) Depending on input from the user (say, a search screen), the optimal
access path may be different. An optimiser could generate a different
path depending on this input, as sketched below.
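
A minimal sketch of point 4, with hypothetical names - the same
statement can deserve a different plan depending on the bound value,
which is exactly the judgement an optimiser automates:

    -- highly selective value: an index on state is attractive
    select cust_id, cust_name from customer where state = 'WY';

    -- non-selective value: a sequential scan is likely cheaper
    select cust_id, cust_name from customer where state = 'CA';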

But surely, your requirement for grabbing data across multiple invoices
is statistically unusual.

You mean the product department would not be interested in seeing how
their products have
been selling, and to whom?

And I benefit just as much as you from any ram
being available to cache :-) although I wouldn't benefit so much from
prefetch. The probability is that consecutive requests for data are
either "can I know something else about the entity I've just looked at",
or "can I access another entity at random".

However, if you think of all data relating to a customer, that could
amount to, say, 300KB if he had a long history. Do you think it is a
good idea to pull all that into memory just in case the user wants to
see his history for the last 10 years? And there are lots of different
kinds of information related to customers. Would the user want to see
everything? Isn't it more probable that a specific user wants a certain
*view* of the customer?

In the former case, if you've stored it in another table, it's another
request from the app to the dbms. With MV, it all came in the first
request. In the latter case, this is where my 1.05 factor cuts in - bear
in mind even for a simple btree file, this factor is only 1 for a
1-level root only file - it goes up to 1.5 when the root bucket splits
and keeps rising from there :-)

So as an engineer, here I am appealing to stats :-) But this is the real
world, and no stats? no real world! Because we have no optimiser, it
encourages the programmer to optimise - I've heard various people say
that if you want a SQL-using app to run fast you mustn't use views -
forcing the programmer to interest themselves in the db in a manner that
relational says they shouldn't.

You are correct. But those result mostly from irregularities of SQL (so
I have been told). They result
from the fact that SQL does *not* follow the theory, but instead was
changed to be more "practical".

The story of views is not quite so simple. In some cases views are very
useful and can be used safely.
In other situations you might get ugly surprises.

Take the view

create view california_customers as
select cust_id, cust_name
from customer
where state = 'CA'

The query
select cust_id, cust_name
from california_customers
where cust_name like 'A%'

Will be transformed (under the covers) into
select cust_id, cust_name
from customer
where state = 'CA' and
cust_name like 'A%'

But in other cases the view will first be materialised into
a temporary table and the rest of the query would be
evaluated on that temp table.

Understanding when a DBMS knows how to do what
is not simple, and, hence, you are quite correct in
your observation.
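
A minimal sketch of the second case, with hypothetical names: a view
body containing an aggregate usually cannot simply be merged into the
outer query, so many products of this vintage materialise it first:

    create view product_totals as
        select product_id, sum(qty * unit_price) as total_sales
        from order_detail
        group by product_id;

    -- the predicate on the aggregate cannot naively be pushed below
    -- the GROUP BY, so the view may be evaluated into a temporary
    -- table before the filter is applied
    select product_id, total_sales
    from product_totals
    where total_sales > 10000;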

We're not interested in being able to
improve the speed at which the db can find data to respond to an app
request - with an access factor of 1.05 (actually, it's nearer 1.02 or
1.03) we consider any effort there to be a waste of time ...

But isn't it better to have NO disk reads than one? I thought disk I/O
was rather expensive? With
that mentality you will always be disk bound.

Basically, the only way you can beat us in the real world is to throw
hardware at the problem - and like I said with linux and macro/micro
kernels, we can do the same :-)

Well, please do!

But as I understand relational theory, such a question is completely
outside the scope of the theory. Seeing as it tries to divorce the
database logic from the practical implementation ...

The theory, indeed, does not say anything about buffer pools, but by
decoupling logic from implementation we leave the implementor (DBMS) to
do as it feels fit to do. As DBMS technology advances, we get faster
systems without having to change our programs.

Can you improve on what I've just done? Is any improvement POSSIBLE?

The whole trick is to have as much stuff in memory as possible, and
preferably the *right* stuff. Even
if we have a small memory the most used rows will remain in memory and
hence minimise the need
for disk I/O. And writing to disk is nearly always asynchronous so it
will not affect response time.
"The only good I/O is a dead I/O" ;-)

But with MV, if our database is too large for current technology, we
kick the shit out of relational for speed ...

What is "too large"?

Don't forget. You've already said that, if nothing is cached, my average
case exceeds your best. And my case is *already* assuming that the
system is seriously stressed and struggling ...

It does?

Yes. I'll only be in trouble if I'm so short of ram that my working set
gets forced into swap ...

What if your customer has accumulated, over the years, say 1000 orders?
Would you want to pollute
your cache with all those orders? Note that this is a problem that you
will first encounter after the
system has been running for quite a long time. In MV, what would you
do in a situation like this?

When we design databases we can decouple logical planning from performance
considerations, which, you must agree, are two separate issues.

Yes. BUT what's the point of having a database that is logically
perfect, and whose performance is slow to the point of being unusable?

Don't forget - in practice MultiValue ends up with a database that is
*inherently* optimised such that it almost invariably outperforms an
equivalent SQL database, AND we don't normally have DBAs to help us
achieve that nirvana ...

Frankly, it may well be that PICK systems run faster and cheaper than
relational ones, but certainly
not for the reasons you state.

Well, could you optimise that index any more?

Which index?

I can't find the post now :-( but is Christopher reading this? You know
I compared that relational system on a twin Xeon 800, to an MV system
running on a P90? Christopher made the (reasonable in the circumstances)
assumption that the relational consultants must be crap, and the MV guy
a guru. Actually, I'd come to exactly the OPPOSITE conclusion. My MV
experience tells me that MV query was probably thrown together, by an
average programmer, in 30 seconds. On the other hand, those SQL
consultants had an axe to grind and a point to prove. They couldn't
afford to let this "old fashioned" system beat them. That SQL query
would have been optimised to within an inch of its life over weeks.
Don't forget how proud they were to beat this MV system! Yet with
hardware that was so much more powerful and a query that was heavily
optimised, they had great difficulty beating a query that was thrown
together in seconds by an average MV guy (or even just a luser!).

Don't forget. I said I am a database *engineer*. Engineers believe in
elegance, they believe in beauty. And when I look at relational, all I
see is the theorists pleading "power", "hardware", "brute force", to get
them out of trouble.

You said that logical planning and performance are separate issues. And
I wouldn't expect you to address the above example in a discussion of
relational, because performance is irrelevant to relational.

I would have to know a lot more details to address it properly.
Performance is irrelevant to the model.
It's like E=mc**2. Nice theory and it actually works. But to get
performance out of it
(=exploding bomb) you have to solve lots of practical details. However,
without the theory
you could experiment for a million years without being able to build an
atom bomb.

But surely, the fact that I am SUPREMELY CONFIDENT that I can get
superior performance from inferior hardware should give you pause for
thought that maybe, just maybe, the relational model is flawed from an
engineer's or scientist's viewpoint?

That's OK with me. But the most you can claim is that todays
IMPLEMENTATIONS are flawed,
and you would be 100% correct. How would you go and prove that the model
is flawed?
You should prove that a relational DBMS could not POSSIBLY be efficient.

Well, if the relational people insist on divorcing theory from
implementation, it's hard to see how they can prove it is efficient.
While that is exactly what I'm trying to prove for MV. Whether
relational is efficient or not is irrelevant, if I can prove MV is
efficient and you can't prove the same for relational.

Well, I know of a lot of large banks, insurance companies etc. that are
using SQL DBMSes, and I think they are running just fine. Amazon uses
Oracle. Would you say that their performance is adequate? And I have
witnessed first-hand, and built, lots of fast systems that use SQL
DBMSes.

If that results in running SQL over MV then we've won, I think :-) We
can do that already ...

Not really, because it's the SQL that is the "relational" part (well,
it's not purely relational). So the funny thing is that, whatever lies
below the surface (=whatever engine we are using), relational gets all
the credit!! Unfair, isn't it? As long as it LOOKS like it's relational
to the user, it does not really matter what happens under the hood.

From the mathematician's (or logician's) viewpoint I agree it's
flawless. But that's true of plenty of broken scientific theories...

Could you give me some other examples?

Euclidean Geometry - just look at the equatorial event horizon of a
black hole.
Newtons laws of motion - just look at Mercury's orbit.
Quantum mechanics - just look at a black hole.
Relativity - just look at quantum mechanics :-) or Schrodinger's cat.

Actually, it's probably true of pretty much all of theoretical physics
since the start of last century ... in each case the only thing wrong
with the theory is that reality just doesn't happen to agree ...

Are you suggesting that Newton's theories are totally useless and irrelevant?

kindest regards,
Lauri Pietarinen

#78Lauri Pietarinen
lauri.pietarinen@atbusiness.com
In reply to: Anthony W. Youngman (#76)
Re: Dreaming About Redesigning SQL

Anthony W. Youngman wrote:

In article <$xpsVWAvnCn$Ew5r@thewolery.demon.co.uk>, Anthony W. Youngman
<thewolery@nospam.demon.co.uk> writes

Really, however you calculate it, it is an order of magnitude less
than your alternative.

And please don't tell me that using indexes is not fair or not in the
spirit of the
relational model ;-)

Well, it does result in data being stored multiple times ;-)

And while it maybe doesn't affect the result that much, you wanted the
value? Where has that come from? What if the price changed half way
through the period you're calculating? :-) You've failed to answer your
own question, so maybe I could match you ...

Whoops - sorry - I did notice after I wrote this that you included price
in your index.

OK!

But it does seem strange indexing on a composite field
like that ...

But why does it seem strange?

regards,
Lauri

#79Bob Badour
bbadour@golden.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Lauri Pietarinen" <lauri.pietarinen@atbusiness.com> wrote in message
news:bnhk4n$i3t$1@nyytiset.pp.htv.fi...

Anthony W. Youngman wrote:

In article <bn72o3$as$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Anthony W. Youngman wrote:

In article <bn4cca$dj0$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Anthony W. Youngman wrote:

If we're indexed on order
detail. If Hammer appears in N invoices, then T = (1+N) * ST * 1.05 for
hammers, and the same for all the other products.

Theory favours us, in that if a product appears X times in one invoice,
that's one read for us and X for you, but hardware will probably help
you more than us (that is, assuming thrashing cuts in) in that you
stand a marginally higher chance of getting multiple instances of a
product in any given read.

So for each product you get T = (1+N) * ST * 1.05.

Now, for our SQL-DBMS, presuming that we build indexes for detail and
product:

order_detail(product_id, qty, unit_price) = 20 bytes/row
product(product_id, product_name) = 50 bytes/row

With 2 disk reads I would get
8K/20 = 400 order detail rows and
8K/50 = 160 product rows

Since all rows are in product_id order, no need for random disk reads so
T = 1 + N/400 + P/160 (N=number of details, P=number of products)
for ALL products and details.

And, because of sequential prefetch, we probably would not have to wait
for I/O's at all.

Really, however you calculate it, it is an order of magnitude less
than your alternative.

And please don't tell me that using indexes is not fair or not in the
spirit of the
relational model ;-)

Well, it does result in data being stored multiple times ;-)

What on earth is wrong with that? Do you know how much 160GB of disk
costs today?

Lauri,

Remember who you are talking to. Wol is ignorant and stupid. Somehow he
thinks managed redundancy at the physical level is non-relational because
normalization seeks to reduce redundancy at the logical level.

You have to keep in mind that the man is totally incompetent to comprehend
simple English, let alone basic principles of data management.

Regards,
Bob

#80Bob Badour
bbadour@golden.net
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

"Lauri Pietarinen" <lauri.pietarinen@atbusiness.com> wrote in message
news:bnhkeh$i3t$2@nyytiset.pp.htv.fi...

Anthony W. Youngman wrote:

In article <$xpsVWAvnCn$Ew5r@thewolery.demon.co.uk>, Anthony W. Youngman
<thewolery@nospam.demon.co.uk> writes
But it does seem strange indexing on a composite field
like that ...

But why does it seem strange?

He only knows one product and only a handful of recipes for using that
product. Everything else seems strange because it lies outside the tightly
confined cognitive box from which he views the world.

#81Christopher Browne
cbbrowne@acm.org
In reply to: Christopher Browne (#2)
Re: Dreaming About Redesigning SQL

"Anthony W. Youngman" <thewolery@nospam.demon.co.uk> wrote:

In article <mhMlb.2417$9E1.18525@attbi_s52>, Marshall Spight
<mspight@dnai.com> writes

Unless one has data independence, one does not have
this option; one will be locked into a particular
performance model. This is why I found the MV
guy's obvious pleasure at being able to precisely
describe the performance model for his DB as odd:
I thought it a deficit to be able to say what it was;
he thought it an asset.

When you park your car, do you put the chassis on the drive, the
engine in the garage, and the wheels in the front garden?

When I park my car, I don't particularly _care_ whether it runs on
propane, diesel, gasoline, ethanol, or batteries. (Well, at home,
they don't allow propane cars in the parking garage, but that's a case
where details HAVE to emerge.) I don't need to care whether the car
uses a 4 cylinder engine, 6, 8, 12, or perhaps evades having cylinders
at all.

I frankly have NO IDEA how many RPMs the engine gets to, nor do I know
how many times the wheels turn in the average minute.

These are all details I don't NEED to know in order to park the car,
and are pretty much irrelevant to the average need to drive an
automobile.

I consider it a Good Thing that my database has a query optimizer that
makes it unnecessary for me to worry about the details of how indexes
will be used.

Occasionally some anomaly comes up that requires that I dig into
details, but most of the time, the abstractions allow me to ignore
these details, and allows me to spend my time worrying about
optimizing the things that actually need it, as opposed to chasing
after irrelevant improvements.
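
When one of those anomalies does come up, the details are still
inspectable on demand. A minimal PostgreSQL sketch, with hypothetical
table names:

    -- ask the planner how it intends to execute the query, and how
    -- the plan actually performed; ignore it the rest of the time
    EXPLAIN ANALYZE
        SELECT cust_id, cust_name
        FROM customer
        WHERE state = 'CA';
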
--
select 'cbbrowne' || '@' || 'cbbrowne.com';
http://cbbrowne.com/info/linux.html
ASSEMBLER is a language. Any language that can take a half-dozen
keystrokes and compile it down to one byte of code is all right in my
books. Though for the REAL programmer, assembler is a waste of
time. Why use a compiler when you can code directly into memory
through a front panel.

#82Bob Badour
bbadour@golden.net
In reply to: Christopher Browne (#2)
Re: Dreaming About Redesigning SQL

"Christopher Browne" <cbbrowne@acm.org> wrote in message
news:bni0s0$10g8ah$1@ID-125932.news.uni-berlin.de...

"Anthony W. Youngman" <thewolery@nospam.demon.co.uk> wrote:

In article <mhMlb.2417$9E1.18525@attbi_s52>, Marshall Spight
<mspight@dnai.com> writes

Unless one has data independence, one does not have
this option; one will be locked into a particular
performance model. This is why I found the MV
guy's obvious pleasure at being able to precisely
describe the performance model for his DB as odd:
I thought it a deficit to be able to say what it was;
he thought it an asset.

When you park your car, do you put the chassis on the drive, the
engine in the garage, and the wheels in the front garden?

When I park my car, I don't particularly _care_ whether it runs on
propane, diesel, gasoline, ethanol, or batteries.

Christopher,

You have to remember who you are talking to; Wol is ignorant and stupid. A
car is a physical artifact just as the physical representation of a datum is
a physical artifact. Physical independence is equivalent to having a
door from the hallway to the garage, a door from the kitchen to the garage,
a door from the back yard to the garage, and car access to the driveway--and
an identical car parked in the back alley just for convenience.

Wol's analogies are dumb because they reflect his intelligence.

#83Tom Lane
tgl@sss.pgh.pa.us
In reply to: Yurgis Baykshtis (#51)
Re: Help!!! FreeSpaceMap hashtable out of memory.

"Yurgis Baykshtis" <ybaykshtis@micropat.com> writes:

In pgerr.log this always go together:
WARNING: ShmemAlloc: out of memory
ERROR: FreeSpaceMap hashtable out of memory

If you have a large number of tables in your database, it might be that
you need to increase max_locks_per_transaction.
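
The setting lives in postgresql.conf and only takes effect after a
restart; a minimal sketch of checking and raising it (the value 128 is
just an example):

    -- from psql: show the current limit (the default is 64)
    SHOW max_locks_per_transaction;

    -- then raise it in postgresql.conf and restart the postmaster:
    -- max_locks_per_transaction = 128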

regards, tom lane

#84Anthony W. Youngman
thewolery@nospam.demon.co.uk
In reply to: Noname (#1)
Re: Dreaming About Redesigning SQL

In article <bnhk4n$i3t$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Anthony W. Youngman wrote:

In article <bn72o3$as$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Anthony W. Youngman wrote:

In article <bn4cca$dj0$1@nyytiset.pp.htv.fi>, Lauri Pietarinen
<lauri.pietarinen@atbusiness.com> writes

Well, if it is normalised, how easy is it for you to change the
customer_id of an order? Anyway,

Incredibly easy. Just update the "customer_id" field of the invoice
record. A single change to a single "row"

And I presume the system will automatically move all related stuff
(order details etc.) into
the same block as the new customer? How long will that take? What if
there is no room for it there?

Well, I'd view an order as an entity. As such, I would give it its own
FILE, and your question doesn't make sense.

But then your formula for disk head movements does not make sense either!

Why not? The order is a real-world "thing", and as such I would have an
ORDERS file, in which each order is a single entry, with "customer_id"
as one of its attributes. "order detail" is an attribute of "order", so
if I change "customer_id" it's the relational equivalent of just
changing one cell in one row. The chances of me having to move the
record are pretty near nil, and if I do it won't change bucket, so at most
it involves two frames (or disk blocks, if that's what you want to call
them).

But if the system did move
the stuff, it would be four disk accesses - read/write to delete the old
entry, read/write to save the new. As for "enough room" - well - it'll
fall over if we have a "disk full" (or it might not).

"Not enough room" here means not enought room in the block of the
customer (from which you
were supposed to get all data in one read, or disk head movement). That
would mean that your
order information would be moved perhaps to another block and result in
an extra head movement,
or am I right?

Which I've taken into account - if there isn't enough room in the
original "bucket", I need to either overflow into the next bucket, which
might exist, or create it if it doesn't. I.e. two head movements to
delete from the first bucket, and two head movements to add to the
second.

And it will only fall over if I need to create a new bucket and there's
no space left on the disk (or if (and this is very unlikely in this
scenario) it triggers a "split" which again needs space and there's none
left on disk).

Or have you not sussed that we view "order detail" as an attribute of
"order" (which is therefore stored as part of the same thing), but
"customer" is separate from "order", is stored separately, and is linked
by a relationship. (Whereas "order detail" is NOT related to "order",
because they are part of the same thing :-)

Well, it does result in data being stored multiple times ;-)

What on earth is wrong with that? Do you know how much 160GB of disk
costs today? I could ask: does your system work in, say, 4KB? That's how
much memory the first computer I used (a Wang 2000) had. Probably it
would not work at all. In the 50's they did amazing things with hardly
any compilers and very little memory. I am referring to Whirlwind. See
http://www.cedmagic.com/history/whirlwind-computer.html.
Could you have done that with MV? My point? Why are we discussing the
memory and CPU restrictions of the 70's and 80's? If an SQL DBMS uses
more memory and disk, and they are available, why complain about
*that*? I'm not implying that you cannot complain about other matters,
e.g. ease of development etc., and you might even be right. Be that as
it may, I am not trying to make you abandon your MV database.

As always, you're relying on hardware to help :-) You know what I think
of that :-) And 160Gb of disk is only cheap if you're using IDE on a
desktop PC - it costs a hell of a lot more for a laptop or SCSI for a
server. And if it's embedded it may be that the *room* is expensive, not
the capacity ...

And: what if I was just reading customer-data. Would the same formula
apply (= (2+N)*ST*1.05)?

Nope. If I understand you correctly, you want attributes that belong to
the entity "customer", not the entity "invoice". T = ST * 1.05. (By the
way, billing and/or invoice address (for example) are invoice
attributes, not company attributes.)

No, I want you to give me a list of all your customers. How many disk
reads?

T = N * 1.05 where N is the number of customers. What do you want to
know about those customers? Address? Phone number*s*? Anything else?
That's *all* at no extra cost.

Well, no thanks. I just wanted their names this time.
The relational alternative, with an index on customer_name, would be
again an order of magnitude fewer disk reads.

Well, if you let me use an index here, I'm sorry, GAME OVER! The best
you can do would be a photo finish.

Assuming an overhead of, say, 4 bytes per index entry, the entire index
would be

Size = 4 * N + sigma(name_length) + sigma(key_length)

Okay, I've probably got some padding there as well, but so will you. And
note I didn't say N * field_length, I said sigma(name_length). ;-)

What is sigma(key_length)?

The pointers from the index to the actual row in the actual table ...

There's no point in having an index if you can't get from the index
back to the original full record :-)
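
(To put rough, invented numbers on that formula: with N = 10,000
customers, an average name of 20 bytes and an average key of 8 bytes,

Size = 4 * 10,000 + 20 * 10,000 + 8 * 10,000 = 320,000 bytes

i.e. about 320KB - under a hundred 4KB buckets to scan for a name
listing, against roughly N * 1.05 = 10,500 bucket reads from the
main file.)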

I notice that at no point have you asked where this strange 1.05 keeps
coming from. That's why I keep hammering performance ... okay, maybe
I've lost this "order detail" argument, but it's why I keep hammering
my confidence in general.

Sorry, you lost me here?

I'm trying to push the fact that I can get the data I'm looking for at
almost no cost. Okay, I do expect the hardware to cache things for me,
but...

If your data is relational, almost all accesses will be made with known
primary keys. Given that the key is known, I can get at that data, FIRST
time EVERY time (near enough). And hardware helps me as much as it helps
you. And statistics helps me *more* than you - it's a pretty safe bet
that my access is going to retrieve more data of possible future use
to me than yours will to you.

Given that your data is all ordered optimally for
answering this "detail" request, what's it going to cost you in time or
disk space to answer the request "please recreate invoice X for me"?

Well, first of all, the fact that I optimized the "detail" request
does not cost me anything regarding the other queries. It will impose
a slight cost on updates, though that would be hardly noticeable
except for mass updates (update batch jobs) that change the column
value.

Although again you're relying on hardware to bail you out ... adding
this index has a very definite cost in disk storage, and using it has
a cost in RAM while it's sitting in memory.

MV stores data efficiently - look at how little space the index took :-)

It accesses data efficiently - that 1.05 is actually the "how many
places do I need to look" value for the database to respond to a
userland request, given a known primary key or index value. Okay - that
means we push back at the programmer some of the management of data
access, but why should that be solely the responsibility of the dbms?
If it
makes sense for the app to do it, then it should ... why should the dbms
have to guess at how to optimise a request if the app has all the
necessary information at its fingertips?

1) Your database might change over time - say, a table that originally
had only a few rows could suddenly grow considerably. Now an optimiser
would insulate you from these changes, or in the worst case all that
would need to be done would be to create an index (and, yes, check
that the DBMS starts using it).

Except that an optimiser is *irrelevant* to MV. What do we need to be
insulated from? MV doesn't care whether a FILE is 4Kb or 40Gb, the cost
of accessing a single record, AT RANDOM, from within that FILE is almost
identical. Where would we gain from an optimiser? In practice, it would
get in the way and slow us down!

2) You might have a product that runs at a number of sites: large ones
and small ones. Now you would not have to reoptimise the programs for
each type of site.

BUT WE DON'T NEED AN OPTIMISER. IT'S A WASTE OF CPU TIME!!! WE
*D*O*N*'*T* *N*E*E*D* ONE!!!

3) Complex SQL queries do quite a lot of things, and it might not be
very obvious to the programmer how best to optimise them.

But a large chunk of SQL's complexity is reassembling a view of an
entity. MV doesn't have that complexity. An MV program views the
database the same way as a programmer views the real world. So it's
pretty obvious to a MV programmer how to optimise things.

4) Depending on input from the user (say, a search screen) the optimal
access path may be different. An optimiser could generate a different
path depending on this input.

Again, MV views the entity as a whole, so probably we don't need to
generate a "different path" - it's just "get me this entity" regardless
of what we need to know about it.

But surely, your requirement for grabbing data across multiple invoices
is statistically unusual.

You mean the product department would not be interested in seeing how
their products have
been selling, and to whom?

Of course. But do you really want to know how much you've sold of every
product? What if the stuff went out of production 10 years ago? Surely
you'd more likely want to select order detail by invoice date? Etc etc.

Yep, I bet you could create another index, but suddenly, you're sorting
on product_id and selecting on order_date. Yes, RAM is going to make a
hell of a difference, but surely an *efficient* database underneath is
important :-)
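
(A sketch of that extra index on the relational side - the
order_detail schema and column names here are invented for
illustration:

create index detail_by_date on order_detail (order_date)

select product_id, sum(quantity)
from order_detail
where order_date >= date '2003-01-01'
group by product_id
order by product_id

The index handles the selection on order_date; the surviving rows are
then grouped and sorted on product_id.)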

And I benefit just as much as you from any RAM
being available to cache :-) although I wouldn't benefit so much from
prefetch. The probability is that consecutive requests for data are
either "can I know something else about the entity I've just looked at",
or "can I access another entity at random".

However, if you think of all the data relating to a customer, that
could amount to, say, 300KB if he had a long history. Do you think it
is a good idea to pull all that into memory just in case the user
wants to see his history for the last 10 years? And there are lots of
different kinds of information related to customers. Would the user
want to see everything? Isn't it more probable that a specific user
wants a certain *view* of the customer?

Yes. But why would I pull it *all* in? Bear in mind, the fundamental
element in MV is the entity or RECORD, which I would equate to the
"row" in SQL (yes, I know relational theory says "tuple").

A customer's history is not one entity. It's a collection of entities
(customer detail, invoices, whatever, multiple entities...) and I'd only
pull in the entities that I wanted.

In the former case, if you've stored it in another table, it's another
request from the app to the dbms. With MV, it all came in the first
request. In the latter case, this is where my 1.05 factor cuts in -
bear in mind that even for a simple btree file this factor is only 1
for a one-level, root-only file - it goes up to 1.5 when the root
bucket splits and keeps rising from there :-)

So as an engineer, here I am appealing to stats :-) But this is the real
world, and no stats? no real world! Because we have no optimiser, it
encourages the programmer to optimise - I've heard various people say
that if you want a SQL-using app to run fast you mustn't use views -
forcing the programmer to interest themselves in the db in a manner that
relational says they shouldn't.

You are correct. But those problems result mostly from irregularities
of SQL (so I have been told). They result from the fact that SQL does
*not* follow the theory, but instead was changed to be more
"practical".

The story of views is not quite so simple. In some cases views are very
useful and can be used safely.
In other situations you might get ugly surprises.

And we don't get ugly surprises :-)

Take the view

create view california_customers as
select cust_id, cust_name
from customer
where state = 'CA'

The query
select cust_id, cust_name
from california_customers
where cust_name like 'A%'

will be transformed (under the covers) into
select cust_id, cust_name
from customer
where state = 'CA' and
cust_name like 'A%'

But in other cases the view will first be materialised into
a temporary table and the rest of the query would be
evaluated on that temp table.

Understanding when a DBMS knows how to do what
is not simple, and, hence, you are quite correct in
your observation.
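
(A sketch of that second case - the table, view and column names are
invented for illustration. Given a grouping view such as

create view sales_by_state as
select state, sum(amount) as state_total
from invoice
group by state

a query like

select state, state_total
from sales_by_state
where state_total > 100000

may, on some engines, be answered by first materialising the entire
grouped result into a temporary table and only then applying the
filter, rather than by merging the view text into the query as above.)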

We're not interested in being able to
improve the speed at which the db can find data to respond to an app
request - with an access factor of 1.05 (actually, it's nearer 1.02 or
1.03) we consider any effort there to be a waste of time ...

But isn't it better to have NO disk reads than one? I thought disk I/O
was rather expensive? With
that mentality you will always be disk bound.

I'm assuming we don't have sufficient RAM to cache stuff ...

Our mentality is to leave disk caching to the OS. The app says "get me
X". The database knows *exactly* where to look and asks the OS to "get
me disk sector Y". Any OS worth its salt will have that cached if it's
been asked for recently. That way, we're only caching stuff
that's been accessed recently. But because for us the "atomic" chunk is
an entity, there's a good chance that stuff has been accessed and is in
cache.

SQL optimisation *seems* to be more "efficient" because it tries to
predict what you're going to want next. But whereas SQL *guesses* that
because you've accessed one order detail, you're likely to want other
order details from the same invoice (a sensible guess), you cannot
compare this to MV because it gives you those order details as a side
effect. In order for MV optimisation to be of any use, it would need to
guess which INVOICE I'm going to access next, and frankly a random
number generator is probably as good an optimiser as any!

Basically, the only way you can beat us in the real world is to throw
hardware at the problem - and like I said with linux and macro/micro
kernels, we can do the same :-)

Well, please do!

We do. Which is why we can smoke any relational db for speed unless the
hardware is big enough to store the entire database in RAM (and even
then we'd beat it for speed :-) (just not that much in absolute terms,
although probably a fair bit in percentages :-)

Can you improve on what I've just done? Is any improvement POSSIBLE?

The whole trick is to have as much stuff in memory as possible, and
preferably the *right* stuff. Even if we have a small memory, the
most used rows will remain in memory and hence minimise the need for
disk I/O. And writing to disk is nearly always asynchronous, so it
will not affect response time.
"The only good I/O is a dead I/O" ;-)

Yep. Which is why our attitude of viewing the world as entities means
we're probably going to smoke you. Statistics says the chances of us
being right "by accident" and avoiding the need for i/o is very high.
While you need artificial intelligence - which has a habit of getting
things wrong :-)

And anyway. Aren't you jumping to conclusions? You are *assuming* that
there is such a thing as the "most used rows". In other words, you are
*assuming* that normal db access only accesses a well-defined subset of
the database. What if there is no way of predicting what the user is
going to want next? Your "trick" is worthless ... the last time it was
accessed is likely to be before the latest reboot ...

And, because we view the world with our "atom" of an entity, we almost
certainly stand a better chance than you of related data "just
happening" to be in RAM when we ask for it ...

But with MV, if our database is too large for current technology, we
kick the shit out of relational for speed ...

What is "too large"?

Too large to *preload* *everything* into RAM :-)

Don't forget. You've already said that, if nothing is cached, my average
case exceeds your best. And my case is *already* assuming that the
system is seriously stressed and struggling ...

It does?

Yes. I'll only be in trouble if I'm so short of RAM that my working
set gets forced into swap ...

What if your customer has accumulated, over the years, say, 1000
orders? Would you want to pollute your cache with all those orders?
Note that this is a problem that you will first encounter after the
system has been running for quite a long time. In MV, what would you
do in a situation like this?

Ignore it? Because it's not a problem? The only time it's likely to be a
problem is if the question is "please get all orders for company X". And
even then, provided the ORDERS file is indexed on "customer_id", it's
still just a SINGLE access to the index and we have a list of EVERY
order.
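
(The relational equivalent of that indexed lookup - assuming a
hypothetical orders table keyed by order_id with a customer_id
column - would be something like:

create index orders_by_customer on orders (customer_id)

select order_id
from orders
where customer_id = 'X'

A single probe of the secondary index yields the list of every order
for the customer.)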

When we design databases we can decouple logical planning from
performance considerations, which, you must agree, are two separate
issues.

Yes. BUT what's the point of having a database that is logically
perfect, and whose performance is slow to the point of being unusable?

Don't forget - in practice MultiValue ends up with a database that is
*inherently* optimised such that it almost invariably outperforms an
equivalent SQL database, AND we don't normally have DBAs to help us
achieve that nirvana ...

Frankly, it may well be that PICK systems run faster and cheaper than
relational ones, but certainly
not for the reasons you state.

Well, could you optimise that index any more?

Which index?

I was thinking of that customer names index you were talking about.
Which basically consists solely of the names, pointers to the records
they come from, and a bit of empty space. And if I know the name, I can
find the master record in two goes - one hit to read the index (from
which I retrieve the record key), and a second hit on the main file to
retrieve the company record.

I can't find the post now :-( but is Christopher reading this? You know
I compared that relational system on a twin Xeon 800 to an MV system
running on a P90? Christopher made the (reasonable in the circumstances)
assumption that the relational consultants must be crap, and the MV guy
a guru. Actually, I'd come to exactly the OPPOSITE conclusion. My MV
experience tells me that MV query was probably thrown together, by an
average programmer, in 30 seconds. On the other hand, those SQL
consultants had an axe to grind and a point to prove. They couldn't
afford to let this "old fashioned" system beat them. That SQL query
would have been optimised to within an inch of its life over weeks.
Don't forget how proud they were to beat this MV system! Yet with
hardware that was so much more powerful and a query that was heavily
optimised, they had great difficulty beating a query that was thrown
together in seconds by an average MV guy (or even just a luser!).

Don't forget. I said I am a database *engineer*. Engineers believe in
elegance, they believe in beauty. And when I look at relational, all I
see is the theorists pleading "power", "hardware", "brute force", to get
them out of trouble.

You said that logical planning and performance are separate issues. And
I wouldn't expect you to address the above example in a discussion of
relational, because performance is irrelevant to relational.

I would have to know a lot more details to address it properly.
Performance is irrelevant to the model. It's like E=mc**2. Nice
theory, and it actually works. But to get performance out of it (= an
exploding bomb) you have to solve lots of practical details. However,
without the theory you could experiment for a million years without
being able to build an atom bomb.

But surely, the fact that I am SUPREMELY CONFIDENT that I can get
superior performance from inferior hardware should give you pause for
thought that maybe, just maybe, the relational model is flawed from an
engineer's or scientist's viewpoint?

That's OK with me. But the most you can claim is that today's
IMPLEMENTATIONS are flawed, and you would be 100% correct. How would
you go about proving that the model is flawed? You would have to
prove that a relational DBMS could not POSSIBLY be efficient.

Well, if the relational people insist on divorcing theory from
implementation, it's hard to see how they can prove it is efficient.
While that is exactly what I'm trying to prove for MV. Whether
relational is efficient or not is irrelevant, if I can prove MV is
efficient and you can't prove the same for relational.

Well, I know of a lot of large banks, insurance companies etc... that
are using SQL DBMSes, and I think they are running just fine. Amazon
uses Oracle. Would you say that their performance is adequate? And I
have witnessed first-hand, and built, lots of fast systems that use
SQL DBMSes.

Mebbe. Why did Temenos (a major supplier of banking software) buy jBASE
then? jBASE is an MV database.

Unfortunately (a) marketing budget counts, and (b) marketing budgets can
also set the agenda. Witness that relational theory completely ignores
performance, and look at the trouble I'm having trying to prove to you
that MV is close to the *THEORETICAL* limit of performance (not helped
by my poor grasp of stats :-)

There is a *mathematical* *proof* that you cannot beat Huffman compression.
It shouldn't be that hard to prove that you can't beat MV. It's just
that we're mostly USERS of databases, not database computer scientists.
And, like me, not skilled in the necessary maths.

If that results in running SQL over MV then we've won, I think :-) We
can do that already ...

Not really, because it's the SQL that is the "relational" part (well,
it's not purely relational). So the funny thing is that, whatever
lies below the surface (i.e. whatever engine we are using), relational
gets all the credit!! Unfair, isn't it? As long as it LOOKS
relational to the user, it does not really matter what happens under
the hood.

Yup, it is unfair :-( And yup, it's happening. The more I read about new
advances in how the underlying relational engines work, the more I see
that they are just copying 30-year-old MV technology :-(

From the mathematician's (or logician's) viewpoint I agree it's
flawless. But that's true of plenty of broken scientific theories...

Could you give me some other examples?

Euclidean Geometry - just look at the equatorial event horizon of a
black hole.
Newton's laws of motion - just look at Mercury's orbit.
Quantum mechanics - just look at a black hole.
Relativity - just look at quantum mechanics :-) or Schrodinger's cat.

Actually, it's probably true of pretty much all of theoretical physics
since the start of last century ... in each case the only thing wrong
with the theory is that reality just doesn't happen to agree ...

Are you suggesting that Newton's theories are totally useless and irrelevant?

No. I'm just suggesting that they DON'T WORK! One only has to look at
the orbit of Mercury to know that's true.

All of those theories work within limits. But if you're stupid enough to
believe that they are accurate, then you deserve everything you get when
you get burnt to a crisp ... as the astronauts would have been had NASA
used them ... (actually, the astronauts would probably have frozen as
they missed the moon and couldn't get home).

kindest regards,
Lauri Pietarinen

Cheers,
Wol
--
Anthony W. Youngman - wol at thewolery dot demon dot co dot uk
Witches are curious by definition and inquisitive by nature. She moved in. "Let
me through. I'm a nosey person.", she said, employing both elbows.
Maskerade : (c) 1995 Terry Pratchett

#85Hannu Krosing
hannu@tm.ee
In reply to: Anthony W. Youngman (#84)
Re: Dreaming About Redesigning SQL

Anthony W. Youngman wrote on Wednesday, 05.11.2003 at 01:15:

1) Your database might change over time - say, a table that originally
had only a few rows could suddenly grow considerably. Now an optimiser
would insulate you from these changes, or in the worst case all that
would need to be done would be to create an index (and, yes, check
that the DBMS starts using it).

Except that an optimiser is *irrelevant* to MV. What do we need to be
insulated from? MV doesn't care whether a FILE is 4Kb or 40Gb, the cost
of accessing a single record, AT RANDOM, from within that FILE is almost
identical. Where would we gain from an optimiser? In practice, it would
get in the way and slow us down!

getting a single record from any DB, AT RANDOM, follows the same rules
;)

2) You might have a product that runs at a number of sites: large ones
and small ones. Now you would not have to reoptimise the programs for
each type of site.

BUT WE DON'T NEED AN OPTIMISER. IT'S A WASTE OF CPU TIME!!! WE
*D*O*N*'*T* *N*E*E*D* ONE!!!

on slashdot this would be tagged *funny* ;)

3) Complex SQL queries do quite a lot of things, and it might not be
very obvious to the programmer how best to optimise them.

But a large chunk of SQL's complexity is reassembling a view of an
entity.

perhaps "a large chunk of initial perceived complexity of SQL" is
reassembling a view of an entity. You will get over it in a day or two
;)

that is *if * the thing you are after *is* an entity.

MV doesn't have that complexity. An MV program views the
database the same way as a programmer views the real world.

You mean screenfuls of weird green glowing letters running down the
screen leaving slowly fading tracks ?

So it's pretty obvious to a MV programmer how to optimise things.

I've never been very good at optimising the real world - the obvious
optimisations have very limited scope.

4) Depending on input from the user (say, a search screen) the optimal
access path may be different. An optimiser could generate a different
path depending on this input.

Again, MV views the entity as a whole, so probably we don't need to
generate a "different path" - it's just "get me this entity" regardless
of what we need to know about it.

Not "what we need to know about it" but "what we already know about it".

So it is always a SEQUENTIAL SCAN , non ?

or is there some magic by which you have all "entities" automatically
hashed by each and every attribute (or combination of attributes) ?

We're not interested in being able to
improve the speed at which the db can find data to respond to an app
request - with an access factor of 1.05 (actually, it's nearer 1.02 or
1.03) we consider any effort there to be a waste of time ...

But isn't it better to have NO disk reads than one? I thought disk I/O
was rather expensive? With
that mentality you will always be disk bound.

I'm assuming we don't have sufficient RAM to cache stuff ...

Our mentality is to leave disk caching to the OS. The app says "get me
X". The database knows *exactly* where to look and asks the OS to "get
me disk sector Y".

How does the database map X to Y, without any extra info (meaning extra
disk accesses) ?

If you can always predict your data needs that well, you don't need a
database, all you need is a file system.

Any OS worth its salt will have that cached if it's
been asked for recently.

Were you not talking about databases with substantially more data than
fits into RAM ?

That way, we're only caching stuff
that's been accessed recently. But because for us the "atomic" chunk is
an entity, there's a good chance that stuff has been accessed and is in
cache.

depending on your point of view, anything can be an "entity" (or atomic
chunk) ;)

SQL optimisation *seems* to be more "efficient" because it tries to
predict what you're going to want next.

Where do you get your weird ideas about SQL optimisation from ?

But whereas SQL *guesses* that
because you've accessed one order detail, you're likely to want other
order details from the same invoice (a sensible guess), you cannot
compare this to MV because it gives you those order details as a side
effect. In order for MV optimisation to be of any use, it would need to
guess which INVOICE I'm going to access next, and frankly a random
number generator is probably as good an optimiser as any!

So you claim that MV is good for problems you already know the best way
to solve ?

Basically, the only way you can beat us in the real world is to throw
hardware at the problem - and like I said with linux and macro/micro
kernels, we can do the same :-)

Well, please do!

We do. Which is why we can smoke any relational db for speed unless the
hardware is big enough to store the entire database in RAM (and even
then we'd beat it for speed :-) (just not that much in absolute terms,
although probably a fair bit in percentages :-)

I guess this is the same as some ASM programmer claiming he can beat a C
compiler. This may be true for small very specific tasks on a very
well-understood hardware, but usually not in any more general sense.

Also, while it can take up to one second for a DBMS to optimise a
query, it usually takes much longer (minutes, hours or even days) for
a human to do the same.

Can you improve on what I've just done? Is any improvement POSSIBLE?

The whole trick is to have as much stuff in memory as possible, and
preferably the *right* stuff. Even if we have a small memory, the
most used rows will remain in memory and hence minimise the need for
disk I/O. And writing to disk is nearly always asynchronous, so it
will not affect response time.
"The only good I/O is a dead I/O" ;-)

Yep. Which is why our attitude of viewing the world as entities means
we're probably going to smoke you. Statistics says the chances of us
being right "by accident" and avoiding the need for i/o is very high.
While you need artificial intelligence - which has a habit of getting
things wrong :-)

And anyway. Aren't you jumping to conclusions? You are *assuming* that
there is such a thing as the "most used rows". In other words, you are
*assuming* that normal db access only accesses a well-defined subset of
the database. What if there is no way of predicting what the user is
going to want next? Your "trick" is worthless ... the last time it was
accessed is likely to be before the latest reboot ...

Are you referring to some Win32 database ? Or do MV databases inherently
need rebooting ?

And, because we view the world with our "atom" of an entity, we almost
certainly stand a better chance than you of related data "just
happening" to be in RAM when we ask for it ...

If you read more into RAM than absolutely needed, you *may* stand a
better chance of related data "just happening" to be in RAM when you
ask for it, but you may also just have done unnecessary I/O and
probably pushed something useful out of cache.

What if your customer has accumulated, over the years, say, 1000
orders? Would you want to pollute your cache with all those orders?
Note that this is a problem that you will first encounter after the
system has been running for quite a long time. In MV, what would you
do in a situation like this?

Ignore it? Because it's not a problem? The only time it's likely to be a
problem is if the question is "please get all orders for company X".

The question can be much more complicated than that. How about :

"please get all orders for company X that have items Y which are
currently out of stock and which were sold at prices higher than we sold
them to company Z in the same quarter"

(as company X recently merged with company Z, and we claimed to both
that they were getting the absolutely lowest prices, so we must be
prepared for damage control).
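
(As a sketch of how that might look in SQL - the orders, order_items
and stock tables below are entirely hypothetical, and fixed dates
stand in for "the same quarter":

select o.order_id, oi.item_id, oi.unit_price
from orders o
join order_items oi on oi.order_id = o.order_id
join stock s on s.item_id = oi.item_id
where o.customer_id = 'X'
and s.qty_on_hand = 0
and oi.unit_price >
(select max(oi2.unit_price)
from orders o2
join order_items oi2 on oi2.order_id = o2.order_id
where o2.customer_id = 'Z'
and oi2.item_id = oi.item_id
and o2.order_date >= date '2003-07-01'
and o2.order_date < date '2003-10-01')

No single-entity access pattern answers this; the optimiser has to
pick join orders and indexes across three tables plus a correlated
subquery.)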

And
even then, provided the ORDERS file is indexed on "customer_id", it's
still just a SINGLE access to the index and we have a list of EVERY
order.

How is this different from SQL index ?

Does MV have some hitherto unknown index types in addition to the
traditional btree, hash, bitmap, ... that allow one to get anything
with just *one*
*single* <whatever> ?

Frankly, it may well be that PICK systems run faster and cheaper than
relational ones, but certainly
not for the reasons you state.

Well, could you optimise that index any more?

Which index?

I was thinking of that customer names index you were talking about.
Which basically consists solely of the names, pointers to the records
they come from, and a bit of empty space. And if I know the name, I can
find the master record in two goes - one hit to read the index (from
which I retrieve the record key),

How can you get the record key from the index with *one* hit ?

The only way to do it would be using perfect hashes, but that would need
constant recalculation of the whole index.

and a second hit on the main file to retrieve the company record.

Well, I know of a lot of large banks, insurance companies etc... that
are using SQL DBMSes, and I think they are running just fine. Amazon
uses Oracle. Would you say that their performance is adequate? And I
have witnessed first-hand, and built, lots of fast systems that use
SQL DBMSes.

Mebbe. Why did Temenos (a major supplier of banking software) buy jBASE
then? jBASE is an MV database.

Perhaps they had developed some software on jBASE, were its last
customer, and wanted to make sure that jBASE was not going out of
business ?

Unfortunately (a) marketing budget counts, and (b) marketing budgets can
also set the agenda. Witness that relational theory completely ignores
performance,

OTOH, I've heard complaints that SQL largely ignores "relational theory"
;-p

and look at the trouble I'm having trying to prove to you
that MV is close to the *THEORETICAL* limit of performance (not helped
by my poor grasp of stats :-)

You are showing us that MV is perhaps 1.5x ahead in speed for simple,
well-defined, never-changing tasks. For more complex tasks the 1.5x
advantage will evaporate due to Moore's law catching up with
development time; i.e. in MV you spend long enough manually
programming the complex queries that just waiting for the hardware to
catch up will solve the same problem with no work ...

There is a *mathematical* *proof* that you cannot beat Huffman compression.

I also remember claims that "You can't beat the feeling", but Huffman
compression is one of the weakest I know of.

How do you prove the unavailability of violent means mathematically ?

Do I sound as delusional as this whole thread ?

It shouldn't be that hard to prove that you can't beat MV. It's just
that we're mostly USERS of databases, not database computer scientists.
And, like me, not skilled in the necessary maths.

So you believe anything the salespeople tell you ?

"We have done the math and can assure one can't get any faster" :)

If that results in running SQL over MV then we've won, I think :-) We
can do that already ...

Not really, because it's the SQL that is the "relational" part (well,
it's not purely relational). So the funny thing is that, whatever
lies below the surface (i.e. whatever engine we are using), relational
gets all the credit!! Unfair, isn't it? As long as it LOOKS
relational to the user, it does not really matter what happens under
the hood.

Yup, it is unfair :-( And yup, it's happening. The more I read about new
advances in how the underlying relational engines work, the more I see
that they are just copying 30-year-old MV technology :-(

And they have been doing so for the last 40 years :)

I don't think there is much in the basic "technology" that is
different between MV and an RDBMS, just that MV puts the user at a
much lower level and thus lets/forces one to do more manual labour.

----------------
Hannu