Serious performance problem
Hello,
I discussed a problem concerning the speed of PostgreSQL compared to
MS SQL Server at some length on the pgsql-general list. The thread starts
with this message:
http://fts.postgresql.org/db/mw/msg.html?mid=1035557
Now I tried a snapshot of version 7.2 and got a speed increase of about a
factor of 2. But sorry, this is really not enough. The very simple test I
pointed to in my mail is still much too slow, and the issue would probably
spoil the whole project, which is supposed to be a complete open source
solution and might otherwise end in M$ stuff. I've come under heavy pressure
from my employer, who was talking about the nice world of MS .NET (while he
uses MS-SQL exclusively). To make the matter clear: the issue is the database
of infectious diseases in Germany, run by the Robert Koch-Institute. So this
beast could be of some importance for increasing the acceptance of PostgreSQL
and Open Source in the field of medicine, which is generally known for the
money involved in it. So I really hope that some skilled programmers will be
able to find a good way to solve the performance issue, perhaps by just
profiling the simple query

SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
  FROM Hauptdaten_Fall
 WHERE (((Hauptdaten_Fall.IstAktuell)=20))
 GROUP BY Hauptdaten_Fall.MeldeKategorie
 ORDER BY Hauptdaten_Fall.MeldeKategorie;

against the data set I put at
http://www.foodborne-net.de/~tillea/ifsg/ifsgtest.dump.bz2
If this could be brought below half a second on a modern PC, I could
continue with more realistic queries. I really hope that I can readjust the
image of PostgreSQL in the eyes of my M$-centered colleagues.
Kind regards
Andreas.
Seems that the problem is very simple :))
MS SQL can answer queries from indexes alone, without touching the actual
table at all. PostgreSQL can't. So MS SQL avoids a sequential scan of the
big table and simply scans the index, which is already in the needed order
and is much smaller.
Andreas -
I took a look at your problem, since I'm sort of in the field,
and would like to see free solutions spread, as well.
Here's what I see: your example touches on what can be an Achilles
heel for pgsql's current statistical analyzer: selection on data fields
that have a few common values. Often, the indices don't get used, since
a large fraction of the table needs to be scanned in any case. In
your example, fully 68% of the table fits the WHERE condition.
Here's some timing results on my machine:
Your dataset and query, as written:
real 0m25.272s
user 0m0.090s
sys 0m0.050s
Creating an index on meldekategorie, and forcing its use with
"set enable_seqscan = off":
real 0m14.743s
user 0m0.070s
sys 0m0.050s
Same, with index on istaktuell:
real 0m26.511s
user 0m0.050s
sys 0m0.060s
Now, with an index on both meldekategorie and istaktuell:
real 0m7.179s
user 0m0.060s
sys 0m0.030s
I think we have a winner. No, it's not sub-second, but I improved the time
by 3x just by trying some indices. Note that I _still_ had to force the
use of indices for this one. It's also the first time I've personally seen
a query/dataset that benefits this much from a two-key index.
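A sketch of that last experiment (the column order inside the two-key index is my assumption; the mail doesn't spell it out):

```sql
-- Two-key index covering both the WHERE column and the GROUP BY column.
CREATE INDEX idx_istaktuell_meldekategorie
    ON hauptdaten_fall (istaktuell, meldekategorie);

-- Force the planner away from the sequential scan, as described above.
SET enable_seqscan = off;

SELECT meldekategorie, count(id) AS anz
  FROM hauptdaten_fall
 WHERE istaktuell = 20
 GROUP BY meldekategorie
 ORDER BY meldekategorie;
```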
As another poster replied to you, there is a limitation in postgresql's
use of indices that arises from MVCC: even if the only data requested is
that stored in the index itself, the backend must visit the actual tuple
in the table to ensure that it is 'visible' to the current transaction.
How realistic a representation of your real workload is this query? Realize
that more selective, complex queries are where pgsql shines compared to
other RDBMSs: the 'fast table scanner' type of query that you proposed as
your test doesn't really let pgsql stretch its legs. Do you have example
timings from MS-SQL or others?
Ross
--
Ross Reedstrom, Ph.D. reedstrm@rice.edu
Executive Director phone: 713-348-6166
Gulf Coast Consortium for Bioinformatics fax: 713-348-6182
Rice University MS-39
Houston, TX 77005
Hello Andreas,
A possible solution would be:
CREATE TABLE foo AS
SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY
Hauptdaten_Fall.MeldeKategorie;
This is suitable if your data does not change often. To get automatic updates:
1) Define iPrecision, the precision that you need (an integer).
2) Create a trigger which increments a counter when a record is updated or
inserted. When the counter reaches iPrecision, do a DROP TABLE foo + CREATE
TABLE foo AS SELECT Hauptdaten_Fall.MeldeKategorie,
Count(Hauptdaten_Fall.ID)... This will take a few seconds, but only once.
3) Run a batch script within a time frame (1 hour, 4 hours, 1 day?) so a
human user has very little chance to reach iPrecision.
On 300.000 records, you will get instant results. There are plenty of
tricks like this one. If you employ them, you will ***never*** reach the
limits of a dual Pentium III computer with U3W discs.
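A minimal sketch of the counter trigger described above (all names are made up for illustration; it assumes PL/pgSQL is installed with createlang, and leaves the actual rebuild to the batch script):

```sql
-- Hypothetical bookkeeping table holding the change counter.
CREATE TABLE foo_changes (n integer NOT NULL);
INSERT INTO foo_changes VALUES (0);

-- Bump the counter on every INSERT or UPDATE of Hauptdaten_Fall.
CREATE FUNCTION count_change() RETURNS opaque AS '
BEGIN
    UPDATE foo_changes SET n = n + 1;
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER hauptdaten_fall_change
    AFTER INSERT OR UPDATE ON hauptdaten_fall
    FOR EACH ROW EXECUTE PROCEDURE count_change();

-- The periodic batch job then compares n with iPrecision and, when
-- exceeded, rebuilds the result table:
--   DROP TABLE foo;
--   CREATE TABLE foo AS SELECT ... GROUP BY MeldeKategorie;
--   UPDATE foo_changes SET n = 0;
```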
If you need to answer this message, please reply on
pgsql-general@postgresql.org.
Cheers,
Jean-Michel POURE
At 13:43 29/10/01 +0100, you wrote:
Show quoted text
Hello,
I discussed a problem concerning the speed of PostgreSQL compared to
MS SQL server heavily on postgres-general list. The thread starts with
messagehttp://fts.postgresql.org/db/mw/msg.html?mid=1035557
Now I tried a snapshot of version 7.2 and got an increase of speed of
about factor 2. But sorry this is really not enough. The very simple
test I pointed to in my mail is even much to slow and the issue would
probably spoil down the whole project which should be a complete open
source solution and would perhaps and in any M$ stuff. I�ve got under
heavy preasur from my employer who was talking about the nice world
of MS .net (while he is using MS-SQL exclusively). To make the thing
clear the issue is the gal database of infectious diseases in Germany
runned by the Robert Koch-Institute. So the beast could be of some
importance for increasing the acceptance of PostgreSQL and Open Source
in the field of medicine which is generally known for the money which
is involved in. So I really hope that some skilled programmers would
be able to find a good way to solve the performance issue perhaps by
just profiling the simple querySELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS
Anz FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY
Hauptdaten_Fall.MeldeKategorie;to the data set I put on
http://www.foodborne-net.de/~tillea/ifsg/ifsgtest.dump.bz2
If this should take less than half a second on a modern PC I could
continue to try mo realistic queries.I really hope that I could readjust the image of PostgreSQL in the
eyes of my M$-centered colleagues.Kind regards
Andreas.
---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?
On Mon, 29 Oct 2001, Jean-Michel POURE wrote:
A possible solution would be:
CREATE TABLE foo AS
SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY
Hauptdaten_Fall.MeldeKategorie;
Sorry, this is NO solution of my problem.
On 300.000 records, you will get instant results. There are plenty of
tricks like this one. If you employ them, you will ***never*** reach the
limits of a double Pentium III computer with U3W discs.
It is really no help if I solve the speed issue of this *very simple,
zeroth-order try*. I repeat: I have plenty of queries which do much
more complicated stuff than this one. This is just a crude strip-down of the
problem, fit for debugging/profiling issues of the database *server*. Simple
tricks on a simple example do not help.
If you need to answer this message, please reply on
pgsql-general@postgresql.org.
No, because ...
I discussed a problem concerning the speed of PostgreSQL compared to
MS SQL server heavily on postgres-general list. The thread starts with
message
I did so and got the explicit advice of Tom to ask here.
Consider the problem as a benchmark. I would love to see PostgreSQL
as the winner.
Kind regards
Andreas.
On Mon, 29 Oct 2001, Ross J. Reedstrom wrote:
Here's what I see: Your example touches on what can be an achilles
heel for pgsql's current statistical analyzer: selection on data fields
that have a few common values. Often, the indices don't get used, since
a large fraction of the table needs to be scanned, in any case. In
your example, fully 68% of the table fits the where condition.
...I think we have a winner. No it's not sub-second, but I improved the time
by 3x just by trying some indices. Note that I _still_ had to force the
use of indices for this one. It's also the first time I've personally seen
a query/dataset that benefits this much from a two-key index.
This is true for this example, and I also played with indices as you did. I
also enforced the index scan and compared with forbidding the index scan. The
result was that on my more realistic examples both versions performed about
the same. There was no *real* difference. For sure, in this simple query
there is a difference, but the real examples showed only a 2% - 5% speed
increase (if not slower when enforcing index scans!).
As another poster replied to you, there is limitation with postgresql's
use of indices that arises from MVCC: even if the only data requested is
that stored in the index itself, the backend must visit the actual tuple
in the table to ensure that it is 'visible' to the current transaction.
Any possibility to switch this off temporarily for certain queries like this,
if the programmer can make sure that it is not necessary? Just a naive
idea from a bloody uneducated man in database engineering.
How realistic a representation of your real workload is this query? Realize
that more selective, complex queries are where pgsql shines compared to
other RDBMS: the 'fast table scanner' type query that you proposed as your
test don't really let pgsql stretch it's legs. Do you have example timings
from MS-SQL or others?
Unfortunately, the four tests we did here all seemed to suffer from the
same problem. The situation is that there is a given database structure
which was developed over more than a year on MS-SQL and has an Access GUI.
Now parts of the UI should be made public via the web (I want to use Zope),
and I just imported the data and did some example queries, with the
terribly slow results.
Kind regards and thanks for your ideas
Andreas.
On Mon, 29 Oct 2001, Vsevolod Lobko wrote:
Seems that problem is very simple :))
MSSql can do queries from indexes, without using actual table at all.
Postgresql doesn't.
So mssql avoids sequental scanning of big table, and simply does scan of
index which is already in needed order and has very much less size.
Hmmm, could anyone imagine a *solution* to the problem, simple or not?
I'm thinking of some switch the database programmer could use if he
really knows what he is doing.
Kind regards
Andreas.
CREATE TABLE foo AS
SELECT Hauptdaten_Fall.MeldeKategorie, Count(Hauptdaten_Fall.ID) AS Anz
FROM Hauptdaten_Fall WHERE (((Hauptdaten_Fall.IstAktuell)=20))
GROUP BY Hauptdaten_Fall.MeldeKategorie ORDER BY
Hauptdaten_Fall.MeldeKategorie;

Sorry, this is NO solution of my problem.
Hello Andreas,
For every problem there is a solution. That is what software optimization
is all about. Do you make use of PL/pgSQL stored procedures in your database?
If not, you will probably end up with terrible nested queries that will
eat up server time and power.
Best regards,
Jean-Michel POURE
"Tille, Andreas" wrote:
On Mon, 29 Oct 2001, Ross J. Reedstrom wrote:
Here's what I see: Your example touches on what can be an achilles
heel for pgsql's current statistical analyzer: selection on data fields
that have a few common values. Often, the indices don't get used, since
a large fraction of the table needs to be scanned, in any case. In
your example, fully 68% of the table fits the where condition.
...I think we have a winner. No it's not sub-second, but I improved the time
by 3x just by trying some indices. Note that I _still_ had to force the
use of indices for this one. It's also the first time I've personally seen
a query/dataset that benefits this much from a two-key index.

This is true for this example and I also played with indices as you. I also
enforced the index scan and compared with forbidding the index scan. The
result was on my more realistic examples that both versions performed quite
the same. There was no *real* difference. For sure in this simple query there
is a difference but the real examples showed only 2% - 5% speed increase
(if not slower with enforcing index scans!).
I studied his dataset and found that a simple count(*) on the whole table
took 1.3 sec on my Celeron 375, so I'm sure that the more complex query,
which has to visit 2/3 of the tuples, will not be able to execute in under
1 sec.
My playing with indexes / subqueries and query rewriting got the example
query (actually a functional equivalent) to run in ~5 sec with a simple
aggregate(group(indexscan)) plan, and I suspect that this is as fast as
it will get on my hardware.
It could probably soon be possible to make it run in ~1.5 sec by using an
aggregate function that does a sequential scan and returns a rowset.
As another poster replied to you, there is limitation with postgresql's
use of indices that arises from MVCC: even if the only data requested is
that stored in the index itself, the backend must visit the actual tuple
in the table to ensure that it is 'visible' to the current transaction.

Any possibility to switch this off temporarily for certain queries like this,
if the programmer can make sure that it is not necessary? Just a naive
idea from a bloody uneducated man in database engineering.
There have been plans to set aside a bit in the index that would mark the
deleted tuple. Unfortunately this helps only in cases when there are many
deleted tuples, and all live tuples have to be checked anyway ;(
--------------
Hannu
On Mon, 29 Oct 2001, Vsevolod Lobko wrote:
Seems that problem is very simple :))
MSSql can do queries from indexes, without using actual table at all.
Postgresql doesn't.
So mssql avoids sequental scanning of big table, and simply does scan of
index which is already in needed order and has very much less size.
I forwarded this information to my colleague and he replied the following
(I'm translating from German into English):

hc> I expected this problem. But what is the purpose of an index, if not
hc> to avoid looking into the table itself? Moreover this means that the
hc> expense grows linearly with the table size - no good prospect at all
hc> (the good thing is that it is not exponential :-)).
I have to explain that we are at the *beginning* of the production process.
We expect a lot more data.
hc> With real index usage the expense grows only with log(n).
hc> Whatever the philosophical merits of the database servers, MS-SQL-Server
hc> uses indexes consistently and so it is very fast at many queries.
hc> When performing a query on a field without an index, I get a slow
hc> table scan. This is like measuring the speed of the hard disk and
hc> the cleverness of the cache.
The consequence for my problem is now: if it is technically possible
to implement index scans without table lookups, please implement it. If
not, we will just have to look for another database engine which does,
because our application really needs the speed on this type of query.
I repeat from my initial posting: the choice of server for our
application could be important for many projects in the field of
medicine in Germany. I really hope that there is a reasonable solution
which could perhaps strike a balance between safety and speed. For
example, I can assure in my application that the index, once created,
will stay valid, because I just want to read in a new set of data once
a day (from the MS-SQL server which collects data over the day). So
I could recreate all indices after the import, and the database is
read-only until the next cron job. Is there any chance to speed up
such applications?
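A minimal sketch of such a nightly cron job (the index definition is borrowed from elsewhere in the thread; the import step itself is whatever copies the data over from MS-SQL):

```sql
-- Rebuild the index and refresh planner statistics after the bulk load,
-- so the read-only daytime queries get up-to-date plans.
DROP INDEX idx_meldekategorie_hauptdaten_f;
CREATE INDEX idx_meldekategorie_hauptdaten_f
    ON hauptdaten_fall (meldekategorie);
VACUUM ANALYZE hauptdaten_fall;
```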
Kind regards
Andreas.
I also enforced the index scan and compared with forbidding the index scan.
The result was on my more realistic examples that both versions performed
quite the same. There was no *real* difference. For sure in this simple
query there is a difference but the real examples showed only 2% - 5% speed
increase (if not slower with enforcing index scans!).
You could speed up the query somewhat if you prevent the sort from
hitting the disk. A simple test here showed that you need somewhere
near sort_mem = 15000 in postgresql.conf.
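In postgresql.conf terms, that is (value from the test above; the note on the default is from memory and may vary by version):

```
# postgresql.conf -- allow the sort to complete in memory
sort_mem = 15000        # in kilobytes; the 7.x default is far smaller
```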
Andreas
On 30 Oct 2001 at 11:44 (+0100), Tille, Andreas wrote:
| The consequence for my problem is now: If it is technically possible
| to implement index scans without table lookups please implement it. If
| not we just have to look for another database engine which does so,
| because our applictaion really need the speed on this type of queries.
| I repeat from my initial posting: The choice of the server for our
| application could have importance for many projects in the field of
| medicine in Germany. I really hope that there is a reasonable solution
| which perhaps could give a balance between safety and speed. For
| example I can assure in my application that the index, once created
| will be valid, because I just want to read in a new set of data once
| a day (from the MS-SQL Server which collects data over the day). So
| I could recreate all indices after the import and the database is
| readonly until the next cron job. Is there any chance to speed up
| those applications?
CREATE INDEX idx_meldekategorie_hauptdaten_f
ON hauptdaten_fall(meldekategorie);
CLUSTER idx_meldekategorie_hauptdaten_f ON hauptdaten_fall;
Aggregate (cost=5006.02..5018.90 rows=258 width=16)
-> Group (cost=5006.02..5012.46 rows=2575 width=16)
-> Sort (cost=5006.02..5006.02 rows=2575 width=16)
-> Seq Scan on hauptdaten_fall (cost=0.00..4860.12 rows=2575 width=16)
This looks much nicer, but is still quite slow. I'm quite sure the
slowness is in the sort(), since all queries that don't sort return
quickly. I hoped the clustered index would speed up the sort, but
that is not the case.
It _seems_ a simple optimization would be to not (re)sort the tuples
when using a clustered index.
if (the_column_to_order_by_is_clustered) {
    /* tuples already come out in clustered (index) order: skip the sort */
    if (order_by_is_DESC)
        scan_the_tuples_in_reverse();   /* handle descending order */
}
I haven't looked at the code to see if this is even feasible, but I
do imagine there is enough info available to avoid an unnecessary
sort on the CLUSTERED index. The only problem I see with this is
if the CLUSTERed index is not kept in a CLUSTERed state as more
records are added to this table.
brent
--
"Develop your talent, man, and leave the world something. Records are
really gifts from people. To think that an artist would love you enough
to share his music with anyone is a beautiful thing." -- Duane Allman
For example I can assure in my application that the index, once created
will be valid, because I just want to read in a new set of data once
a day (from the MS-SQL Server which collects data over the day). So
I could recreate all indices after the import and the database is
readonly until the next cron job. Is there any chance to speed up
those applications?
Hello Andreas,
Is your database read-only? Good point, but sorry to insist: your problem is
software optimization. In your case, the database may climb to 200
million rows (1000 days x 200.000 rows). What are you going to do then? Buy
a 16-CPU Itanium box with 10 GB RAM and an MS SQL Server licence? Have a
close look at your problem: how much time does MS SQL Server take to
query 200 million rows? The problem is not choosing between MS SQL and
PostgreSQL ...
If you are adding 200.000 rows of data every day, consider using a
combination of CREATE TABLE AS to create a result table with PL/pgSQL
triggers to maintain data consistency. You will then get instant results,
even on 2 billion rows, because you will always query the result table, not
the original one. Large databases are always optimized this way, because
even with smart indexes there are things (like your problem) that need
*smart* optimization.
Do you need PL/pgSQL source code to perform a test on 2 billion rows? If
so, please email me on pgsql-general and I will send you the code.
Best regards,
Jean-Michel POURE
At 11:49 30/10/01 +0100, you wrote:
On Tue, 30 Oct 2001, Jean-Michel POURE wrote:
Other examples? Maybe I can help you.
The only thing I could send you would be the whole database structure and
the queries I performed. I'm not allowed to publish confidential medical
data. If you think that you could make serious performance tests without
live data, I would be happy to send you this stuff.

Kind regards
Hello Andreas,
Sorry, I just read your email.
The database structure is fine for me.
I don't need data. Send me a couple of queries.
Do you have a Windows desktop?
If so, please install pgAdmin II from http://www.pgadmin.org.
Kind regards,
Jean-Michel POURE
On Tue, 30 Oct 2001, Jean-Michel POURE wrote:
Sorry, this is NO solution of my problem.
^^^^
For every problem there is a solution. That is what software optimization
For sure, and as I said, I'm really sure that it is a server-side
programming issue (and others on the hackers list thought so as well).
But your proposal was not the solution.
is all about. Do you make use of PL/pgSQL stored queries in your database?
If not, you will probably end up with terrible nested queries that will
eat-up server time and power.
Stored procedures are not the solution either, and I see no reason to
discuss this topic here on the general list (moreover, you leave people
unclear about the original question). It was discussed here in two long
threads and belongs to the PostgreSQL programmers. I hope I do not violate
Tom Lane's privacy if I quote here one sentence from a private posting of
his: "Performance issues can be discussed in -hackers."
We are not talking about the performance of the database application but
about the server itself.
Kind regards
Andreas.
On Tue, 30 Oct 2001, Zeugswetter Andreas SB SD wrote:
You could somewhat speed up the query if you avoid that the sort
hits the disk. A simple test here showed, that you need somewhere
near sort_mem = 15000 in postgresql.conf.
Well, these are the usual hints from pgsql-general. I did so and
increased the settings step by step to:
shared_buffers = 131072
sort_mem = 65536
This led to a doubling of speed in my tests, but these are settings where
adding more memory no longer results in any speed increase.
When I posted my question here, I was talking about this "tuned"
PostgreSQL server. The default settings were even worse!
Kind regards
Andreas.
thanks for your kind offer to help!
You are welcome.
Just to make the situation clear once more: I've got hints on the
general list about how to change my database application to perform
better by using extra tables and indices. I'm also able to do some
database programming myself, even if I'm not very experienced. But
the problem is different: there is an existing database application
which has absolutely no performance problems on MS SQL server. If
even the simplest query takes orders of magnitude longer, then
something is wrong with the server.
Moreover, chances are low that changes to the existing database
application will be accepted for this reason. What if the MS-SQL server
ran into performance problems and it came to tuning the code there,
while we are already at the end of speed optimizations on our PostgreSQL
server? There is a binary administrative decision: yes or no. If the same
database application performs this badly, we can *NOT* use PostgreSQL.
The cost of development would be much higher than a second M$-SQL
server license.
Have you done tests on MS SQL with xx million rows? How much data are you
going to have in a single database?
There is a binary technical decision: yes or no. If our application needs
at least one additional table for each query to perform comparably, then
we are lost, because we are unable to do that for each new query.

Yes. This is what multi-dimensional databases do.
This time, we will do it "by hand" in PostgreSQL.
With good naming conventions, this will be clean.
The reason for the performance gap is also clear: it is the different
use of indexes in PostgreSQL. This has the consequence of linear scaling
of performance, compared to the log(n) scaling when indexes are used
strictly. So chances are high that we will run into performance trouble
again even if we can cope with the problems now.
Beware: the native PHP MS SQL driver does not work well.
Do you have a Windows desktop?
No, I haven't.
If so, please install pgAdmin II from http://www.pgadmin.org.
Sorry, I see no relation to the question.
This is for programming in PL/pgSQL and administering PostgreSQL.
Without pgAdmin, you will quickly get in a mess with objects (tables,
views, triggers).
Best regards,
Jean-Michel POURE
AFAIK, sorting is necessary even when you have CLUSTERed a table using an
index. Somewhere in the docs I read something like "CLUSTER reorders the
table on disk so that entries close together in the index are close
together on disk" (obviously written in better English ;-)
But if you INSERT a single row later, it will NOT get inserted in the right
place. So a SORT is still necessary.
MAYBE, but I am not sure at all, the sort may take less "real" time than if
the table were not CLUSTERed, since the table is "nearly" sorted.
Hackers: is the sorting algorithm capable of exiting at the very moment the
table is sorted, or are some extra passes always performed?
Good luck!
Antonio
Brent Verner wrote:
Show quoted text
On 30 Oct 2001 at 11:44 (+0100), Tille, Andreas wrote:
| On Mon, 29 Oct 2001, Vsevolod Lobko wrote:
|
| > Seems that problem is very simple :))
| > MSSql can do queries from indexes, without using actual table at all.
| > Postgresql doesn't.
| >
| > So mssql avoids sequental scanning of big table, and simply does scan of
| > index which is already in needed order and has very much less size.
| I forewarded this information to my colleague and he replied the following
| (im translating from German into English):
|
| hc> I expected this problem. But what is the purpose of an index: Not
| hc> to look into the table itself. Moreover this means that the expense
| hc> grows linear with the table size - no good prospect at all (the
| hc> good thing is it is not exponential :-)).
| I have to explain that we are in the *beginning* of production process.
| We expect a lot more of data.
|
| hc> In case of real index usage the expense grows only with log(n).
| hc> No matter about the better philosophy of database servers, the
| hc> MS-SQL-Server uses indexes consistently and so it is very fast at
| hc> many queries. When performing a query on a field without an index,
| hc> I get a slow table scan. This is like measuring the speed of the
| hc> harddisk and the cleverness of the cache.
|
| The consequence for my problem is now: if it is technically possible
| to implement index scans without table lookups, please implement it. If
| not, we just have to look for another database engine which does so,
| because our application really needs the speed on this type of queries.
| I repeat from my initial posting: The choice of the server for our
| application could have importance for many projects in the field of
| medicine in Germany. I really hope that there is a reasonable solution
| which perhaps could give a balance between safety and speed. For
| example I can assure in my application that the index, once created
| will be valid, because I just want to read in a new set of data once
| a day (from the MS-SQL Server which collects data over the day). So
| I could recreate all indices after the import and the database is
| readonly until the next cron job. Is there any chance to speed up
| those applications?

CREATE INDEX idx_meldekategorie_hauptdaten_f
    ON hauptdaten_fall(meldekategorie);
CLUSTER idx_meldekategorie_hauptdaten_f ON hauptdaten_fall;

Aggregate  (cost=5006.02..5018.90 rows=258 width=16)
  ->  Group  (cost=5006.02..5012.46 rows=2575 width=16)
        ->  Sort  (cost=5006.02..5006.02 rows=2575 width=16)
              ->  Seq Scan on hauptdaten_fall  (cost=0.00..4860.12 rows=2575 width=16)

This looks much nicer, but is still quite slow. I'm quite sure the
slowness is in the sort(), since all queries that don't sort return
quickly. I hoped the clustered index would speed up the sort, but
that is not the case.

It _seems_ a simple optimization would be to not (re)sort the tuples
when using a clustered index:

if( the_column_to_order_by_is_clustered ){
    if( order_by_is_DESC )
        // reverse the tuples to handle
}

I haven't looked at the code to see if this is even feasible, but I
do imagine there is enough info available to avoid an unnecessary
sort on the CLUSTERED index. The only problem I see with this is
if the CLUSTERed index is not kept in a CLUSTERed state as more
records are added to this table.

brent
--
"Develop your talent, man, and leave the world something. Records are
really gifts from people. To think that an artist would love you enough
to share his music with anyone is a beautiful thing." -- Duane Allman
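Vsevolod's point, that MS SQL can answer this query from the index alone, and Brent's observation about the sort can both be seen in a small stand-alone sketch. SQLite is used here purely as a stand-in engine that does support covering-index scans (which PostgreSQL at this point does not); the table and column names are borrowed from the thread's query, and the row values are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE hauptdaten_fall "
            "(id INTEGER, istaktuell INTEGER, meldekategorie TEXT)")
con.executemany("INSERT INTO hauptdaten_fall VALUES (?, ?, ?)",
                [(1, 20, "Salmonellose"), (2, 20, "Influenza"),
                 (3, 20, "Salmonellose"), (4, 10, "Cholera")])

# An index whose columns cover the whole query: the WHERE column first,
# then the GROUP BY/ORDER BY column. Note count(*) instead of count(id),
# so the index alone answers the query.
con.execute("CREATE INDEX idx_cover ON hauptdaten_fall (istaktuell, meldekategorie)")

query = ("SELECT meldekategorie, count(*) AS anz FROM hauptdaten_fall "
         "WHERE istaktuell = 20 GROUP BY meldekategorie ORDER BY meldekategorie")

plan = con.execute("EXPLAIN QUERY PLAN " + query).fetchall()
rows = con.execute(query).fetchall()

print(plan)   # the plan mentions a COVERING INDEX: the table itself is never read
print(rows)
```

Because the index is already ordered by meldekategorie within istaktuell = 20, such an engine can also group and order without a separate sort step, which is precisely what the Seq Scan + Sort plan above cannot avoid.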
On Tue, 30 Oct 2001, Jean-Michel POURE wrote:
Have you done tests on MS SQL with xx million rows. How much data are you
going to have in a single database?
No, because I will not have such a database in the next couple of years.
Why should I? But there are people who claim to do this successfully (not
verified by myself, for sure).
Yes. This is what multi-dimensional databases do.
This time, we will do it "by hand" in PostgreSQL.
With good naming conventions, this will be clean.
Am I ignorant, or why do I fail to see the relation to my original problem
that PostgreSQL performs more than 10 times slower than SQL Server for a
*simple* task? I consider it unacceptable for the simple task, but you
are talking about hard tasks. I'm afraid we are talking about different
things.
using indexes strictly. So chances are high that we will run into
performance trouble again even if we could now cope with the problems.

Beware, the PHP native MS SQL driver does not work well.
Oh, please do not remind me of the consequences if my PostgreSQL efforts
were to fail. It would not be PHP. It would be worse. It would be the
technology of the same people who wrote MS-SQL Server *shrug*. And not
only this: it would mean that *all* web stuff would be done with ASP,
because people think that we should stick to just one web technology
(which is not wrong in principle). My choice was PostgreSQL + Zope ...
Maybe I would give MySQL a try before that ...
This is for programming in PL/pgSQL and administering PostgreSQL.
Without pgAdmin, you will quickly get into a mess with objects (tables,
views, triggers).
Huh, then I hope we will get a port of it soon ...
But at the moment I have had no problems without it.
Kind regards
Andreas.
On Tue, 30 Oct 2001, Antonio Fiol Bonnín wrote:
AFAIK, sorting is necessary even when you have CLUSTERed a table using an index.
Sorting is not the performance constraint in my example. Just leave out
the sorting and see what happens ...
But if you INSERT a single row later, it will NOT get inserted to the right place. So
SORT is still necessary.
Well, rearranging the database in a cron job after inserting new data once
a day overnight would be possible - but I doubt that it makes a big difference.
Kind regards
Andreas.
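Since Andreas says the data only changes through the nightly import, one workaround, independent of how the index-scan question is resolved, is to have the same cron job precompute the aggregate into a tiny summary table, so the web application never touches the big table at all. A sketch of the idea, again with SQLite standing in for the real engine, invented row values, and `kategorie_anz` being a made-up table name:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hauptdaten_fall (id INTEGER, istaktuell INTEGER, meldekategorie TEXT);
CREATE TABLE kategorie_anz (meldekategorie TEXT, anz INTEGER);
""")
con.executemany("INSERT INTO hauptdaten_fall VALUES (?, ?, ?)",
                [(1, 20, "Salmonellose"), (2, 20, "Salmonellose"),
                 (3, 20, "Influenza"), (4, 10, "Influenza")])

def refresh_summary(con):
    """The nightly job: rebuild the summary table right after the daily import."""
    with con:  # one transaction, so readers never see an empty summary
        con.execute("DELETE FROM kategorie_anz")
        con.execute("INSERT INTO kategorie_anz "
                    "SELECT meldekategorie, count(id) FROM hauptdaten_fall "
                    "WHERE istaktuell = 20 GROUP BY meldekategorie")

refresh_summary(con)
rows = con.execute("SELECT * FROM kategorie_anz ORDER BY meldekategorie").fetchall()
print(rows)
# → [('Influenza', 1), ('Salmonellose', 2)]
```

The daytime query then degenerates to `SELECT * FROM kategorie_anz ORDER BY meldekategorie`, which is fast on any engine; the price is that the counts are only as fresh as the last import, which matches the read-only-between-imports setup described above.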