Questions about indexes?
Hello postgres hackers,
Been a while since I have participated on this list ... but I have a new
itch to scratch....
Although the table schema is immaterial, I will provide it so we have a
common framework for this discussion:
host_id integer (not null)
timestamp datetime (not null)
category text (not null) [<= 5 chars]
anomaly text (not null) [<= 1024 chars]
This table is used to store archived data, so each row in the table must
be unique. Currently I am using a primary key across each column to
enforce this uniqueness. This table currently has ~86 million rows and
is 16+ GB in size. This primary key index is also 16+ GB in size,
because it appears all the data is duplicated in the index. (I have
only done some preliminary looking at the database file with strings,
etc ... so this assumption is purely based on these observations).
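For concreteness, the layout above corresponds roughly to the following DDL
(just a sketch; the table name and the use of the timestamp type in place of
the old datetime alias are illustrative):
CREATE TABLE history (
    host_id     integer    NOT NULL,
    "timestamp" timestamp  NOT NULL,
    category    text       NOT NULL,    -- <= 5 chars
    anomaly     text       NOT NULL,    -- <= 1024 chars
    PRIMARY KEY (host_id, "timestamp", category, anomaly)
);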
I am not sure why all the data is duplicated in the index ... but I bet
it has to do with performance since it would save a lookup in the main
table. Are there any benchmarks or papers related to this topic that I
should locate and read? I am curious about this because it seems the only
advantage gained is searching the index for the specified values....
Once the entry is found, the full entry needs to be pulled from the main
table anyhow since the index does not contain all the data. Also with
the increased size, it seems additional pressure would be put on the
shared memory caches (no idea how this really works, just guessing! :))
Since my only requirement is that the rows be unique, I have developed a
custom MD5 function in C, and created an index on the MD5 hash of the
concatenation of all the fields. This has reduced the disk space usage
considerably, as shown below against my test database (~6 million rows
at 1+ GB).
All this data is based off the test database running 7.3.2:
Type                  Size (bytes)
----------------------------------
Database table          1188642816
All-columns pkey        1510252544
MD5-column pkey          370999296
Just using the MD5 hash data instead of all the columns is a considerable
disk space win, going from 1.5 GB to 370 MB.
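For anyone curious about the mechanics, the index is shaped roughly like the
sketch below. This is only a sketch: row_md5() is a hypothetical name standing
in for the custom C function (it hashes the concatenation of its arguments and
has to be marked immutable/cachable to be usable in an index):
-- sketch only: row_md5() is a stand-in for the custom C hash function
CREATE UNIQUE INDEX history_md5_key
    ON history (row_md5(host_id, "timestamp", category, anomaly));
A duplicate row then fails the insert with a unique-constraint error instead
of being stored twice.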
Has anyone else solved this problem? Has anyone else looked into
something like this and mind sharing so I do not have to re-invent the
wheel? :) Also (assuming there are no papers / benchmarks proving that data
in the index is a good idea), how difficult would it be to implement an
index type that extracts the data from the main table?
Thanks for reading. I will be happy to field any question that I can,
or read any papers, research, etc that relates to this topic.
- Ryan
P.S. the production database is running 7.2.4 if that makes a
difference.
--
Ryan Bradetich <rbradetich@uswest.net>
Ryan Bradetich <rbradetich@uswest.net> writes:
Although the table schema is immaterial, I will provide it so we have a
common framework for this discussion:
host_id integer (not null)
timestamp datetime (not null)
category text (not null) [<= 5 chars]
anomaly text (not null) [<= 1024 chars]
This table is used to store archived data, so each row in the table must
be unique. Currently I am using a primary key across each column to
enforce this uniqueness.
It's not real clear to me why you bother enforcing a constraint that the
complete row be unique. Wouldn't a useful constraint be that the first
three columns be unique? Even if that's not correct, what's wrong with
tolerating a few duplicates? You can't tell me it's to save on storage
;-)
I am not sure why all the data is duplicated in the index ... but I bet
it has to do with performance since it would save a lookup in the main
table.
An index that can't prevent looking into the main table wouldn't be
worth anything AFAICS ...
regards, tom lane
On Sun, 2003-02-16 at 23:34, Tom Lane wrote:
Ryan Bradetich <rbradetich@uswest.net> writes:
Although the table schema is immaterial, I will provide it so we have a
common framework for this discussion:
host_id integer (not null)
timestamp datetime (not null)
category text (not null) [<= 5 chars]
anomaly text (not null) [<= 1024 chars]
This table is used to store archived data, so each row in the table must
be unique. Currently I am using a primary key across each column to
enforce this uniqueness.
It's not real clear to me why you bother enforcing a constraint that the
complete row be unique. Wouldn't a useful constraint be that the first
three columns be unique? Even if that's not correct, what's wrong with
tolerating a few duplicates? You can't tell me it's to save on storage
;-)
The table holds system policy compliance data. The category is
basically the policy, and the anomaly is the detailed text explaining
why the system is out of compliance. So the anomaly data is important
(and often the reason why the key is unique). The reason we are
archiving the data is to generate reports and graphs showing policy
compliance over time. Duplicated rows will artificially inflate the
numbers in the reports and graphs. The other option we had was to
perform a DISTINCT select at report / graph time; we chose not to go this
route because of the sort added to the query. (Also it just seemed
tidier to only store good data :))
The disk storage is a minor concern :), but I was actually looking at it
as a possible performance enhancement. I am curious how it affects the
shared buffer cache, and also there should be fewer pages on average to read
since the index size is smaller.
Does this make sense? Or am I out in left field again? :)
I am not sure why all the data is duplicated in the index ... but I bet
it has to do with performance since it would save a lookup in the main
table.
An index that can't prevent looking into the main table wouldn't be
worth anything AFAICS ...
Ok, scratch that idea then :) I will continue looking at other ideas
like the MD5 data hashing etc.
Thanks for your input Tom!
- Ryan
regards, tom lane
--
Ryan Bradetich <rbradetich@uswest.net>
Ryan Bradetich <rbradetich@uswest.net> writes:
On Sun, 2003-02-16 at 23:34, Tom Lane wrote:
It's not real clear to me why you bother enforcing a constraint that the
complete row be unique. Wouldn't a useful constraint be that the first
three columns be unique?
The table holds system policy compliance data. The category is
basically the policy, and the anomaly is the detailed text explaining
why the system is out of compliance. So the anomaly data is important
(and often the reason why the key is unique).
Well, sure the anomaly is important: it's the payload, the reason why
you bother to have the table in the first place. But that doesn't mean
it's part of the key. Generally the key would be the info you use to
look up a particular anomaly text. In this example, it's not clear to
me why you'd need/want two different anomaly texts entered for the same
host_id and the same category at the same instant of time. ISTM there's
something inadequate about your category column if you need that.
regards, tom lane
On Mon, 2003-02-17 at 00:15, Tom Lane wrote:
Ryan Bradetich <rbradetich@uswest.net> writes:
On Sun, 2003-02-16 at 23:34, Tom Lane wrote:
It's not real clear to me why you bother enforcing a constraint that the
complete row be unique. Wouldn't a useful constraint be that the first
three columns be unique?
The table holds system policy compliance data. The category is
basically the policy, and the anomaly is the detailed text explaining
why the system is out of compliance. So the anomaly data is important
(and often the reason why the key is unique).
Well, sure the anomaly is important: it's the payload, the reason why
you bother to have the table in the first place. But that doesn't mean
it's part of the key. Generally the key would be the info you use to
look up a particular anomaly text. In this example, it's not clear to
me why you'd need/want two different anomaly texts entered for the same
host_id and the same category at the same instant of time. ISTM there's
something inadequate about your category column if you need that.
Ok, I understand what you are asking now :)
Let me make up a contrived example to show how the table is used.
host_id 1 = hosta.somewhere.com
host_id 2 = hostb.somewhere.com
The categories are coded like so (made-up examples):
cat p101 = /etc/passwd check
cat f101 = filesystem check.
the table would look like:
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.
etc...
So I do not need the anomaly to be part of the index; I only need it to
keep the rows unique. I agree with you that I would not normally add the
anomaly to the index, except for the unique row requirement. Thinking about
it now, maybe I should guarantee unique rows via a check constraint...
Thanks for making me think about this in a different way!
- Ryan
regards, tom lane
--
Ryan Bradetich <rbradetich@uswest.net>
Ryan Bradetich said:
the table would look like:
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.
etc...
So I do not need the anomaly to be part of the index; I only need it to
keep the rows unique. I agree with you that I would not normally add the
anomaly to the index, except for the unique row requirement. Thinking about
it now, maybe I should guarantee unique rows via a check constraint...
Thanks for making me think about this in a different way!
(sorry this is a bit long)
Ryan,
I use somewhat similarly structured data (archived records of various events)
and when the database was set up (back when this baby was called postgres95), I
too used indexes on all possible fields.
My database consists of an 'operations' table, which holds data for the last
x days (for example), and several tables with archived records (per month or
per year - see later). The operations table can have frequent updates, which
add new data. Data is never modified, but lookups are made often. The archived
tables are generated once and forever from the operations table (possibly
merging them in the future, but I haven't yet made up my mind on this) - then
access is read-only, although sufficiently frequent.
What I have found over the many years of operating this database on different
PostgreSQL versions and hardware is that indexes have considerable cost. :)
So does the need to not miss anything from the operations table (that is,
to collect data from many places and have it all in there).
I ended up with only a few indexes on the operations table, because the
processes that fill it up do minimal lookups to see if data is already in the
table and, if not, do inserts. Then at regular intervals, the table is cleaned
up - that is, a process to remove the duplicates is run. This unfortunately
costs OIDs, but I found no other reasonable way to do the fast inserts.
Perhaps the best way is to create the table without OIDs (but wouldn't this
still waste OIDs?), use COPY, and then clean afterwards?
The archived tables are generated, then cleaned up. Then, as Tom suggested,
indexes are put on the archived tables, only for the fields that are used in
queries. Once the table is created, there is no way duplicated data will
exist, as it will not be inserted into again. Therefore there is no need for
UNIQUE index enforcement.
If you need to have one large 'history' table, then perhaps you will just have
to do (slow :) selects for each record before each insert, or just insert the
data and then run the cleanup process.
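(The cleanup process I mean boils down to something like the query below. It
is only a sketch - the column names are borrowed from Ryan's example schema,
since my own table layout differs - and it keeps the lowest-oid copy of each
duplicated row.)
DELETE FROM operations
 WHERE oid NOT IN (SELECT min(oid)
                     FROM operations
                    GROUP BY host_id, "timestamp", category, anomaly);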
Daniel
I ended up with only a few indexes on the operations table, because the
processes that fill it up do minimal lookups to see if data is already in the
table and, if not, do inserts. Then at regular intervals, the table is cleaned
up - that is, a process to remove the duplicates is run. This unfortunately
costs OIDs, but I found no other reasonable way to do the fast inserts.
Perhaps the best way is to create the table without OIDs (but wouldn't this
still waste OIDs?), use COPY, and then clean afterwards?
No, WITHOUT OIDS is implemented specifically to not waste OIDs.
Chris
Ryan Bradetich <rbradetich@uswest.net> writes:
the table would look like:
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
Ah, I see your point now. (Thinks: what about separating the "anomaly"
column into an "identifier" and a "complaint" column:
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | x | user has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | y | user has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | y | user has expired password.
2 | Mon Feb 17 00:34:24 MST 2003 | f101 | /foo | file has improper owner.
No, that doesn't quite work either, unless you are willing to make the
categories more specific. At which point the category and the anomaly
text become equivalent. Actually I'm wondering why you bother with the
category at all; isn't it implied by the anomaly text?)
I agree with you that I would not normally add the anomaly to the
index, except for the unique row requirement. Thinking about it now,
maybe I should guarantee unique rows via a check constraint...
A check constraint won't be efficient either, at least not without a
supporting index. Possibly you could index just the host and timestamp
columns, which would not be unique but it would cut the number of rows
the constraint would need to examine to something manageable.
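(Concretely, something like the sketch below; the table name is borrowed from
Ryan's example and is only illustrative:)
CREATE INDEX history_host_ts_idx ON history (host_id, "timestamp");
-- the uniqueness check then only has to look at the few rows sharing
-- a given host_id and timestamp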
But I'm still thinking that enforcing uniqueness is a waste of time.
What exactly is so harmful about it if
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
appears twice? How likely is that anyway (especially if you don't
truncate the timestamp precision)?
regards, tom lane
On Mon, 16 Feb 2003, Ryan Bradetich wrote:
I am not sure why all the data is duplicated in the index ...
Well, you have to have the full key in the index, or how would you know,
when you look at a particular index item, if it actually matches what
you're searching for?
MS SQL server does have an interesting option that would help you a lot
in this case: clustered indexes. A table may have a single clustered
index, and each leaf node of the index stores not just the key but
actually the entire row. Thus, in a case like yours, you'd store the row
only once, not twice.
Without thinking too hard about it (my usual mode of operation on this
list :-)) this could probably be implemented in postgresql. But I don't
think it would be entirely trivial, and your case is unusual enough
that I very much doubt whether it would be worth implementing to fix
that alone. It would also offer the advantage that any lookup using the
clustered index would save fetching the heap page after that as well,
but it's hard to say if the savings would be worth the work.
Since my only requirement is that the rows be unique, I have developed a
custom MD5 function in C, and created an index on the MD5 hash of the
concatenation of all the fields.
Well, that won't guarantee uniqueness, since it's perfectly possible
to have two different rows hash to the same value. (If that weren't
possible, your hash would have to contain as much information as the row
itself, and your space savings wouldn't be nearly so dramatic.)
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC
Curt Sampson wrote:
On Mon, 16 Feb 2003, Ryan Bradetich wrote:
Since my only requirement is that the rows be unique, I have developed a
custom MD5 function in C, and created an index on the MD5 hash of the
concatenation of all the fields.
Well, that won't guarantee uniqueness, since it's perfectly possible
to have two different rows hash to the same value. (If that weren't
possible, your hash would have to contain as much information as the row
itself, and your space savings wouldn't be nearly so dramatic.)
That's true, but even with 4 billion rows the probability of any
duplicate hash is still astronomically small (far less than one in 4
billion), so it's probably a safe enough bet. His application doesn't require
absolute uniqueness, fortunately, so md5 works well enough in this
case.
Otherwise md5 wouldn't be a terribly good hash...
--
Kevin Brown kevin@sysexperts.com
Ryan,
I am crossing this discussion to the PGSQL-PERFORMANCE list, which is the
proper place for it. Anyone interested, please follow us there!
Ryan Bradetich said:
the table would look like:
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.
etc...
So I do not need the anomaly to be part of the index; I only need it to
keep the rows unique. I agree with you that I would not normally add the
anomaly to the index, except for the unique row requirement. Thinking about
it now, maybe I should guarantee unique rows via a check constraint...
Thanks for making me think about this in a different way!
First off, I'm not clear on why a duplicate anomaly would necessarily be
invalid, given the above. Not useful, certainly, but legitimate real data.
I realize that you have performance considerations to work with. However, I
must point out that your database design is not *at all* normalized ... and
that normalizing it might solve some of your indexing problems.
A normalized structure would look something like:
TABLE operations
id serial not null primary key,
host_id int not null,
timeoccurred timestamp not null default now(),
category varchar(5) not null,
constraint operations_unq unique (host_id, timeoccurred, category)
TABLE anomalies
id serial not null primary key,
operation_id int not null references operations(id) on delete cascade,
anomaly text
This structure would have two advantages for you:
1) the operations table would be *much* smaller than what you have now, as you
would not be repeating rows for each anomaly.
2) In neither table would you be including the anomaly text in an index ...
thus reducing index size tremendously.
As a warning, though: you may find that for insert speed the referential
integrity key is better maintained at the middleware layer. We've had some
complaints about slow FK enforcement in Postgres.
--
Josh Berkus
Aglio Database Solutions
San Francisco
Does there currently exist any kind of script that can be run on
Postgres to conduct a complete feature coverage test with varying
dataset sizes to compare performance between functionality changes?
The interest of course is to get a baseline of performance and then to
see how manipulating internal algorithms, changing the vacuum frequency, or
placing WAL logs on a separate physical disk affects the performance of
the various features with various dataset sizes.
If not, how many people would be interested in such a script being
written?
Keith Bottner
kbottner@istation.com
"Vegetarian - that's an old Indian word meaning 'lousy hunter.'" - Andy
Rooney
Keith,
Does there currently exist any kind of script that can be run on
Postgres to conduct a complete feature coverage test with varying
dataset sizes to compare performance between functionality changes?
No.
If not, how many people would be interested in such a script being
written?
Just about everyone on this list, as well as Advocacy and Hackers, to
judge by the current long-running thread on the topic, which has
meandered across several lists.
-Josh Berkus
Josh,
Posting to the performance list as requested :) The reason I originally
posted to the hackers list was that I was curious about the contents of the
index and how they worked.... but now this thread is more about
performance, so this list is more appropriate.
On Tue, 2003-02-18 at 10:37, Josh Berkus wrote:
Ryan,
I am crossing this discussion to the PGSQL-PERFORMANCE list, which is the
proper place for it. Anyone interested, please follow us there!
Ryan Bradetich said:
the table would look like:
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.
etc...
So I do not need the anomaly to be part of the index; I only need it to
keep the rows unique. I agree with you that I would not normally add the
anomaly to the index, except for the unique row requirement. Thinking about
it now, maybe I should guarantee unique rows via a check constraint...
Thanks for making me think about this in a different way!
First off, I'm not clear on why a duplicate anomaly would necessarily be
invalid, given the above. Not useful, certainly, but legitimate real data.
Duplicate anomalies are not invalid; they are only invalid if they are
for the same system, same category, from the same compliance report.
ie. This is bad:
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
This is ok:
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
The only reason the duplicate data would occur is if the same compliance
report was entered into the database twice (ie. the host did not
generate a new compliance report, there is a bug in the data stuffing script,
etc.). Daniel Kalchev from the pgsql-hackers list had a good idea that I
am investigating, which is to have the archive script be responsible for
preventing duplicate entries into the database, thus eliminating the
requirement for an index to do this.
The whole reason I decided to not allow duplicate entries into the
archive table is so that when I generate reports, I do not have to put the
DISTINCT qualifier on the queries, which eliminates the sort and speeds
up the queries. The entire purpose of the index was to maintain good
data integrity in the archive tables for reporting purposes. If I can
enforce the data integrity another way (ie, data stuffer scripts, index
on md5 hash of the data, etc) then I can drop this huge index and be
happy :)
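As a rough sketch of what the stuffer script could do instead of relying on
the big index (illustrative values, names from the example schema above, and
it assumes a single loader so concurrent-insert races are not a concern):
INSERT INTO history (host_id, "timestamp", category, anomaly)
SELECT 1, '2003-02-17 00:34:24', 'p101', 'user x has an invalid shell.'
 WHERE NOT EXISTS
       (SELECT 1
          FROM history
         WHERE host_id = 1
           AND "timestamp" = '2003-02-17 00:34:24'
           AND category = 'p101'
           AND anomaly = 'user x has an invalid shell.');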
I realize that you have performance considerations to work with. However, I
must point out that your database design is not *at all* normalized ... and
that normalizing it might solve some of your indexing problems.
Please do point out these design errors! I am always interested in
learning more about normalization since I do not have any formal DBA
training, and most of my knowledge is from reading books, mailing lists,
and experimenting :)
A normalized structure would look something like:
TABLE operations
id serial not null primary key,
host_id int not null,
timeoccurred timestamp not null default now(),
category varchar(5) not null,
constraint operations_unq unique (host_id, timeoccurred, category)
TABLE anomalies
id serial not null primary key,
operation_id int not null references operations(id) on delete cascade,
anomaly text
This structure would have two advantages for you:
1) the operations table would be *much* smaller than what you have now, as you
would not be repeating rows for each anomaly.
I agree the operations table would be smaller, but you have also added
some data by breaking it up into 2 tables. You have an oid (8 bytes) +
operations.id (4 bytes) + anomalies.id (4 bytes) + operation_id (4
bytes) + tuple overhead[2] (36 bytes).
The anomalies.id and operation_id will be duplicated for all
~85 million rows[1], but we did remove the host_id (4 bytes), timestamp
(8 bytes) and category (~5 bytes) ... for a savings of 9 bytes / row.
My quick estimation shows this saves ~730 MB each in the table and the index,
for a combined total of 1.46 GB. (The reason the index savings in the
anomalies table is not more is explained further in response to point 2.)
So to gain any space savings from breaking the tables up, the total size
of the operations table + primary index + operations_unq index must be
less than 1.46 GB.
The operations table contains oid (8 bytes) + host_id (4 bytes) +
timestamp (8 bytes) + category (~5 bytes) + tuple overhead[2] (36
bytes). Also since every category is duplicated in either the primary
index or the operations_unq index, the index sizes will be approximately
the size of the main table.
So 730 MB / (61 bytes) == ~ 12.5 Million rows.
A quick calculation of hosts * categories * data points shows that we
could have a maximum of ~12 million entries[3] in the operations table,
so this would save some space :)
2) In neither table would you be including the anomaly text in an index ...
thus reducing index size tremendously.
Unless I implement Daniel's method of verifying the uniqueness at the data
insertion point, I will still need to guarantee the anomaly is unique
for the given operation_id. If I misread the table schema, would you
please point it out to me ... I'm probably being dense :)
Also, I do not understand why the anomalies table needs the id key for
the primary key. Wouldn't the operation_id and the anomaly form the
primary key? We could save 8 bytes (4 from table + 4 from index) * ~85
million rows by removing this column.
As a warning, though: you may find that for insert speed the referential
integrity key is better maintained at the middleware layer. We've had some
complaints about slow FK enforcement in Postgres.
Thanks, I will keep this in mind. Although the inserts are usually done
in a batch job ... so interactive speed is generally not an issue ...
but faster is always better :)
Also I am curious ... With the table more normalized, I now need to
perform a join when selecting data.... I realize there will be fewer
pages to read from disk (which is good!) when doing this join, but I
will be interested to see how the join affects the interactive performance
of the queries.... something to test :)
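(The kind of query I would be timing is roughly the one below, using Josh's
table and column names; the date range is just an example:)
SELECT o.host_id, o.timeoccurred, o.category, a.anomaly
  FROM operations o
  JOIN anomalies a ON a.operation_id = o.id
 WHERE o.timeoccurred >= '2003-02-01'
   AND o.timeoccurred <  '2003-03-01';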
Thanks for looking at this, Josh, and providing input. Hopefully by
explaining my figuring and thinking you can see what I am looking at ...
and point out additional flaws in my methods :) Unfortunately I still
do not think I can remove the anomaly field from the index yet, even by
normalizing the tables like you did.
I think Daniel has the correct answer by moving the unique constraint
check out into the stuffer script, or by performing some method of
indexing on a hash as I proposed at the beginning of the thread.
I have figured out a way to reduce my md5 index size by half again, and
have developed a method for dealing with collisions too. I am planning
on running some benchmarks against the current method, with the tables
normalized as Josh suggested, and using the md5 hash I am working on. My
benchmarks will be fairly simple ... average time to insert X number of
valid inserts and average time to insert X number of duplicate inserts
along with disk space usage. If anyone is interested I am willing to
post the results to this list ... and if anyone has some other benchmark
suggestions they would like to see, feel free to let me know.
Thanks much for all the help and insight!
- Ryan
[1]: select count(anomaly) from history;
       count
     ----------
      85221799

[2]: I grabbed the 36 bytes of overhead per tuple from this old FAQ I found
     at http://www.guides.sk/postgresql_docs/faq-english.html
     I did not look at what the current value is today.

[3]: This was a rough calculation of a maximum; I do not believe we are
     at this maximum, so the space savings is most likely larger.
--
Ryan Bradetich <rbradetich@uswest.net>
"Josh Berkus" <josh@agliodbs.com> writes:
Keith,
Does there currently exist any kind of script that can be run on
Postgres to conduct a complete feature coverage test with varying
dataset sizes to compare performance between functionality changes?
No.
If you squint properly, OSDB (http://osdb.sourceforge.net/) might be
thought to do this, or at least be a starting point for it.
regards, tom lane
Tom Lane wrote:
"Josh Berkus" <josh@agliodbs.com> writes:
Keith,
Does there currently exist any kind of script that can be run on
Postgres to conduct a complete feature coverage test with varying
dataset sizes to compare performance between functionality changes?
No.
If you squint properly, OSDB (http://osdb.sourceforge.net/) might be
thought to do this, or at least be a starting point for it.
As a side thought, just today I found out that the TPC organisation
provides all kinds of code freely for building/running the TPC-x tests.
Didn't know that before.
Sure, we can't submit results yet, but we might at least be able to run
the tests and see if anything interesting turns up. People have talked
about the TPC tests before, but I'm not sure if anyone has really looked at
making them runnable on PostgreSQL for everyone yet.
Regards and best wishes,
Justin Clift
regards, tom lane
--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi
On Thu, Feb 20, 2003 at 01:45:25AM +1030, Justin Clift wrote:
As a side thought, just today found out that the TPC organisation
provides all kinds of code freely for building/running the TPC-x tests.
Yes, but be careful what you mean there.
It is _not_ a TPC test unless it is run under tightly-controlled and
-audited conditions as specified by TPC. And that effectively means
you have to pay them. In other words, it's not a TPC test unless you
can get it published by them.
That doesn't mean you can't run a test "based on the documents
provided by the TPC for test definition _x_". Just make sure you
walk on the correct side of the intellectual property line, or you'll
be hearing from their lawyers.
A
--
----
Andrew Sullivan 204-4141 Yonge Street
Liberty RMS Toronto, Ontario Canada
<andrew@libertyrms.info> M2P 2A8
+1 416 646 3304 x110
Ryan,
Posting to the performance list as requested :) The reason I originally
posted to the hackers list was that I was curious about the contents of the
index and how they worked.... but now this thread is more about
performance, so this list is more appropriate.
*shrug* I just figured that you didn't know about the performance list ...
also, I'm doing my bit to reduce traffic on -hackers.
Duplicate anomalies are not invalid; they are only invalid if they are
for the same system, same category, from the same compliance report.
ie. This is bad:
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
This is ok:
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
OK. Given the necessity of verifying anomaly uniqueness, my suggestions
below change somewhat.
Please do point out these design errors! I am always interested in
learning more about normalization since I do not have any formal DBA
training, and most of my knowledge is from reading books, mailing lists,
and experimenting :)
"Practical Issues in Database Management" and "Database Design for Mere
Mortals" are useful. Me, I learned through 5 years of doing the wrong thing
and finding out why it was wrong ...
I agree the operations table would be smaller, but you have also added
some data by breaking it up into 2 tables. You have an oid (8 bytes) +
operations.id (4 bytes) + anomalies.id (4 bytes) + operation_id (4
bytes) + tuple overhead[2] (36 bytes).
Yes, and given your level of traffic, you might have to use 8-byte id fields.
But if disk space is your main issue, then I'd suggest swapping the category
code out to a "categories" table, allowing you to use an int4 or even an int2
as the category_id in the Operations table. This would save you 6-8 bytes per
row in Operations.
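A sketch of that idea (names are illustrative):
CREATE TABLE categories (
    category_id int2        NOT NULL PRIMARY KEY,
    category    varchar(5)  NOT NULL UNIQUE
);
-- operations would then carry "category_id int2 NOT NULL REFERENCES categories"
-- in place of the 5-character category column.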
2) In neither table would you be including the anomaly text in an index
... thus reducing index size tremendously.
Unless I implement Daniel's method of verifying the uniqueness at the data
insertion point, I will still need to guarantee the anomaly is unique
for the given operation_id. If I misread the table schema, would you
please point it out to me ... I'm probably being dense :)
Also, I do not understand why the anomalies table needs the id key for
the primary key. Wouldn't the operation_id and the anomaly form the
primary key? We could save 8 bytes (4 from table + 4 from index) * ~85
million rows by removing this column.
As I said above, I didn't understand why you needed to check anomaly
uniqueness. Given that you do, I'd suggest that you do the above.
Of course, you also have another option. You could check uniqueness on the
operation_id and the md5 of the anomaly field. This would be somewhat
awkward, and would still require that you have a separate primary key for the
anomalies table. But the difference between an index on an up-to-1024
character field and an md5 string might make it worth it, particularly when it
comes to inserting new rows.
In other words, test:
1) dropping the anomaly_id as suggested above;
2) adding an anomaly_md5 column to the anomalies table;
3) making (operation_id, anomaly_md5) your primary key;
4) writing a BEFORE trigger that calculates the md5 of any incoming anomalies
and adds it to that column (a rough sketch follows below).
It's worth testing, since a unique index on a 1024-character field for 85
million records could be very slow.
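A rough sketch of that trigger (step 4), assuming an md5(text) function is
callable from SQL -- Ryan's custom C function, or the md5() built into later
PostgreSQL releases -- and that plpgsql is installed in the database:
CREATE FUNCTION set_anomaly_md5() RETURNS trigger AS '
BEGIN
    -- keep the md5 column in sync with the anomaly text
    NEW.anomaly_md5 := md5(NEW.anomaly);
    RETURN NEW;
END;
' LANGUAGE 'plpgsql';

CREATE TRIGGER anomalies_md5_trg
    BEFORE INSERT OR UPDATE ON anomalies
    FOR EACH ROW EXECUTE PROCEDURE set_anomaly_md5();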
Thanks, I will keep this in mind. Although the inserts are usually done
in a batch job ... so interactive speed is generally not an issue ...
but faster is always better :)
In a transaction, I hope.
Also I am curious ... With the table more normalized, I now need to
perform a join when selecting data.... I realize there will be fewer
pages to read from disk (which is good!) when doing this join, but I
will be interested to see how the join affects the interactive performance
of the queries.... something to test :)
I'll look forward to seeing the results of your test.
If anyone is interested I am willing to
post the results to this list ... and if anyone has some other benchmark
suggestions they would like to see, feel free to let me know.
We're always interested in benchmarks.
--
Josh Berkus
Aglio Database Solutions
San Francisco
Andrew Sullivan wrote:
On Thu, Feb 20, 2003 at 01:45:25AM +1030, Justin Clift wrote:
As a side thought, just today found out that the TPC organisation
provides all kinds of code freely for building/running the TPC-x tests.
Yes, but be careful what you mean there.
It is _not_ a TPC test unless it is run under tightly-controlled and
-audited conditions as specified by TPC. And that effectively means
you have to pay them. In other words, it's not a TPC test unless you
can get it published by them.
That doesn't mean you can't run a test "based on the documents
provided by the TPC for test definition _x_". Just make sure you
walk on the correct side of the intellectual property line, or you'll
be hearing from their lawyers.
Good point.
What I was thinking about was that we could likely get the "test suite"
of code that the TPC publishes and ensure that it works with PostgreSQL.
The reason I'm thinking of this is that if any of the existing PostgreSQL
support companies (or an alliance of them) decided to become a member
of the TPC in order to submit results, then the difficulty of entry would
be that bit lower, and there would be people around at that stage who'd
already gained good experience with the test suite(s).
:-)
Regards and best wishes,
Justin Clift
A
--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi
On Wed, 19 Feb 2003, Ryan Bradetich wrote:
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user x has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has an invalid shell.
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | user y has expired password.
2 | Mon Feb 17 00:34:24 MST 2003 | f101 | file /foo has improper owner.
If you're going to normalize this a bit, you should start looking at
the data that are repeated and trying to get rid of the repetitions.
First of all, the timestamp is repeated a lot; you might move that to a
separate table and just use a key into that table. But you might even
do better with multiple columns: combine the timestamp and host ID into
one table to get a "host report instance" and replace both those columns
with just that. If host-id/timestamp/category triplets are frequently
repeated, you might even consider combining the three into another
table, and just using an ID from that table with each anomaly.
But the biggest space and time savings would come from normalizing your
anomalies themselves, because there's a huge amount repeated there. If you're
able to change the format to something like:
invalid shell for user: x
invalid shell for user: y
expired password for user: y
improper owner for file: /foo
You can split those error messages off into another table:
anomaly_id | anomaly
-----------+------------------------------------------------
1 | invalid shell for user
2 | expired password for user
3 | improper owner for file
And now your main table looks like this:
host_id | timestamp | ctgr | anomaly_id | datum
--------+------------------------------+------+------------+------
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | 1 | x
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | 1 | y
1 | Mon Feb 17 00:34:24 MST 2003 | p101 | 2 | y
2 | Mon Feb 17 00:34:24 MST 2003 | f101 | 3 | /foo
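In rough DDL terms (table and column names are only illustrative), the split
would be:
CREATE TABLE anomaly_types (
    anomaly_id  serial  PRIMARY KEY,
    anomaly     text    NOT NULL UNIQUE
);

CREATE TABLE history (
    host_id     integer    NOT NULL,
    "timestamp" timestamp  NOT NULL,
    category    text       NOT NULL,
    anomaly_id  integer    NOT NULL REFERENCES anomaly_types,
    datum       text       NOT NULL
);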
cjs
--
Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.netbsd.org
Don't you know, in this new Dark Age, we're all light. --XTC