PostgreSQL for storing a database of documents
Is PostgreSQL appropriate for a database of documents? I see several
potential problems.
1. Tuples are limited to 8KB. I assume even the "text" type is limited that
way. But my documents may be longer than that.
2. I need to be able to search by content. I can use a regular-expression
search in a SELECT ... WHERE clause, but I expect that to be exceedingly
inefficient. I would prefer a GLIMPSE-like search ability.
3. I will also be conducting searches by some other fields (like author
or call number). But only one field can be a primary key. Will selection by
other fields be terribly inefficient for large numbers (about 200K) of large
tuples (ranging from about 100B to say 10KB)?
If you can help, please respond not only to this mailing list but also
directly to me. Thanks.
Raphael Finkel <raphael@cs.uky.edu>
On Tue, 4 May 1999, Raphael Finkel wrote:
# Is PostgreSQL appropriate for a database of documents? I see several
# potential problems.
#
# 1. Tuples are limited to 8KB. I assume even the "text" type is limited
# that way. But my documents may be longer than that.
This isn't a problem, I store all of my photographs in postgres.
# 2. I need to be able to search by content. I can use a
# regular-expression search in a SELECT ... WHERE clause, but I expect
# that to be exceedingly inefficient. I would prefer a GLIMPSE-like
# search ability.
This is a problem, unless you use one of the keyword-index approaches
people have been talking about. I would prefer glimpse or something
similar in this situation.
# 3. I will also be conducting searches by some other fields (like author
# or call number). But only one field can be a primary key. Will
# selection by other fields be terribly inefficient for large numbers
# (about 200K) of large tuples (ranging from about 100B to say 10KB)?
The primary key shouldn't be any of these fields, but you can index on
whatever you want. This is where the RDBMS is useful: you can
instantly get all documents by a given author, and/or in a given date range,
and/or on a given topic.
--
SA, beyond.com My girlfriend asked me which one I like better.
pub 1024/3CAE01D5 1994/11/03 Dustin Sallings <dustin@spy.net>
| Key fingerprint = 87 02 57 08 02 D0 DA D6 C8 0F 3E 65 51 98 D8 BE
L_______________________ I hope the answer won't upset her. ____________
Ari Halberstadt wrote:
Raphael Finkel <raphael@cs.uky.edu> wrote:
Is PostgreSQL appropriate for a database of documents? I see several
potential problems.

1. Tuples are limited to 8KB. I assume even the "text" type is limited that
way. But my documents may be longer than that.

This would pretty much nix the idea of using postgres for a document
repository. You'd need to split the data into separate rows in another
table. I've done this for my bulletin board (though most posts should be
less than 8K in my case), but I think it would be even more of a hassle
with documents. What you can do instead is store descriptive information in
the DB and store the actual documents in the file system with a field in
the document table containing the path to the file. A few years ago I
worked on a commercial document management program that stored attributes
in an Oracle database but kept the actual documents on disk.
That's a good solution, but now that XML is starting to take off, it kind of
puts PostgreSQL users at a disadvantage. We have to keep static XML documents on
disk in order to use the format, and use external search engines to search for
content in the documents. It is not that big of a headache, but it is an
inconvenience. I will be doing some work on XML soon for my job, and they are
going straight to Oracle because they want to create "searchable" XML documents on
the fly to post and distribute on the web. Very LARGE documents, like the legal
code for NYC and such. When Postgre starts using large objects it is going to
make a lot of noise, especially for "start-up" companies.
--
Mark Fleming Student ICQ# 16171200
e-mail: mtony@iname.com
http://www.globaldrum.com/mfleming
http://coolbox.bowiestate.edu/~mfleming
Raphael Finkel <raphael@cs.uky.edu> wrote:
Is PostgreSQL appropriate for a database of documents? I see several
potential problems.

1. Tuples are limited to 8KB. I assume even the "text" type is limited that
way. But my documents may be longer than that.
This would pretty much nix the idea of using postgres for a document
repository. You'd need to split the data into separate rows in another
table. I've done this for my bulletin board (though most posts should be
less than 8K in my case), but I think it would be even more of a hassle
with documents. What you can do instead is store descriptive information in
the DB and store the actual documents in the file system with a field in
the document table containing the path to the file. A few years ago I
worked on a commercial document management program that stored attributes
in an Oracle database but kept the actual documents on disk.
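As a rough sketch of that layout (the table and column names here are purely illustrative, not from any real schema), the metadata-plus-path approach could look like:

```sql
-- Hypothetical schema: searchable attributes live in the database,
-- while the document body lives on disk at doc_path.
CREATE TABLE document (
    doc_id      serial PRIMARY KEY,
    author      text,
    call_number text,
    title       text,
    doc_path    text    -- filesystem path to the actual file
);

-- Find the files for one author's documents.
SELECT doc_path FROM document WHERE author = 'Finkel';
```

The application then opens the returned path itself; the database only tracks where each document lives.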
2. I need to be able to search by content. I can use a regular-expression
search in a SELECT ... WHERE clause, but I expect that to be exceedingly
inefficient. I would prefer a GLIMPSE-like search ability.
If it's stored outside the database then you can use any kind of text
searching tool. There's a full text add-on for postgres in
contrib/fulltextindex which I've been meaning to install. You could use
this to index all of the attributes so people can do full text searches on
things like comments and descriptions. When presenting the search results
you could merge the results of searching the files with the results of
searching in the text fields.
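The general shape of such a keyword index (this is only a sketch of the idea, not the actual contrib/fulltextindex schema) is a table of word/document pairs:

```sql
-- One row per (word, document) pair; names are hypothetical.
CREATE TABLE doc_word (
    word   text,
    doc_id int
);
CREATE INDEX doc_word_word_idx ON doc_word (word);

-- Documents whose indexed text mentions 'glimpse'.
SELECT DISTINCT doc_id FROM doc_word WHERE word = 'glimpse';
```

Lookups then hit the index on word instead of regex-scanning every text field.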
3. I will also be conducting searches by some other fields (like author
or call number). But only one field can be a primary key. Will selection by
other fields be terribly inefficient for large numbers (about 200K) of large
tuples (ranging from about 100B to say 10KB)?
You can create as many indexes as you need using create index.
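For example (the field names are taken from the question above; the table itself is hypothetical):

```sql
-- Secondary indexes on the non-key search fields; with these,
-- lookups by author or call number don't scan all 200K rows.
CREATE INDEX document_author_idx ON document (author);
CREATE INDEX document_callno_idx ON document (call_number);
```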
-- Ari Halberstadt mailto:ari@shore.net <http://www.magiccookie.com/>
PGP public key available at <http://www.magiccookie.com/pgpkey.txt>
Dustin Sallings wrote:
On Tue, 4 May 1999, Raphael Finkel wrote:
# Is PostgreSQL appropriate for a database of documents? I see several
# potential problems.
#
# 1. Tuples are limited to 8KB. I assume even the "text" type is limited
# that way. But my documents may be longer than that.

This isn't a problem, I store all of my photographs
in postgres.
How?
On Tue, 4 May 1999, Mark Fleming wrote:
// That's a good solution, but now that XML is starting to take off, it
// kind of puts PostgreSQL users at a disadvantage. We have to keep
// static XML documents on disk in order to use the format, and use
// external search engines to search for content in the documents. It
// is not that big of a headache, but it is an inconvenience. I will be
// doing some work on XML soon for my job, and they are going straight
// to Oracle because they want to create "searchable" XML documents on the
// fly to post and distribute on the web. Very LARGE documents, like
// the legal code for NYC and such. When Postgre starts using large
// objects it is going to make a lot of noise, especially for
// "start-up" companies.
What do you mean, when Postgres starts using large objects? It has had
them for quite some time now. I chose not to use them because I don't like the
way the data is stored on disk, though.
--
Principal Member Technical Staff, beyond.com The world is watching America,
pub 1024/3CAE01D5 1994/11/03 Dustin Sallings <dustin@spy.net>
| Key fingerprint = 87 02 57 08 02 D0 DA D6 C8 0F 3E 65 51 98 D8 BE
L______________________________________________ and America is watching TV. __
On Wed, 5 May 1999, Chris Bitmead wrote:
// > # 1. Tuples are limited to 8KB. I assume even the "text" type is limited
// > # that way. But my documents may be longer than that.
// >
// > This isn't a problem, I store all of my photographs
// > in postgres.
//
// How?
I played with BLOBs, but I didn't like how they were stored, or
that they didn't get backed up, so I went with Base64-encoded data which I split
up to store. It's pretty damned quick, and I get a lot more control.
--
Principal Member Technical Staff, beyond.com The world is watching America,
pub 1024/3CAE01D5 1994/11/03 Dustin Sallings <dustin@spy.net>
| Key fingerprint = 87 02 57 08 02 D0 DA D6 C8 0F 3E 65 51 98 D8 BE
L______________________________________________ and America is watching TV. __
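A sketch of the chunked-storage approach described above (the table and column names are invented for illustration, not Dustin's actual schema):

```sql
-- Each photograph is Base64-encoded, split into pieces small enough
-- to fit in a row, and reassembled in seq order on the way out.
CREATE TABLE photo_chunk (
    photo_id int,
    seq      int,    -- chunk order within the photograph
    data     text,   -- Base64 text, comfortably under the 8KB row limit
    PRIMARY KEY (photo_id, seq)
);

-- Fetch the chunks of one photo in order; the client concatenates
-- and Base64-decodes them.
SELECT data FROM photo_chunk WHERE photo_id = 42 ORDER BY seq;
```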
Hi,

I'm using Large Objects to store images and display them on the fly. I have never
found any problems, and it is very quick. "When Postgre starts using large objects
it is going to make a lot of noise." ???

You can also generate XML on the fly, or whatever else you want, with Perl or
other CGI tools, simply using all your different fields.

Concerning the backup of BLOBs, I keep a reference for each object and export each
of them from a crontab job, using the reference as the file name. Of course the
OIDs are lost, but you really don't need them! All works fine again!
I have written a Perl script for regexp searches over all the text fields stored
in my database, from a www interface. It's very simple but it also works fine. I
do an AND search with all the given keywords, and if nothing is found I do an OR
search. This is not very powerful, but I can do better the day I win some free
time :-) I don't know about searching inside BLOBs.
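The AND-then-OR strategy could be expressed as two queries along these lines (the table and column names are hypothetical; the Perl script would build the WHERE clause from the submitted keywords):

```sql
-- First pass: every keyword must appear in some text field.
SELECT doc_id FROM document
WHERE (title ~* 'cat' OR body ~* 'cat')
  AND (title ~* 'dog' OR body ~* 'dog');

-- Fallback if the first pass returns nothing: any keyword matches.
SELECT doc_id FROM document
WHERE title ~* 'cat' OR body ~* 'cat'
   OR title ~* 'dog' OR body ~* 'dog';
```

As noted earlier in the thread, each ~* match is a sequential regex scan, so this stays affordable only while the table is modest.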
So, I get thousands of hits each month on a P90 with 16 MB of RAM (really not
enough); it's slow, but in 6 months I haven't found any problems and my server
log file is empty.

I'm making a little modification now, because images stored as Large Objects are
not cached by proxies, so it is very expensive for my bandwidth (64Kb). So I will
only store the path of each image and save the images as normal files.
Gilles.
darold wrote:
I'm using Large Objects to store images and display them on the fly. I have never
found any problems, and it is very quick. "When Postgre starts using large objects
it is going to make a lot of noise." ???
How many images are you storing?
--
Chris Bitmead
http://www.bigfoot.com/~chris.bitmead
mailto:chris.bitmead@bigfoot.com
Hi,
Right now I have fewer than a thousand gif/jpg images, around 6 to 15 KB each. I
guess I can go higher :-))
regards,
Gilles Darold
Chris Bitmead wrote:
darold wrote:
I'm using Large Objects to store images and display them on the fly. I have never
found any problems, and it is very quick. "When Postgre starts using large objects
it is going to make a lot of noise." ???

How many images are you storing?
--
Chris Bitmead
http://www.bigfoot.com/~chris.bitmead
mailto:chris.bitmead@bigfoot.com