Replacement for Oracle Text

Started by Daniel Westermann (DWE), over 10 years ago, 15 messages, general

daniel.westermann@dbi-services.com

over 10 years ago

Hi,

If I needed to replace Oracle Text (ww.oracle.com/technetwork/testcontent/index-098492.html), what choices would I have in PostgreSQL (9.5+)?

Regards
Daniel

Thomas Kellerer

spam_eater@gmx.net

over 10 years ago

In reply to: Daniel Westermann (DWE) (#1)

Re: Replacement for Oracle Text

Daniel Westermann wrote on 19.02.2016 at 11:53:

if I'd need to implement/replace Oracle Text (ww.oracle.com/technetwork/testcontent/index-098492.html).
What choices do I have in PostgreSQL (9.5+) ?

Postgres also has a full text search (which I find much easier to use than Oracle's):

http://www.postgresql.org/docs/current/static/textsearch.html

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

Daniel Westermann (DWE)

daniel.westermann@dbi-services.com

over 10 years ago

In reply to: Thomas Kellerer (#2)

Re: Replacement for Oracle Text

Daniel Westermann wrote on 19.02.2016 at 11:53:

if I'd need to implement/replace Oracle Text (ww.oracle.com/technetwork/testcontent/index-098492.html).

What choices do I have in PostgreSQL (9.5+) ?

Postgres also has a full text search (which I find much easier to use than Oracle's):

http://www.postgresql.org/docs/current/static/textsearch.html

Yes, I have seen this. Can it be used to index and search binary documents, e.g. PDF?

Thomas Kellerer

spam_eater@gmx.net

over 10 years ago

In reply to: Daniel Westermann (DWE) (#3)

Re: Replacement for Oracle Text

Daniel Westermann wrote on 19.02.2016 at 12:41:

if I'd need to implement/replace Oracle Text (ww.oracle.com/technetwork/testcontent/index-098492.html).

What choices do I have in PostgreSQL (9.5+) ?

Postgres also has a full text search (which I find much easier to use than Oracle's):

http://www.postgresql.org/docs/current/static/textsearch.html

Yes, i have seen this. Can this be used to index and search binary documents, e.g. pdf ?

Ah, no. That's not possible


Andreas Joseph Krogh

andreas@visena.com

over 10 years ago

In reply to: Daniel Westermann (DWE) (#3)

Re: Replacement for Oracle Text

On Friday, 19 February 2016 at 12:41:49, Daniel Westermann <daniel.westermann@dbi-services.com> wrote:

Daniel Westermann wrote on 19.02.2016 at 11:53:

if I'd need to implement/replace Oracle Text (ww.oracle.com/technetwork/testcontent/index-098492.html).

What choices do I have in PostgreSQL (9.5+) ?

Postgres also has a full text search (which I find much easier to use than Oracle's):

http://www.postgresql.org/docs/current/static/textsearch.html

Yes, i have seen this. Can this be used to index and search binary documents, e.g. pdf ?

What we do is extract plain text from PDF, Word, etc. on the client side, in the
application, and then index that text in the database.
This works very well.
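
A minimal sketch of this application-side approach, assuming a hypothetical `documents` table with `title`, `body_text` and `body_tsv` (tsvector) columns, and a DB-API cursor that takes `%s` placeholders (as psycopg2 does). The text is assumed to have been extracted already; none of these names come from the thread.

```python
# Hypothetical schema (an assumption, not from the thread):
#   CREATE TABLE documents (id serial PRIMARY KEY, title text,
#                           body_text text, body_tsv tsvector);
#   CREATE INDEX documents_tsv_idx ON documents USING gin (body_tsv);

INSERT_SQL = (
    "INSERT INTO documents (title, body_text, body_tsv) "
    "VALUES (%s, %s, to_tsvector('english', %s))"
)

SEARCH_SQL = (
    "SELECT title FROM documents "
    "WHERE body_tsv @@ plainto_tsquery('english', %s)"
)


def store_document(cursor, title, extracted_text):
    """Insert already-extracted plain text; PostgreSQL builds the tsvector."""
    cursor.execute(INSERT_SQL, (title, extracted_text, extracted_text))


def search_documents(cursor, terms):
    """Return the titles of documents whose text matches all given terms."""
    cursor.execute(SEARCH_SQL, (terms,))
    return [row[0] for row in cursor.fetchall()]
```

The database never sees the binary format here; it only receives text, so the same two statements would serve for Word or any other source the application can convert.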

-- Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
andreas@visena.com
www.visena.com

Simon Riggs

simon@2ndQuadrant.com

over 10 years ago

In reply to: Thomas Kellerer (#4)

Re: Replacement for Oracle Text

On 19 February 2016 at 11:46, Thomas Kellerer <spam_eater@gmx.net> wrote:

Daniel Westermann wrote on 19.02.2016 at 12:41:

if I'd need to implement/replace Oracle Text (ww.oracle.com/technetwork/testcontent/index-098492.html).

What choices do I have in PostgreSQL (9.5+) ?

Postgres also has a full text search (which I find much easier to use than Oracle's):

http://www.postgresql.org/docs/current/static/textsearch.html

Yes, i have seen this. Can this be used to index and search binary documents, e.g. pdf ?

Ah, no. That's not possible

...not possible, yet.

PostgreSQL grows by adding the features people need, and it's changing
rapidly.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Bruce Momjian

bruce@momjian.us

over 10 years ago

In reply to: Simon Riggs (#6)

Re: Replacement for Oracle Text

On Fri, Feb 19, 2016 at 11:53:26AM +0000, Simon Riggs wrote:

On 19 February 2016 at 11:46, Thomas Kellerer <spam_eater@gmx.net> wrote:

Daniel Westermann wrote on 19.02.2016 at 12:41:

if I'd need to implement/replace Oracle Text (ww.oracle.com/technetwork/testcontent/index-098492.html).

What choices do I have in PostgreSQL (9.5+) ?

Postgres also has a full text search (which I find much easier to use than Oracle's):

http://www.postgresql.org/docs/current/static/textsearch.html

Yes, i have seen this. Can this be used to index and search binary documents, e.g. pdf ?

Ah, no. That's not possible

...not possible, Yet.

PostgreSQL grows by adding the features people need and its changing rapidly.

I wonder if PL/Perl could be used to extract the words from a PDF
document and create a tsvector column from it.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +


Sándor Daku

daku.sandor@gmail.com

over 10 years ago

In reply to: Bruce Momjian (#7)

Re: Replacement for Oracle Text

On 19 February 2016 at 14:19, Bruce Momjian <bruce@momjian.us> wrote:

On Fri, Feb 19, 2016 at 11:53:26AM +0000, Simon Riggs wrote:

On 19 February 2016 at 11:46, Thomas Kellerer <spam_eater@gmx.net> wrote:

Daniel Westermann wrote on 19.02.2016 at 12:41:

if I'd need to implement/replace Oracle Text (ww.oracle.com/technetwork/testcontent/index-098492.html).

What choices do I have in PostgreSQL (9.5+) ?

Postgres also has a full text search (which I find much easier to use than Oracle's):

http://www.postgresql.org/docs/current/static/textsearch.html

Yes, i have seen this. Can this be used to index and search binary documents, e.g. pdf ?

Ah, no. That's not possible

...not possible, Yet.

PostgreSQL grows by adding the features people need and its changing rapidly.

I wonder if PLPerl could be used to extract the words from a PDF
document and create a tsvector column from it.

I don't know about PL/Perl (though I'm pretty sure it could be used for this
purpose). On the other hand, I've written code for this in Python,
which should be easy to adapt for PL/Python if necessary.


Bruce Momjian

bruce@momjian.us

over 10 years ago

In reply to: Sándor Daku (#8)

Re: Replacement for Oracle Text

On Fri, Feb 19, 2016 at 02:49:16PM +0100, s d wrote:

On 19 February 2016 at 14:19, Bruce Momjian <bruce@momjian.us> wrote:

Ah, no. That's not possible

...not possible, Yet.

PostgreSQL grows by adding the features people need and its changing rapidly.

I wonder if PLPerl could be used to extract the words from a PDF
document and create a tsvector column from it.

I don't know about PLPerl(I'm pretty sure it could be used for this purpose,
though.). On the other hand I've written code for this in Python which should
be easy to adapt for PLPython, if necessary.

Right, so you would write a PL/Perl or PL/Python trigger function that
would populate the tsvector column on every INSERT or UPDATE.
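
The row-level logic of such a trigger might look like the sketch below. It is written as plain Python so it can be run outside the database: `td` stands in for the `TD` dictionary that PL/Python passes to trigger functions, while `extract_text` and `to_tsvector` are hypothetical callables supplied by the caller (inside PostgreSQL the latter would be a query issued through `plpy.execute`). The column names `pdf` and `body_tsv` are assumptions as well.

```python
def populate_tsv(td, extract_text, to_tsvector):
    """Fill the tsvector column of the row being inserted or updated.

    Mirrors the body of a BEFORE INSERT OR UPDATE ... FOR EACH ROW trigger:
    PL/Python exposes the new row as TD["new"], and returning "MODIFY"
    tells PostgreSQL that the row was changed and should be stored as-is.
    """
    new_row = td["new"]
    if new_row["pdf"] is None:
        # No document attached, so there is nothing to index.
        new_row["body_tsv"] = None
    else:
        # extract_text is a stand-in for a PDF-to-text library or tool.
        text = extract_text(new_row["pdf"])
        new_row["body_tsv"] = to_tsvector(text)
    return "MODIFY"
```

Whether the extraction belongs inside the database at all is a design choice; a trigger keeps the column consistent no matter which client writes the row, at the cost of running the extraction in the server.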

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+ Roman grave inscription                             +


#10

Daniel Westermann (DWE)

daniel.westermann@dbi-services.com

over 10 years ago

In reply to: Bruce Momjian (#9)

Re: Replacement for Oracle Text

I don't know about PLPerl(I'm pretty sure it could be used for this purpose,
though.). On the other hand I've written code for this in Python which should
be easy to adapt for PLPython, if necessary.

Right, so you would write a PL/Perl or PL/Python trigger function that
would populate the tsvector column on every INSERT or UPDATE.

Thanks to all for your input
Daniel

#11

Josh Berkus

josh@agliodbs.com

over 10 years ago

In reply to: Daniel Westermann (DWE) (#1)

Re: Replacement for Oracle Text

On 02/19/2016 05:49 AM, s d wrote:

On 19 February 2016 at 14:19, Bruce Momjian <bruce@momjian.us> wrote:

I wonder if PLPerl could be used to extract the words from a PDF
document and create a tsvector column from it.

I don't know about PLPerl(I'm pretty sure it could be used for this
purpose, though.). On the other hand I've written code for this in
Python which should be easy to adapt for PLPython, if necessary.

I'd swear someone already built something to do this. All you need is a
library which reads PDF and transforms it into text, and then you can
FTS it. I know there's a module for OpenOffice docs somewhere as well,
but heck if I can remember where.

--
Josh Berkus
Red Hat OSAS
(any opinions are my own)


#12

Oleg Bartunov

oleg@sai.msu.su

over 10 years ago

In reply to: Josh Berkus (#11)

Re: Replacement for Oracle Text

On Fri, Feb 19, 2016 at 8:28 PM, Josh berkus <josh@agliodbs.com> wrote:

On 02/19/2016 05:49 AM, s d wrote:

On 19 February 2016 at 14:19, Bruce Momjian <bruce@momjian.us> wrote:

I wonder if PLPerl could be used to extract the words from a PDF
document and create a tsvector column from it.

I don't know about PLPerl(I'm pretty sure it could be used for this
purpose, though.). On the other hand I've written code for this in
Python which should be easy to adapt for PLPython, if necessary.

I'd swear someone already built something to do this. All you need is a
library which reads PDF and transforms it into text, and then you can FTS
it. I know there's a module for OpenOffice docs somewhere as well, but
heck if I can remember where.

I used pdftotext for that.
I think it'd be useful to have extension(s) that can convert
anything to text. I remember someone indexing chemical formulae, TeX/LaTeX,
and DOC files.
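
As a rough illustration of the pdftotext route, the sketch below shells out to the `pdftotext` command-line tool and returns what it prints. It assumes `pdftotext` is installed and on the PATH; the function names and the injectable `run` parameter (there only so the call can be exercised without the tool) are illustrative choices, not anything described in the thread.

```python
import subprocess
import tempfile


def pdftotext_command(pdf_path):
    """Build the argument list; '-' as output file sends the text to stdout."""
    return ["pdftotext", "-enc", "UTF-8", pdf_path, "-"]


def extract_pdf_text(pdf_bytes, run=subprocess.run):
    """Write the PDF to a temporary file and return the text pdftotext emits."""
    with tempfile.NamedTemporaryFile(suffix=".pdf") as tmp:
        tmp.write(pdf_bytes)
        tmp.flush()  # make sure the bytes are on disk before the tool reads them
        result = run(pdftotext_command(tmp.name),
                     capture_output=True, check=True)
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string can then be passed to `to_tsvector` like any other text value.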


#13

Stephen Davies

sdavies@sdc.com.au

over 10 years ago

In reply to: Bruce Momjian (#9)

Re: Replacement for Oracle Text

On 20/02/16 00:24, Bruce Momjian wrote:

On Fri, Feb 19, 2016 at 02:49:16PM +0100, s d wrote:

On 19 February 2016 at 14:19, Bruce Momjian <bruce@momjian.us> wrote:

Ah, no. That's not possible

...not possible, Yet.

PostgreSQL grows by adding the features people need and its changing

rapidly.

I wonder if PLPerl could be used to extract the words from a PDF
document and create a tsvector column from it.

I don't know about PLPerl(I'm pretty sure it could be used for this purpose,
though.). On the other hand I've written code for this in Python which should
be easy to adapt for PLPython, if necessary.

Right, so you would write a PL/Perl or PL/Python trigger function that
would populate the tsvector column on every INSERT or UPDATE.

FWIW, I just use pdftotext in my CGI.

--
=============================================================================
Stephen Davies Consulting P/L Phone: 08-8177 1595
Adelaide, South Australia. Mobile:040 304 0583


#14

Chris Travers

chris.travers@gmail.com

over 10 years ago

In reply to: Stephen Davies (#13)

Re: Replacement for Oracle Text

A more general way would be to have a function that takes a PDF and
returns the text. Mark it immutable.

Then you can index the result of converting that text to a tsvector.

You may want to pull everything into a tsvector column for ease of review,
but functional indexes also make that less important.
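
A sketch of what that could look like from application code, assuming a hypothetical SQL function `pdf_to_text(bytea)` that has been declared IMMUTABLE and a `documents` table with `id` and `pdf` columns (all of these names are illustrative). The two-argument form of `to_tsvector` is used because an expression index needs the text search configuration spelled out, and the query repeats the indexed expression so the planner can match it to the index.

```python
CREATE_INDEX_SQL = (
    "CREATE INDEX documents_pdf_fts_idx ON documents "
    "USING gin (to_tsvector('english', pdf_to_text(pdf)))"
)

SEARCH_SQL = (
    "SELECT id FROM documents "
    "WHERE to_tsvector('english', pdf_to_text(pdf)) "
    "@@ plainto_tsquery('english', %s)"
)


def create_index(cursor):
    """Create the expression (functional) index over the extracted text."""
    cursor.execute(CREATE_INDEX_SQL)


def search_ids(cursor, terms):
    """Return ids of documents whose extracted text matches the terms."""
    cursor.execute(SEARCH_SQL, (terms,))
    return [row[0] for row in cursor.fetchall()]
```

Marking the function immutable is a promise to PostgreSQL that the same input always yields the same output; if the extraction tool behind it changes, the index would likely need to be rebuilt.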

On Sat, Feb 20, 2016 at 1:10 AM, Stephen Davies <sdavies@sdc.com.au> wrote:

On 20/02/16 00:24, Bruce Momjian wrote:

On Fri, Feb 19, 2016 at 02:49:16PM +0100, s d wrote:

On 19 February 2016 at 14:19, Bruce Momjian <bruce@momjian.us> wrote:

Ah, no. That's not possible

...not possible, Yet.

PostgreSQL grows by adding the features people need and its changing
rapidly.

I wonder if PLPerl could be used to extract the words from a PDF
document and create a tsvector column from it.

I don't know about PLPerl(I'm pretty sure it could be used for this purpose,
though.). On the other hand I've written code for this in Python which should
be easy to adapt for PLPython, if necessary.

Right, so you would write a PL/Perl or PL/Python trigger function that
would populate the tsvector column on every INSERT or UPDATE.

FWIW, I just use pdftotext in my CGI.

--

=============================================================================
Stephen Davies Consulting P/L Phone: 08-8177 1595
Adelaide, South Australia. Mobile:040 304 0583


--
Best Wishes,
Chris Travers

Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor
lock-in.
http://www.efficito.com/learn_more

#15

Stephen Davies

sdavies@sdc.com.au

over 10 years ago

In reply to: Chris Travers (#14)

Re: Replacement for Oracle Text

On 20/02/16 16:21, Chris Travers wrote:

A more general way would be to have a function which takes a pdf in and
returns the text. Mark it immutable.

Then you can index the output of converting that text to a tsvector.

You may want to pull everything into a tsvector column for ease of review, but
functional indexes also make that less important

On Sat, Feb 20, 2016 at 1:10 AM, Stephen Davies <sdavies@sdc.com.au> wrote:

On 20/02/16 00:24, Bruce Momjian wrote:

On Fri, Feb 19, 2016 at 02:49:16PM +0100, s d wrote:

On 19 February 2016 at 14:19, Bruce Momjian <bruce@momjian.us> wrote:

Ah, no. That's not possible

...not possible, Yet.

PostgreSQL grows by adding the features people need and its changing
rapidly.

I wonder if PLPerl could be used to extract the words from a PDF
document and create a tsvector column from it.

I don't know about PLPerl(I'm pretty sure it could be used for this purpose,
though.). On the other hand I've written code for this in Python which should
be easy to adapt for PLPython, if necessary.

Right, so you would write a PL/Perl or PL/Python trigger function that
would populate the tsvector column on every INSERT or UPDATE.

FWIW, I just use pdftotext in my CGI.

--
=============================================================================
Stephen Davies Consulting P/L Phone: 08-8177 1595
Adelaide, South Australia. Mobile:040 304 0583


--
Best Wishes,
Chris Travers

Efficito: Hosted Accounting and ERP. Robust and Flexible. No vendor lock-in.
http://www.efficito.com/learn_more

I reckon my approach is simpler and easier (given web-based data entry).
I get all the metadata plus the PDF BLOB in one HTTP request, extract the
text, and do the insert and all indexing, including the tsvector, in one PG request.
It also makes it easier to handle BLOB types other than PDF in the same CGI
script, as I just include the extracted text in the PG request.
There are readily callable text extraction utilities similar to pdftotext for
all BLOB types that I see.

With a function, I would need either separate functions or an extra BLOB-type
parameter to the function, plus separate extraction logic inside it.
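
One way to picture this single-request approach is a small dispatch table in the application, sketched below. The `documents` table, its columns, and the content-type keys are assumptions for illustration; each extractor is just a callable from bytes to text, which in practice might wrap a utility such as pdftotext.

```python
INSERT_SQL = (
    "INSERT INTO documents (title, content_type, data, body_tsv) "
    "VALUES (%s, %s, %s, to_tsvector('english', %s))"
)


def extract_text(blob, content_type, extractors):
    """Pick the extractor registered for this content type and run it."""
    try:
        extractor = extractors[content_type]
    except KeyError:
        raise ValueError("no text extractor for " + content_type)
    return extractor(blob)


def build_request(metadata, blob, content_type, extractors):
    """Return SQL and parameters for one INSERT carrying metadata, BLOB and text."""
    text = extract_text(blob, content_type, extractors)
    return INSERT_SQL, (metadata["title"], content_type, blob, text)
```

Supporting a new BLOB type then means adding one entry to the `extractors` dictionary; the SQL stays the same.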

--
=============================================================================
Stephen Davies Consulting P/L Phone: 08-8177 1595
Adelaide, South Australia. Mobile:040 304 0583
