tsearch2 and pdf files

Started by philip johnson over 19 years ago, 7 messages, general
#1philip johnson
philip.johnson@atempo.com

I'm using PostgreSQL 8.1.5.

Tsearch2 is installed and runs well.

I'd like to use tsearch2 to index PDF files.

Does anyone have a detailed process for implementing that?

#2Hannes Dorbath
light@theendofthetunnel.de
In reply to: philip johnson (#1)
Re: tsearch2 and pdf files

You just need software that extracts the text from the PDF. Search Google for
pdf2txt and similar tools. Printer drivers that try to extract text from
anything are available as well.


--
Regards,
Hannes Dorbath

#3philip johnson
philip.johnson@atempo.com
In reply to: Hannes Dorbath (#2)
Re: tsearch2 and pdf files

Do you know what kind of table I should use?
Is there a shell script or a PHP script that does the work?

regards


#4Henrik Zagerholm
henke@mac.se
In reply to: philip johnson (#3)
Re: tsearch2 and pdf files

1. Convert the PDF to a text file with e.g. xpdf.
2. Insert the parsed text into a table of your choice.
3. Make vectors from the text.

Cheers,
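
The three steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not a tested recipe from the thread: the `documents` table, its column names, and the helper names are hypothetical; `pdftotext` is the command-line extractor shipped with xpdf; `conn` is assumed to be a DB-API connection whose driver uses `%s` placeholders (psycopg2 does); and the one-argument `to_tsvector()` relies on tsearch2 picking its configuration from the server locale.

```python
import subprocess

# Hypothetical schema: the raw text is kept next to its tsvector.
SCHEMA_SQL = """
CREATE TABLE documents (
    id       serial PRIMARY KEY,
    filename text,
    body     text,
    vector   tsvector
);
CREATE INDEX documents_vector_idx ON documents USING gist (vector);
"""

# Step 2 and step 3 in one statement: store the text and its vector.
INSERT_SQL = (
    "INSERT INTO documents (filename, body, vector) "
    "VALUES (%s, %s, to_tsvector(%s))"
)


def pdf_to_text(pdf_path, runner=subprocess.run):
    """Step 1: extract text with pdftotext; '-' sends the text to stdout."""
    result = runner(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def index_pdf(conn, pdf_path, text=None):
    """Insert one PDF's text and tsvector through a DB-API connection."""
    if text is None:
        text = pdf_to_text(pdf_path)
    cur = conn.cursor()
    cur.execute(INSERT_SQL, (pdf_path, text, text))
    conn.commit()
```

A search would then match against the `vector` column, for example `SELECT filename FROM documents WHERE vector @@ to_tsquery('postgres')`. If the locale-based default configuration does not suit the documents, tsearch2 also accepts a configuration name as the first argument to `to_tsvector()` and `to_tsquery()`.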


#5Magnus Hagander
magnus@hagander.net
In reply to: Henrik Zagerholm (#4)
Re: tsearch2 and pdf files


Actually, if you're not going to use the headline() function, you can
store the text directly as a vector, cutting down on the size
requirements: just insert the to_tsvector() result. The full text is
required for headline(), though, so you can't cheat on that.

//Magnus
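
To make the trade-off concrete, here is a hedged Python sketch of the two variants. The `documents` table, its columns, and the `search` helper are hypothetical names chosen for illustration; `conn` is assumed to be a DB-API connection with `%s` placeholders (as in psycopg2). Only `to_tsvector()`, `to_tsquery()`, `headline()`, and the `@@` operator come from tsearch2 itself.

```python
# Hypothetical table: documents(filename text, body text, vector tsvector).

# Variant A: keep only the vector. Smaller, but headline() has no text to use.
INSERT_VECTOR_ONLY = (
    "INSERT INTO documents (filename, vector) "
    "VALUES (%s, to_tsvector(%s))"
)

# Search that needs only the vector column.
SEARCH_PLAIN = (
    "SELECT filename "
    "FROM documents, to_tsquery(%s) AS q "
    "WHERE vector @@ q"
)

# Variant B: the body column must be stored for headline() to build excerpts.
SEARCH_WITH_HEADLINE = (
    "SELECT filename, headline(body, q) "
    "FROM documents, to_tsquery(%s) AS q "
    "WHERE vector @@ q"
)


def search(conn, query, with_headline=False):
    """Run a tsearch2 query; excerpts are returned only if body is stored."""
    sql = SEARCH_WITH_HEADLINE if with_headline else SEARCH_PLAIN
    cur = conn.cursor()
    cur.execute(sql, (query,))
    return cur.fetchall()
```

With variant A the `body` column can be dropped entirely, and `search(conn, "postgres")` still works; `search(conn, "postgres", with_headline=True)` only makes sense when the text was stored.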

#6philip johnson
philip.johnson@atempo.com
In reply to: Magnus Hagander (#5)
Re: tsearch2 and pdf files


What size requirements?


#7Magnus Hagander
magnus@hagander.net
In reply to: philip johnson (#6)
Re: tsearch2 and pdf files


If you store both the text and the tsvector, that will use a lot more
space than storing just the tsvector. With a proper lexer and such,
it will be *more* than twice as large, since the tsvector will be
smaller than the text.

//Magnus
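
The "more than twice" claim is simple arithmetic, shown here as a tiny Python sketch. The byte counts are made-up numbers for illustration only, not measurements from the thread.

```python
def storage_ratio(text_bytes, vector_bytes):
    """Space for text plus tsvector, relative to the tsvector alone."""
    return (text_bytes + vector_bytes) / vector_bytes


# Hypothetical sizes: 1000 bytes of text, 400 bytes of tsvector.
print(storage_ratio(1000, 400))  # 3.5
```

Whenever the tsvector is smaller than the text, the ratio is 1 + text/vector, which is always greater than 2.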