Html parsing and inline elements

Started by Marcelo Zabani about 10 years ago, 7 messages, hackers

mzabani@gmail.com

about 10 years ago

Hi everyone,

I was here wondering whether HTML parsing should separate tokens that are
not separated by spaces in the original text, but are separated by an
inline element. Let me show you an example:

SELECT to_tsvector('english', 'Helloneighbor, you are nice')
Results: "'ce':7 'hello':1 'n':5 'neighbor':2"

"Hello" and "neighbor" should really be separated, because ** is a block
element, but "nice" should be a single word there, since there is no visual
separation when rendered (** and ** are inline elements).
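
To make the desired behavior concrete, here is a minimal Python sketch of a preprocessing step, not something to_tsvector does today. It inserts a space at block-level tag boundaries and nothing at inline ones. The sample markup in the usage note, the BLOCK_TAGS set (deliberately incomplete), and the names BlockAwareTextExtractor and html_to_text are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Assumption: a small, incomplete set of block-level tag names for illustration.
BLOCK_TAGS = {
    "address", "article", "blockquote", "br", "div", "h1", "h2", "h3",
    "h4", "h5", "h6", "li", "ol", "p", "pre", "section", "table", "td",
    "th", "tr", "ul",
}


class BlockAwareTextExtractor(HTMLParser):
    """Collects text, adding a space only at block-level tag boundaries."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Block elements visually separate their content from what precedes them.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        # ... and from what follows them. Inline tags add nothing.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the text of `html` with whitespace normalized."""
    extractor = BlockAwareTextExtractor()
    extractor.feed(html)
    extractor.close()
    # Collapse runs of whitespace into single spaces.
    return " ".join("".join(extractor.parts).split())
```

With hypothetical markup such as 'Hello<div>neighbor</div>, you are <b>n</b><i>i</i>ce', html_to_text keeps "Hello" and "neighbor" apart and leaves "nice" as one word. The resulting plain text could then be passed to to_tsvector.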

Sorry if this has been asked before, but I couldn't find it anywhere.

Thanks in advance,
Marcelo.

Tom Lane

tgl@sss.pgh.pa.us

about 10 years ago

In reply to: Marcelo Zabani (#1)

Re: Html parsing and inline elements

Marcelo Zabani <mzabani@gmail.com> writes:

I was here wondering whether HTML parsing should separate tokens that are
not separated by spaces in the original text, but are separated by an
inline element. Let me show you an example:

*SELECT to_tsvector('english', 'Helloneighbor, you are
nice')*
*Results:** "'ce':7 'hello':1 'n':5 'neighbor':2"*

"Hello" and "neighbor" should really be separated, because ** is a block
element, but "nice" should be a single word there, since there is no visual
separation when rendered (** and ** are inline elements).

I can't imagine that we want to_tsvector to know that much about HTML.
It doesn't, really, even have license to assume that its input *is*
HTML. So even if you see things that look like <foo> and </foo> in the
string, it could easily be XML or SGML or some other SGML-like markup
format with different semantics for the markup keywords.

Perhaps it'd be sane to do something like this as long as the
HTML-specific behavior was broken out into a separate function.
(Or maybe it could be done within to_tsvector as a separate parser
or separate dictionary?) But I don't think it should be part of
the default behavior.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Marcelo Zabani

mzabani@gmail.com

about 10 years ago

In reply to: Tom Lane (#2)

Re: Html parsing and inline elements

Hi, Tom,

You're right, I don't think one can argue that the default parser should
know HTML.
How about your suggestion of there being an HTML parser, is it feasible? I
ask this because I think that a lot of people store HTML documents these
days, and although there probably aren't lots of HTML with words written
along multiple inline elements, it would certainly be nice to have a proper
parser for these use cases.

What do you think?

On Wed, Apr 13, 2016 at 11:09 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Marcelo Zabani <mzabani@gmail.com> writes:

I was here wondering whether HTML parsing should separate tokens that are
not separated by spaces in the original text, but are separated by an
inline element. Let me show you an example:

*SELECT to_tsvector('english', 'Helloneighbor, you are
nice')*
*Results:** "'ce':7 'hello':1 'n':5 'neighbor':2"*

"Hello" and "neighbor" should really be separated, because ** is a block
element, but "nice" should be a single word there, since there is no visual
separation when rendered (** and ** are inline elements).

I can't imagine that we want to_tsvector to know that much about HTML.
It doesn't, really, even have license to assume that its input *is*
HTML. So even if you see things that look like <foo> and </foo> in the
string, it could easily be XML or SGML or some other SGML-like markup
format with different semantics for the markup keywords.

Perhaps it'd be sane to do something like this as long as the
HTML-specific behavior was broken out into a separate function.
(Or maybe it could be done within to_tsvector as a separate parser
or separate dictionary?) But I don't think it should be part of
the default behavior.

regards, tom lane

Bruce Momjian

bruce@momjian.us

about 10 years ago

In reply to: Marcelo Zabani (#3)

Re: Html parsing and inline elements

On Wed, Apr 13, 2016 at 12:57:19PM -0300, Marcelo Zabani wrote:

Hi, Tom,

You're right, I don't think one can argue that the default parser should know
HTML.
How about your suggestion of there being an HTML parser, is it feasible? I ask
this because I think that a lot of people store HTML documents these days, and
although there probably aren't lots of HTML with words written along multiple
inline elements, it would certainly be nice to have a proper parser for these
use cases.

What do you think?

It sounds useful.

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ As you are, so once was I. As I am, so you will be. +
+                     Ancient Roman grave inscription +


David G. Johnston

david.g.johnston@gmail.com

about 10 years ago

In reply to: Bruce Momjian (#4)

Re: Html parsing and inline elements

On Fri, Apr 29, 2016 at 1:47 PM, Bruce Momjian <bruce@momjian.us> wrote:

On Wed, Apr 13, 2016 at 12:57:19PM -0300, Marcelo Zabani wrote:

Hi, Tom,

You're right, I don't think one can argue that the default parser should know
HTML.
How about your suggestion of there being an HTML parser, is it feasible? I ask
this because I think that a lot of people store HTML documents these days, and
although there probably aren't lots of HTML with words written along multiple
inline elements, it would certainly be nice to have a proper parser for these
use cases.

What do you think?

It sounds useful.

It sounds like an external project/extension...

David J.

Oleg Bartunov

oleg@sai.msu.su

about 10 years ago

In reply to: Marcelo Zabani (#3)

Re: Html parsing and inline elements

On Wed, Apr 13, 2016 at 6:57 PM, Marcelo Zabani <mzabani@gmail.com> wrote:

Hi, Tom,

You're right, I don't think one can argue that the default parser should
know HTML.
How about your suggestion of there being an HTML parser, is it feasible? I
ask this because I think that a lot of people store HTML documents these
days, and although there probably aren't lots of HTML with words written
along multiple inline elements, it would certainly be nice to have a proper
parser for these use cases.

What do you think?

I think it could be a useful separate parser. The problem is how to fully
utilize it to facilitate ranking: for example, words in the title could be
considered more important than words in the body. Currently, the setweight()
function provides this separately from the parser.

Parser outputs tokid and token:

If we changed the parser to also output a rank flag, we could use it to
assign different weights.
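
As a rough illustration of that idea outside the parser, the Python sketch below splits a document's text into regions keyed by a tsvector weight label, so that each region can be weighted on its own. The class and function names, the reduced BLOCK_TAGS set, and the choice of 'A' for the title and 'D' for everything else are assumptions, not an existing PostgreSQL interface.

```python
from html.parser import HTMLParser

# Assumption: illustrative, incomplete set of tags treated as word separators.
BLOCK_TAGS = {"title", "div", "p", "li", "h1", "h2", "h3", "td", "br"}


class WeightedRegionExtractor(HTMLParser):
    """Collects text into regions keyed by a tsvector weight label."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.in_title = False
        # Assumption: title text gets weight 'A', all other text gets 'D'.
        self.regions = {"A": [], "D": []}

    def _current(self):
        return self.regions["A" if self.in_title else "D"]

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True
        if tag in BLOCK_TAGS:
            self._current().append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._current().append(" ")
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        self._current().append(data)


def split_by_weight(html):
    """Return {weight_label: normalized_text} for the given HTML string."""
    extractor = WeightedRegionExtractor()
    extractor.feed(html)
    extractor.close()
    return {
        label: " ".join("".join(chunks).split())
        for label, chunks in extractor.regions.items()
    }
```

Each returned string could then go through to_tsvector, be labeled with setweight(), and the results concatenated with ||, which keeps the weighting separate from the parser as it is today.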


Ryan Pedela

rpedela@datalanche.com

about 10 years ago

In reply to: Marcelo Zabani (#3)

Re: Html parsing and inline elements

On Wed, Apr 13, 2016 at 9:57 AM, Marcelo Zabani <mzabani@gmail.com> wrote:

Hi, Tom,

You're right, I don't think one can argue that the default parser should
know HTML.
How about your suggestion of there being an HTML parser, is it feasible? I
ask this because I think that a lot of people store HTML documents these
days, and although there probably aren't lots of HTML with words written
along multiple inline elements, it would certainly be nice to have a proper
parser for these use cases.

What do you think?

I recommend using Apache Tika [1] for plain text extraction from HTML.
There are so many weird edge cases when parsing HTML that it is easier to
use something that is already mature than reinventing the wheel.

1. https://tika.apache.org/

Thanks,
Ryan Pedela