How to use full-text search URL parser to filter query results by domain name?
I am trying to understand how to use the full-text search parser for
URLs and hostnames to filter results from a text field containing URLs
based on domain, and also how to index text columns for fast
lookup/matching based on domain.
I have a PostgreSQL database containing documents and links downloaded
by a web crawler, with the following tables:
pages
----------
id: Integer (primary key)
url: String (unique)
title: String
text: String
html: String
last_visit: DateTime
word_pos: TSVECTOR
links
----------
id: Integer (primary key)
source: String
target: String
link_text: String
UNIQUE(source,target)
crawls
---------
id: Integer (primary key)
query: String
crawl_results
-------------
id: Integer (primary key)
score: Integer (constraint 0<=score<=1)
crawl_id: Integer (foreign key, crawls.id)
page_id: Integer (foreign key, pages.id)
The `source` and `target` fields in the `links` table contain URLs. I am
running the following query to extract scored links from the top-ranking
search results, for pages that haven't been fetched yet:
WITH top_results AS
(SELECT page_id, score FROM crawl_results
WHERE crawl_id=$1
ORDER BY score LIMIT 100)
SELECT top_results.score, l.target
FROM top_results
JOIN pages p ON top_results.page_id=p.id
JOIN links l ON p.url=l.source
WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE l.target=pp.url)
However, *I would like to filter these results so that only one row is
returned for a given domain (the one with the lowest score)*. So for
instance, if I get (0.3, 'http://www.foo.com/bar') and (0.8,
'http://www.foo.com/zor'), I only want the first, because it has the same
domain `foo.com` and the lower score.
I was able to find documentation for the builtin full text search
parsers <https://www.postgresql.org/docs/11/textsearch-parsers.html>,
which can parse URLs and extract the hostname. For instance, I can
extract the hostname from a URL as follows:
SELECT token FROM ts_parse('default', 'http://www.foo.com') WHERE tokid = 6;
token
-------------
www.foo.com
(1 row)
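As an aside, the magic number `tokid = 6` can be looked up rather than hard-coded: `ts_token_type()` lists the token types the parser produces, including the 'host' alias I'm relying on here (untested sketch against my database):

```sql
-- List the default parser's URL-related token types and their tokids
SELECT tokid, alias, description
FROM ts_token_type('default')
WHERE alias IN ('host', 'url', 'url_path');
```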
However, I can't figure out how I would integrate this into the above
query to filter out duplicate domains from the results. And because
`ts_parse()` only appears in the docs under "testing and debugging text search
<https://www.postgresql.org/docs/11/textsearch-debugging.html#TEXTSEARCH-PARSER-TESTING>",
I don't know if this use of `ts_parse()` is even related to how the URL
parser is intended to be used in practice.
How would I use the "host" parser in my query above to return one row
per domain? Also, how would I appropriately index the "links" table for
"host" and "url" token lookup?
Thanks!
Hi,
I have no real idea about solving the complete problem, and would probably try
something with a temp table first.
For extracting the hostname from a URL you could use
select regexp_replace(url, 'https?://([^/]*).*', '\1')
instead of the fulltext parser
Best regards
Wolfgang
On 4/6/19 11:42 PM, hamann.w@t-online.de wrote:
Thanks Wolfgang, I understand that I could implement a function using
regex for this, or just create an extra column/table to store the
hostname data. But there are other parts of the application where I'll
need to extract the URL path, others where I'll want to extract the
scheme, etc. Since Postgres has builtin capabilities for parsing URLs to
do all of this, I'd rather just use the builtin functions instead of writing them
myself using regex or having to generate a temp table each time I do a
lookup.
So although I'm aware that there are a variety of ways to extract
hostname (right now I'm just doing it in Python), I'm really most
interested in understanding how to use the builtin Postgres URL parsers
to extract host, url path, etc and how to appropriately create indexes
based on them. The documentation for the URL parser is very sparse, and
I can't find much info online either.
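For context, the Python-side extraction I'm doing now amounts to something like the following (a minimal sketch using only the standard library; the function names are illustrative, not my actual crawler code):

```python
from urllib.parse import urlparse

def extract_host(url):
    """Return the hostname part of a URL, e.g. 'www.foo.com'."""
    return urlparse(url).netloc

def lowest_per_host(scored_links):
    """Keep only the lowest-scored (score, url) pair per host,
    mirroring what I want the SQL query to do."""
    best = {}
    for score, url in scored_links:
        host = extract_host(url)
        if host not in best or score < best[host][0]:
            best[host] = (score, url)
    return sorted(best.values())
```

This works, but it pulls every candidate link out of the database just to throw most of them away, which is why I'd rather filter inside Postgres.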
On 07.04.2019 07:06, Jess Wren wrote:
I think it is normal to use ts_parse(). And I suppose you might use
window functions.
For example, you have table links:
=# create table links (score int, link text);
=# insert into links values
(1, 'http://www.foo.com/bar'),
(2, 'http://www.foo.com/foo'),
(2, 'http://www.bar.com/foo'),
(1, 'http://www.bar.com/bar');
You can use the following query:
=# with l as (
select score, token, link,
rank() over (partition by token order by score) as rank
from links,
lateral ts_parse('default', link)
where tokid = 6)
select score, token, link from l where rank = 1;
score | token | link
-------+-------------+------------------------
1 | www.bar.com | http://www.bar.com/bar
1 | www.foo.com | http://www.foo.com/bar
It is just the idea, probably the query might be simpler.
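One simpler form (an untested sketch) is DISTINCT ON, which keeps only the first row per token under the given ordering:

```sql
-- One row per host: the lowest-scored link for each token
SELECT DISTINCT ON (token) score, token, link
FROM links,
     LATERAL ts_parse('default', link)
WHERE tokid = 6
ORDER BY token, score;
```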
--
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
On 4/8/19 4:50 AM, Arthur Zakirov wrote:
Thank you very much Arthur. Your suggestion led me to a query that is at
least returning the correct result set. I could not figure out how to get
your rank() function to work with my query, but building on your answer
(and others from IRC etc), I ended up with the following solution:
First I created the following views:
CREATE VIEW scored_pages AS (
    SELECT crawl_results.crawl_id, crawl_results.score,
           crawl_results.page_id, pages.url
    FROM crawl_results
    JOIN pages ON crawl_results.page_id = pages.id
);

CREATE VIEW scored_links AS (
    SELECT scored_pages.score, links.source, links.target, links.link_text
    FROM links
    JOIN scored_pages ON scored_pages.url = links.source
);
Then, using these views, I did the following query to extract the links
from the lowest scored pages in the results:
SELECT score, host, target
FROM (
    SELECT DISTINCT ON (token)
           token AS host, score, target
    FROM scored_links,
         LATERAL ts_parse('default', target)
    WHERE tokid = 6
    ORDER BY token, score
) AS x
WHERE NOT EXISTS (SELECT pp.id FROM pages pp WHERE target = pp.url)
ORDER BY score;
Does this seem like a reasonable approach? When running EXPLAIN on this
query, I get the following:
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Sort (cost=1252927.46..1252927.47 rows=1 width=100)
Sort Key: crawl_results.score
-> Hash Anti Join (cost=1248297.18..1252927.45 rows=1 width=100)
Hash Cond: ((links.target)::text = (pp.url)::text)
-> Unique (cost=1247961.08..1252591.28 rows=5 width=100)
-> Sort (cost=1247961.08..1250276.18 rows=926040 width=100)
Sort Key: ts_parse.token, crawl_results.score
-> Gather (cost=1449.79..1054897.20 rows=926040 width=100)
Workers Planned: 2
-> Hash Join (cost=449.79..961293.20 rows=385850 width=100)
Hash Cond: ((links.source)::text = (pages.url)::text)
-> Nested Loop (cost=0.00..955091.41 rows=378702 width=144)
-> Parallel Seq Scan on links (cost=0.00..4554.40 rows=75740 width=112)
-> Function Scan on ts_parse (cost=0.00..12.50 rows=5 width=32)
Filter: (tokid = 6)
-> Hash (cost=404.67..404.67 rows=3609 width=63)
-> Hash Join (cost=336.10..404.67 rows=3609 width=63)
Hash Cond: (crawl_results.page_id = pages.id)
-> Seq Scan on crawl_results (cost=0.00..59.09 rows=3609 width=12)
-> Hash (cost=291.60..291.60 rows=3560 width=59)
-> Seq Scan on pages (cost=0.00..291.60 rows=3560 width=59)
-> Hash (cost=291.60..291.60 rows=3560 width=55)
-> Seq Scan on pages pp (cost=0.00..291.60 rows=3560 width=55)
(23 rows)
I am wondering if there is a more efficient way to do things? Some
people on IRC mentioned that it might be better to declare a scalar
function to return the host from ts_parse instead of the LATERAL query
... but I couldn't figure out how to do that, or if it was even
preferable to the above from a performance standpoint. Any ideas on how
I could improve the above?
On Wed, Apr 10, 2019 at 1:58 AM Jess Wren <jess.wren@interference.cc> wrote:
May try indexing the parsed expression to avoid the seq scan on links.
Note that ts_parse() is set-returning, so it can't be used directly in
an index expression; one option is to wrap it in an immutable scalar
function and index that, something like:
create function host(url text) returns text as $$
    select token from ts_parse('default', url) where tokid = 6 limit 1
$$ language sql immutable;
create index on links (host(target));
and then run the explain (or explain analyze) on a query that filters on
host(target) to see if that improves things. Certainly as the links
table gets bigger this should help.