no mailing list hits in google

Started by Merlin Moncure over 6 years ago · 19 messages
#1 Merlin Moncure
mmoncure@gmail.com

Hackers,
[apologies if this is the incorrect list or is already discussed material]

I've noticed that mailing list discussions in -hackers and other
mailing lists appear to not be indexed by google -- at all. We are
also not being tracked by any mailing list aggregators -- in contrast
to a decade ago when we had nabble and other systems to collect and
organize results (tbh, often better than we do), we are now at an
extreme disadvantage; mailing list activity was formerly an absolutely
fantastic research resource via google for finding solutions to obscure
technical problems in the database. Limited access to this
information will directly lead to increased bug reports, lack of
solution confidence, etc.

My test case here is the query: pgsql-hackers ExecHashJoinNewBatch
I was searching out a link to a recent bug report for copy/paste into
corporate email. In the old days this would fire right up but now
returns no hits even though the discussion is available in the
archives (which I had to find by looking up the specific day the
thread was active). Just a heads up.

merlin

#2 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Merlin Moncure (#1)
Re: no mailing list hits in google

Merlin Moncure <mmoncure@gmail.com> writes:

[apologies if this is the incorrect list or is already discussed material]

It's not the right list; redirecting to pgsql-www.

I've noticed that mailing list discussions in -hackers and other
mailing lists appear to not be indexed by google -- at all. We are
also not being tracked by any mailing list aggregators -- in contrast
to a decade ago when we had nabble and other systems to collect and
organize results (tbh, often better than we do), we are now at an
extreme disadvantage; mailing list activity was formerly an absolutely
fantastic research resource via google for finding solutions to obscure
technical problems in the database. Limited access to this
information will directly lead to increased bug reports, lack of
solution confidence, etc.

My test case here is the query: pgsql-hackers ExecHashJoinNewBatch
I was searching out a link to a recent bug report for copy/paste into
corporate email. In the old days this would fire right up but now
returns no hits even though the discussion is available in the
archives (which I had to find by looking up the specific day the
thread was active). Just a heads up.

Hm. When I try googling that, the first thing I get is

pgsql-hackers - PostgreSQL

https://www.postgresql.org › list › pgsql-hackers
No information is available for this page.
Learn why

and the "learn why" link says that "You are seeing this result because the
page is blocked by a robots.txt file on your website."

So somebody has blocked the archives from being indexed.
Seems like a bad idea.

regards, tom lane

#3 Magnus Hagander
magnus@hagander.net
In reply to: Tom Lane (#2)
Re: no mailing list hits in google

On Wed, Aug 28, 2019 at 6:51 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:

Merlin Moncure <mmoncure@gmail.com> writes:

[apologies if this is the incorrect list or is already discussed material]

It's not the right list; redirecting to pgsql-www.

I've noticed that mailing list discussions in -hackers and other
mailing lists appear to not be indexed by google -- at all. We are
also not being tracked by any mailing list aggregators -- in contrast
to a decade ago when we had nabble and other systems to collect and
organize results (tbh, often better than we do), we are now at an
extreme disadvantage; mailing list activity was formerly an absolutely
fantastic research resource via google for finding solutions to obscure
technical problems in the database. Limited access to this
information will directly lead to increased bug reports, lack of
solution confidence, etc.

My test case here is the query: pgsql-hackers ExecHashJoinNewBatch
I was searching out a link to a recent bug report for copy/paste into
corporate email. In the old days this would fire right up but now
returns no hits even though the discussion is available in the
archives (which I had to find by looking up the specific day the
thread was active). Just a heads up.

Hm. When I try googling that, the first thing I get is

pgsql-hackers - PostgreSQL

https://www.postgresql.org › list › pgsql-hackers
No information is available for this page.
Learn why

and the "learn why" link says that "You are seeing this result because the
page is blocked by a robots.txt file on your website."

So somebody has blocked the archives from being indexed.
Seems like a bad idea.

It blocks /list/ which has the subjects only. The actual emails in
/message-id/ are not blocked by robots.txt. I don't know why they stopped
appearing in the searches... Nothing has been changed around that for many
years from *our* side.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#4 Andres Freund
andres@anarazel.de
In reply to: Merlin Moncure (#1)
Re: no mailing list hits in google

Hi,

On August 28, 2019 9:22:44 AM PDT, Merlin Moncure <mmoncure@gmail.com> wrote:

Hackers,
[apologies if this is the incorrect list or is already discussed
material]

Probably should be on the -www list. Redirecting. Please trim in future replies.

I've noticed that mailing list discussions in -hackers and other
mailing lists appear to not be indexed by google -- at all.

I noticed that there are fewer and fewer hits too. Pretty annoying. I have an online archive I can search, but that's not something everyone should have to do.

I think it's because robots.txt tells search engines to ignore the lists. Quite hard to understand how that's a good idea.

https://www.postgresql.org/robots.txt

User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /docs/devel/
Disallow: /list/
Disallow: /search/
Disallow: /message-id/raw/
Disallow: /message-id/flat/

Sitemap: https://www.postgresql.org/sitemap.xml

Without /list, there are no links to the individual messages. So there needs to be another external reference for a search engine to arrive at individual messages.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#5 Andres Freund
andres@anarazel.de
In reply to: Magnus Hagander (#3)
Re: no mailing list hits in google

Hi,

On 2019-08-28 19:09:40 +0200, Magnus Hagander wrote:

It blocks /list/ which has the subjects only.

Yea. But there's no way to actually get to all the individual messages
without /list/? Sure, some will be linked to from somewhere else, but
without the content below /list/, most won't be reached?

Why is that /list/ exclusion there in the first place?

Nothing has been changed around that for many years from *our* side.

Any chance that there previously still was an archives.postgresql.org
view or such that allowed reaching the individual messages without being
blocked by robots.txt?

Greetings,

Andres Freund

#6 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#4)
Re: no mailing list hits in google

Hi,

On 2019-08-28 10:26:35 -0700, Andres Freund wrote:

On August 28, 2019 9:22:44 AM PDT, Merlin Moncure <mmoncure@gmail.com> wrote:

Hackers,
[apologies if this is the incorrect list or is already discussed
material]

Probably should be on the -www list. Redirecting. Please trim in future replies.

I've noticed that mailing list discussions in -hackers and other
mailing lists appear to not be indexed by google -- at all.

I noticed that there are fewer and fewer hits too. Pretty annoying. I have an online archive I can search, but that's not something everyone should have to do.

I think it's because robots.txt tells search engines to ignore the lists. Quite hard to understand how that's a good idea.

https://www.postgresql.org/robots.txt

User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /docs/devel/
Disallow: /list/
Disallow: /search/
Disallow: /message-id/raw/
Disallow: /message-id/flat/

Sitemap: https://www.postgresql.org/sitemap.xml

Without /list, there are no links to the individual messages. So there needs to be another external reference for a search engine to arrive at individual messages.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

For reasons that I do not understand, the previous mail had a broken
html part, making the above message invisible for people viewing the
html part.

Greetings,

Andres Freund

#7 Tom Lane
tgl@sss.pgh.pa.us
In reply to: Magnus Hagander (#3)
Re: no mailing list hits in google

Magnus Hagander <magnus@hagander.net> writes:

It blocks /list/ which has the subjects only. The actual emails in
/message-id/ are not blocked by robots.txt. I don't know why they stopped
appearing in the searches... Nothing has been changed around that for many
years from *our* side.

If I go to

https://www.postgresql.org/message-id/

I get a page saying "Not Found". So I'm not clear on how a web crawler
would descend through that to individual messages.

Even if it looks different to a robot, what would it look like exactly?
A flat space of umpteen zillion immediate-child pages? It seems not
improbable that Google's search engine would intentionally decide not to
index that, or unintentionally just fail due to some internal resource
limit. (This theory can explain why it used to work and no longer does:
we got past whatever the limit is.)

Andres' idea of allowing access to /list/ would allow the archives to be
traversed in more bite-size pieces, which might fix the issue.

regards, tom lane

#8 Thomas Kellerer
shammat@gmx.net
In reply to: Merlin Moncure (#1)
Re: no mailing list hits in google

Merlin Moncure wrote on 28.08.2019 at 18:22:

My test case here is the query: pgsql-hackers ExecHashJoinNewBatch

That search term is the first hit on DuckDuckGo:
https://duckduckgo.com/?q=pgsql-hackers+ExecHashJoinNewBatch&t=h_&ia=web

Searching for "postgres ExecHashJoinNewBatch" returns that at position 4:
https://duckduckgo.com/?q=postgres+ExecHashJoinNewBatch&t=h_&ia=web

#9 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Thomas Kellerer (#8)
Re: no mailing list hits in google

On 2019-Aug-28, Thomas Kellerer wrote:

Merlin Moncure wrote on 28.08.2019 at 18:22:

My test case here is the query: pgsql-hackers ExecHashJoinNewBatch

That search term is the first hit on DuckDuckGo:
https://duckduckgo.com/?q=pgsql-hackers+ExecHashJoinNewBatch&t=h_&ia=web

Yes, but that's an old post, not the one from this year.

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#10 Magnus Hagander
magnus@hagander.net
In reply to: Andres Freund (#5)
Re: no mailing list hits in google

On Wed, Aug 28, 2019 at 7:45 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-08-28 19:09:40 +0200, Magnus Hagander wrote:

It blocks /list/ which has the subjects only.

Yea. But there's no way to actually get to all the individual messages
without /list/? Sure, some will be linked to from somewhere else, but
without the content below /list/, most won't be reached?

That is indeed a good point. But it has been that way for many years, so
something must've changed. We last modified this in 2013....

Maybe Google used to load the pages under /list/ and crawl them for links
but just not include the actual pages in the index or something.

I wonder if we can inject these into Google using a sitemap. I think that
should work -- will need some investigation on exactly how to do it, as
sitemaps also have individual restrictions on the number of urls per file,
and we do have quite a few messages.
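
For reference, the sitemaps.org protocol caps a single sitemap file at
50,000 URLs, but a sitemap index can reference many smaller sitemaps, so
e.g. one sitemap per archive month would fit. A minimal sketch (file
names are illustrative):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.postgresql.org/sitemaps/messages-2019-08.xml.gz</loc>
    <lastmod>2019-08-31</lastmod>
  </sitemap>
  <!-- one <sitemap> entry per archive month -->
</sitemapindex>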

Why is that /list/ exclusion there in the first place?

Because there are basically an infinite number of pages in that space, due to
the fact that you can pick an arbitrary point in time to view from.
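
To illustrate: every list archive can be rendered starting "since" any
arbitrary timestamp, so the URL space under /list/ looks roughly like this
(URL shapes and timestamps are hypothetical):

/list/pgsql-hackers/since/201908280000/
/list/pgsql-hackers/since/201908280001/
/list/pgsql-hackers/since/201908280002/

...and so on, one page per possible timestamp.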

Nothing has been changed around that for many years from *our* side.

Any chance that there previously still was an archives.postgresql.org
view or such that allowed reaching the individual messages without being
blocked by robots.txt?

That one had a robots.txt blocking this going back even further in time.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#11 Magnus Hagander
magnus@hagander.net
In reply to: Alvaro Herrera (#9)
Re: no mailing list hits in google

On Wed, Aug 28, 2019 at 10:31 PM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

On 2019-Aug-28, Thomas Kellerer wrote:

Merlin Moncure wrote on 28.08.2019 at 18:22:

My test case here is the query: pgsql-hackers ExecHashJoinNewBatch

That search term is the first hit on DuckDuckGo:
https://duckduckgo.com/?q=pgsql-hackers+ExecHashJoinNewBatch&t=h_&ia=web

Yes, but that's an old post, not the one from this year.

It does show another interesting point though -- it *also* includes hits
from third party list archiving sites, which are *also* gone from Google at
this point. And those are definitely not gone from Google because we have a
robots.txt blocking /list/ -- it must be something else.

//Magnus

#12 Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Magnus Hagander (#10)
Re: no mailing list hits in google

On 2019-Aug-29, Magnus Hagander wrote:

Maybe Google used to load the pages under /list/ and crawl them for links
but just not include the actual pages in the index or something.

I wonder if we can inject these into Google using a sitemap. I think that
should work -- will need some investigation on exactly how to do it, as
sitemaps also have individual restrictions on the number of urls per file,
and we do have quite a few messages.

Why is that /list/ exclusion there in the first place?

Because there are basically an infinite number of pages in that space, due to
the fact that you can pick an arbitrary point in time to view from.

Maybe we can create a new page that's specifically to be used by
crawlers, that lists all emails, each only once. Say (unimaginatively)
/list_crawlers/2019-08/ containing links to all emails of all public
lists occurring during August 2019.
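
A sketch of what such a page could contain (message-ids made up):

<!-- /list_crawlers/2019-08/ : one stable link per message -->
<h1>All messages, August 2019</h1>
<ul>
  <li><a href="/message-id/example-1@example.org">Re: some thread</a></li>
  <li><a href="/message-id/example-2@example.org">another subject</a></li>
</ul>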

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#13 Magnus Hagander
magnus@hagander.net
In reply to: Alvaro Herrera (#12)
Re: no mailing list hits in google

On Thu, Aug 29, 2019 at 3:32 PM Alvaro Herrera <alvherre@2ndquadrant.com>
wrote:

On 2019-Aug-29, Magnus Hagander wrote:

Maybe Google used to load the pages under /list/ and crawl them for links
but just not include the actual pages in the index or something.

I wonder if we can inject these into Google using a sitemap. I think that
should work -- will need some investigation on exactly how to do it, as
sitemaps also have individual restrictions on the number of urls per file,
and we do have quite a few messages.

Why is that /list/ exclusion there in the first place?

Because there are basically an infinite number of pages in that space, due to
the fact that you can pick an arbitrary point in time to view from.

Maybe we can create a new page that's specifically to be used by
crawlers, that lists all emails, each only once. Say (unimaginatively)
/list_crawlers/2019-08/ containing links to all emails of all public
lists occurring during August 2019.

That's pretty much what I'm suggesting but using a sitemap so it's directly
injected.

//Magnus

#14 Andres Freund
andres@anarazel.de
In reply to: Magnus Hagander (#10)
Re: no mailing list hits in google

Hi,

On 2019-08-29 13:12:00 +0200, Magnus Hagander wrote:

On Wed, Aug 28, 2019 at 7:45 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2019-08-28 19:09:40 +0200, Magnus Hagander wrote:

It blocks /list/ which has the subjects only.

Yea. But there's no way to actually get to all the individual messages
without /list/? Sure, some will be linked to from somewhere else, but
without the content below /list/, most won't be reached?

That is indeed a good point. But it has been that way for many years, so
something must've changed. We last modified this in 2013....

Hm. I guess it's possible that most pages were found due to the
next/prev links in individual messages, once one of them is linked from
somewhere externally. Any chance there's enough logs around to see
from where to where the indexers currently move?

I wonder if we can inject these into Google using a sitemap. I think that
should work -- will need some investigation on exactly how to do it, as
sitemaps also have individual restrictions on the number of urls per file,
and we do have quite a few messages.

Hm. You mean in addition to allowing /list/ or solely?

Why is that /list/ exclusion there in the first place?

Because there are basically an infinite number of pages in that space, due to
the fact that you can pick an arbitrary point in time to view from.

You mean because of the per-day links, that aren't really per-day? I
think the number of links due to that would still be manageable traffic
wise? Or are they that expensive to compute? Perhaps we could make the
"jump to day" links smarter in some way? Perhaps by not including
content for the following days in the per-day pages?

Greetings,

Andres Freund

#15 Andres Freund
andres@anarazel.de
In reply to: Alvaro Herrera (#12)
Re: no mailing list hits in google

Hi,

On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:

On 2019-Aug-29, Magnus Hagander wrote:

Maybe Google used to load the pages under /list/ and crawl them for links
but just not include the actual pages in the index or something

I wonder if we can inject these into Google using a sitemap. I think that
should work -- will need some investigation on exactly how to do it, as
sitemaps also have individual restrictions on the number of urls per file,
and we do have quite a few messages.

Why is that /list/ exclusion there in the first place?

Because there are basically an infinite number of pages in that space, due to
the fact that you can pick an arbitrary point in time to view from.

Maybe we can create a new page that's specifically to be used by
crawlers, that lists all emails, each only once. Say (unimaginatively)
/list_crawlers/2019-08/ containing links to all emails of all public
lists occurring during August 2019.

Hm. Weren't there occasionally downranking rules for pages that were
clearly aimed just at search engines? Honestly I find the current
navigation with the overlapping content to be not great for humans too,
so I think it might be worthwhile to rather improve the general
navigation and allow robots for /list/. But if that's too much/not well
specified enough: perhaps we could mark the per-day links as
rel=nofollow, but not the prev/next links when starting at certain
boundaries?
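
Concretely, that could look something like this in the month view, so
crawlers can still walk the archive via messages but don't fan out over
every timestamp (URL shapes are illustrative):

<!-- per-day jump link: marked nofollow -->
<a href="/list/pgsql-hackers/since/201908280000/" rel="nofollow">Aug 28</a>
<!-- prev/next links on individual messages stay followable -->
<a href="/message-id/example-1@example.org">Next message</a>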

Greetings,

Andres Freund

#16 Daniel Gustafsson
daniel@yesql.se
In reply to: Andres Freund (#15)
Re: no mailing list hits in google

On 29 Aug 2019, at 16:55, Andres Freund <andres@anarazel.de> wrote:
On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:

On 2019-Aug-29, Magnus Hagander wrote:

Maybe Google used to load the pages under /list/ and crawl them for links
but just not include the actual pages in the index or something.

I wonder if we can inject these into Google using a sitemap. I think that
should work -- will need some investigation on exactly how to do it, as
sitemaps also have individual restrictions on the number of urls per file,
and we do have quite a few messages.

Why is that /list/ exclusion there in the first place?

Because there are basically an infinite number of pages in that space, due to
the fact that you can pick an arbitrary point in time to view from.

Maybe we can create a new page that's specifically to be used by
crawlers, that lists all emails, each only once. Say (unimaginatively)
/list_crawlers/2019-08/ containing links to all emails of all public
lists occurring during August 2019.

Hm. Weren't there occasionally downranking rules for pages that were
clearly aimed just at search engines?

I think that’s mainly been for pages which are clearly keyword spamming; I
doubt our content would get caught there. The sitemap proposed upthread is,
however, the solution to this, and is also the recommended way from Google
for sites with lots of content.

Google does however explicitly downrank duplicated/similar content, or content
which can be reached via multiple URLs and which doesn’t list a canonical URL
in the page. A single message and the whole-thread link contain the same
content, and neither is marked canonical, so we might be incurring penalties
from that. Also, the postgr.es/m/ shortener makes content available via two
URLs, without a canonical URL specified.
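
For what it’s worth, the standard fix for that would be a canonical link in
the head of every alternate view, all pointing at the plain /message-id/
page (message-id made up):

<link rel="canonical" href="https://www.postgresql.org/message-id/example-1@example.org" />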

That being said, since we haven’t changed anything, and DuckDuckGo happily
indexes the mailinglist posts, this smells a lot more like a policy change
than a technical change, if my experience with Google SEO is anything to go
by. The Webmaster Tools Search Console can quite often give insights as to
why a page is missing; that’s probably a better place to start than
second-guessing Google SEO. AFAICR, using that requires proving that one
owns the site/domain, but doesn’t require adding any google trackers or
similar things.

cheers ./daniel

#17 Magnus Hagander
magnus@hagander.net
In reply to: Daniel Gustafsson (#16)
Re: no mailing list hits in google

On Fri, Aug 30, 2019 at 11:40 AM Daniel Gustafsson <daniel@yesql.se> wrote:

On 29 Aug 2019, at 16:55, Andres Freund <andres@anarazel.de> wrote:
On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:

On 2019-Aug-29, Magnus Hagander wrote:

Maybe Google used to load the pages under /list/ and crawl them for links
but just not include the actual pages in the index or something.

I wonder if we can inject these into Google using a sitemap. I think that
should work -- will need some investigation on exactly how to do it, as
sitemaps also have individual restrictions on the number of urls per file,
and we do have quite a few messages.

Why is that /list/ exclusion there in the first place?

Because there are basically an infinite number of pages in that space, due to
the fact that you can pick an arbitrary point in time to view from.

Maybe we can create a new page that's specifically to be used by
crawlers, that lists all emails, each only once. Say (unimaginatively)
/list_crawlers/2019-08/ containing links to all emails of all public
lists occurring during August 2019.

Hm. Weren't there occasionally downranking rules for pages that were
clearly aimed just at search engines?

I think that’s mainly been for pages which are clearly keyword spamming; I
doubt our content would get caught there. The sitemap proposed upthread is,
however, the solution to this, and is also the recommended way from Google
for sites with lots of content.

Google does however explicitly downrank duplicated/similar content, or content
which can be reached via multiple URLs and which doesn’t list a canonical URL
in the page. A single message and the whole-thread link contain the same
content, and neither is marked canonical, so we might be incurring penalties
from that. Also, the postgr.es/m/ shortener makes content available via two
URLs, without a canonical URL specified.

But robots.txt blocks the whole-thread view (and this is the reason for it).
And postgr.es/m/ does not actually make the content available there; it
redirects.

So I don't think those should actually have an effect?

That being said, since we haven’t changed anything, and DuckDuckGo happily
indexes the mailinglist posts, this smells a lot more like a policy change
than a technical change, if my experience with Google SEO is anything to go
by. The Webmaster Tools Search Console can quite often give insights as to
why a page is missing; that’s probably a better place to start than
second-guessing Google SEO. AFAICR, using that requires proving that one
owns the site/domain, but doesn’t require adding any google trackers or
similar things.

I've tried but failed to get any relevant data out of it. It does clearly
show large amounts of URLs blocked because they are in /flat/ or /raw/, but
nothing at all about the regular messages.

--
Magnus Hagander
Me: https://www.hagander.net/
Work: https://www.redpill-linpro.com/

#18 Daniel Gustafsson
daniel@yesql.se
In reply to: Magnus Hagander (#17)
Re: no mailing list hits in google

On 30 Aug 2019, at 12:08, Magnus Hagander <magnus@hagander.net> wrote:
On Fri, Aug 30, 2019 at 11:40 AM Daniel Gustafsson <daniel@yesql.se> wrote:

On 29 Aug 2019, at 16:55, Andres Freund <andres@anarazel.de> wrote:
On 2019-08-29 09:32:35 -0400, Alvaro Herrera wrote:

On 2019-Aug-29, Magnus Hagander wrote:

Maybe Google used to load the pages under /list/ and crawl them for links
but just not include the actual pages in the index or something.

I wonder if we can inject these into Google using a sitemap. I think that
should work -- will need some investigation on exactly how to do it, as
sitemaps also have individual restrictions on the number of urls per file,
and we do have quite a few messages.

Why is that /list/ exclusion there in the first place?

Because there are basically an infinite number of pages in that space, due to
the fact that you can pick an arbitrary point in time to view from.

Maybe we can create a new page that's specifically to be used by
crawlers, that lists all emails, each only once. Say (unimaginatively)
/list_crawlers/2019-08/ containing links to all emails of all public
lists occurring during August 2019.

Hm. Weren't there occasionally downranking rules for pages that were
clearly aimed just at search engines?

I think that’s mainly been for pages which are clearly keyword spamming; I
doubt our content would get caught there. The sitemap proposed upthread is,
however, the solution to this, and is also the recommended way from Google
for sites with lots of content.

Google does however explicitly downrank duplicated/similar content, or content
which can be reached via multiple URLs and which doesn’t list a canonical URL
in the page. A single message and the whole-thread link contain the same
content, and neither is marked canonical, so we might be incurring penalties
from that. Also, the postgr.es/m/ shortener makes content available via two
URLs, without a canonical URL specified.

But robots.txt blocks the whole-thread view (and this is the reason for it).

Maybe that’s part of the explanation, since Google no longer wants sites to
use robots.txt for restricting crawlers on what to index (contrary to much
indexing advice, which is vague at best, they actually say so explicitly)?
Being in robots.txt doesn’t restrict the page from being indexed if it is
linked to from somewhere else with enough context etc (for example if a
thread is reproduced on a forum with a link to /message-id/raw). Their
recommended way is to mark the page with noindex:

<meta name="robots" content="noindex" />
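
Google also honors the same directive as an HTTP response header, which
would work even for views like /message-id/raw/ that have no HTML head to
put a meta tag in:

X-Robots-Tag: noindex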

And postgr.es/m/ does not actually make the content available there; it redirects.

Right, but a 301 redirect is considered by Google as deprecating the old page,
which may or may not throw the indexer off since we continue to use
postgr.es/m/ without a canonicalization?

So I don't think those should actually have an effect?

That could very well be true, as with most things SEO it’s all a guessing game.

That being said, since we haven’t changed anything, and DuckDuckGo happily
indexes the mailinglist posts, this smells a lot more like a policy change than
a technical change, if my experience with Google SEO is anything to go by. The
Webmaster Tools Search Console can quite often give insights as to why a page
is missing; that’s probably a better place to start than second-guessing Google
SEO. AFAICR, using that requires proving that one owns the site/domain, but
doesn’t require adding any google trackers or similar things.

I've tried but failed to get any relevant data out of it. It does clearly show large amounts of URLs blocked because they are in /flat/ or /raw/, but nothing at all about the regular messages.

That’s disappointing; I’ve gotten quite good advice there in the past.

cheers ./daniel

#19 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#14)
Re: no mailing list hits in google

Hi,

This got brought up again in a twitter discussion; see
https://twitter.com/AndresFreundTec/status/1403418002951794688

On 2019-08-29 07:50:13 -0700, Andres Freund wrote:

Why is that /list/ exclusion there in the first place?

Because there are basically an infinite number of pages in that space, due to
the fact that you can pick an arbitrary point in time to view from.

You mean because of the per-day links, that aren't really per-day? I
think the number of links due to that would still be manageable traffic
wise? Or are they that expensive to compute? Perhaps we could make the
"jump to day" links smarter in some way? Perhaps by not including
content for the following days in the per-day pages?

I still don't understand why all of /list/ is in robots.txt. I
understand why we don't necessarily want to index /list/.../since/...,
but prohibiting all of /list/ seems like an extremely poorly aimed
big hammer.

Can't we use wildcards to at least allow everything but the /since/
links? E.g. Disallow: /list/*/since/*. Is it because we're worried some
less common crawler doesn't implement wildcards at all?
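
That would turn the current file into something like (other entries kept
as-is):

User-agent: *
Disallow: /admin/
Disallow: /account/
Disallow: /docs/devel/
# /list/ itself becomes crawlable; only the per-timestamp views stay blocked
Disallow: /list/*/since/*
Disallow: /search/
Disallow: /message-id/raw/
Disallow: /message-id/flat/

Sitemap: https://www.postgresql.org/sitemap.xml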

Or slap rel=nofollow on links / add a meta tag preventing /since/ pages
from being indexed.

Yes, that'd not be perfect for the bigger lists, because there's no
"direct" way to get from the month's archive to all the month's emails
when paginated. But there are still the next/prev links. And it'd be much
better than what we have right now.

Greetings,

Andres Freund