Ideas for building a system that parses medical research publications/articles
Hello
I am imagining a system that can parse papers from various sources
(web, files, etc.) and in various formats (text, PDF, etc.) and can store
metadata for each paper: some kind of global ID if applicable, authors,
areas of research, whether the paper is "new", "highlighted" or
"historical", type (e.g. case reports, clinical trials), symptoms (e.g.
tics, GI pain, psychological changes, anxiety), and other key
attributes (dynamic, I guess). It must be full-text searchable, etc.
I am at the very beginning of this, and it is done on a fully volunteer
basis.
Lots of questions: is there any scientific/scholarly analysis software
already available? If there is, and it is really good and open source,
that will influence the rest of the decisions. Otherwise, I'll have to
form a team that can write one, in which case I'll have to decide on DB,
language, etc. I have worked with pgsql for 20 years, so it is my natural
choice for any kind of data; I just ask this for the sake of completeness.
All ideas welcome.
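The attribute list in the post above could be captured in a record along the following lines. This is only a minimal Python sketch under stated assumptions: the class name PaperRecord, every field name, and the STATUSES set are hypothetical and do not come from any existing tool. The free-form "extra" dictionary stands in for the "dynamic" attributes and would map naturally onto a jsonb column if the record ends up in PostgreSQL.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List, Optional

# Hypothetical status values, taken from the three labels named in the post.
STATUSES = {"new", "highlighted", "historical"}


@dataclass
class PaperRecord:
    """Illustrative metadata record; all names here are assumptions."""
    title: str
    source_url: str                        # where the paper can be read
    global_id: Optional[str] = None        # e.g. a DOI or PubMed ID, if one exists
    authors: List[str] = field(default_factory=list)
    research_areas: List[str] = field(default_factory=list)
    status: str = "new"                    # one of STATUSES
    paper_type: Optional[str] = None       # e.g. "case report", "clinical trial"
    symptoms: List[str] = field(default_factory=list)
    extra: Dict[str, object] = field(default_factory=dict)  # dynamic attributes

    def __post_init__(self) -> None:
        # Reject status labels outside the agreed vocabulary early.
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status!r}")

    def to_json(self) -> str:
        """Serialize the record, e.g. for loading into a jsonb column."""
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    record = PaperRecord(
        title="Example case report",
        source_url="https://example.org/paper",
        symptoms=["tics", "anxiety"],
        extra={"cohort_size": 12},
    )
    print(record.to_json())
```

Keeping the fixed attributes as typed fields and everything else in one dictionary is just one possible split; which attributes deserve their own column is a design decision for later.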
Hello Achilleas
Not wishing to be discouraging, but you have very ambitious goals for what sounds like a one-person project.
You are effectively looking at competing with platforms such as Elsevier Scopus/Scival which are market-leaders in the area for good reason (i.e. it takes a lot of manpower to write algorithms, manage metadata etc., and the only way to consistently maintain that manpower is to employ people, lots of them). There are also things like Google Scholar around the place.
I think before starting on the technical side of Postgres etc., the honest truth is that you need to do more planning, both in terms of implementation and long-term sustainability.
For example, before we even get to metadata, you talk of various sources and formats. Have you considered licensing issues? Have you considered how to keep the dataset clean? (If you are thinking you can just scrape the web, then you'll be in for a surprise.)
Laura
All I have got are some very vague descriptions coming from people on
either the advocacy side or the medical side.
I have no idea about the legal status of those documents; as you know,
some are covered by the artistic license (a few in PubMed), some are not.
I am not a lawyer. The data are not to be stored locally AFAIK, so only
metadata will be kept locally, and it can be reset, refreshed, amended, etc.
Parsing will be equivalent to a one-off human reading of the article on
the web. There is a lawyer handling all of that. Out of the whole network
of people interested in this endeavor, I am the only one with DB/software
knowledge, which is why I volunteered.
I know it's a huge amount of work, but you are missing a point. Nobody
wishes to compete with anyone. This is about a project, a parent-advocacy
non-profit that *ONLY* aims to save the sick children (or maybe also
very young adults) of a certain spectrum. So the goal is to make the
right tools for researchers, clinicians and parents. This market is too
small to even consider making any money out of it, but the research is
still very expensive and the progress slower than optimal.
I am imagining a system that can parse papers from various sources
(web/files/etc) and in various formats (text, pdf, etc) and can store
metadata for this paper ,some kind of global ID if applicable, authors,
areas of research, whether the paper is "new", "highlighted",
"historical", type
Those three categories won't help much. I'm sure, though, that you had
something specific in mind with them?
Karsten
‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐
On Saturday, 5 June 2021 12:14, Achilleas Mantzios <achill@matrix.gatewaynet.com> wrote:
I know its a huge work, but you are missing a point. Nobody wishes to
compete with anyone. This is a about a project, a parent-advocacy
non-profit that ONLY aims to save the sick children (or maybe also
very young adults) of a certain spectrum . So the goal is to make the
right tools for researchers, clinicians and parents. This market is too
small to even consider making any money out of it, but the research is
still very expensive and the progress slower than optimum.
Unfortunately I'm not "missing a point", your final paragraph summarises your position.
You have been taken in by the very charitable goal of saving sick children.
Unfortunately your head has been disconnected from your heart.
If we put the charitable purpose to one side and take a purely objective view of what you want to do, my original statement still stands, i.e. the certainty that you are grossly underestimating the technical and practical complexities of what you want to achieve.
To get started with collecting document metadata, it looks like this tool
can help.
Postgres does support fuzzy text search, so I think dumping the metadata
and abstracts into PostgreSQL and then using extensions such as trigram
(pg_trgm), tsearch, etc. should work well for a POC.
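To give an idea of what trigram matching does, here is a rough pure-Python approximation of the similarity measure used by the pg_trgm extension. It is a sketch under stated assumptions, not a replacement for the extension: it handles ASCII letters and digits only, the function names trigrams, similarity and fuzzy_search are hypothetical, and the 0.3 cut-off simply mirrors pg_trgm's default similarity threshold.

```python
import re
from typing import List, Set


def trigrams(text: str) -> Set[str]:
    """Trigram set of a string, approximating pg_trgm's rules:
    lowercase, split into alphanumeric words, pad each word with
    two leading spaces and one trailing space."""
    grams: Set[str] = set()
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        padded = "  " + word + " "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams of both strings."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def fuzzy_search(query: str, candidates: List[str], threshold: float = 0.3) -> List[str]:
    """Return candidates at or above the threshold, best match first."""
    scored = [(similarity(query, c), c) for c in candidates]
    return [c for score, c in sorted(scored, reverse=True) if score >= threshold]


if __name__ == "__main__":
    # A misspelled query still finds the intended term.
    print(fuzzy_search("anxeity", ["anxiety", "tics"]))
```

In PostgreSQL itself the same idea is available after CREATE EXTENSION pg_trgm, through the similarity() function and the % operator, with index support, so none of this needs to live in application code.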
This being a pg mailing list :) what are your expectations for the type of
data, the growth of the data, and your queries?
If you store data to support multilingual papers, will PostgreSQL be
able to handle it?
Ideally the docs would be stored somewhere on object storage etc., and a
link to them would be stored in the DB for when someone requests to
read the whole paper.
A long time ago I read this:
https://www.citusdata.com/blog/2017/04/20/analyzing-postgresql-email-archives/
So if this could work, your POC should too :) with postgresql.
--
Thanks,
Vijay
Mumbai, India
A quick search found this:
https://solutionsreview.com/data-management/the-best-open-source-data-catalog-tools-to-consider/
Might be a good starting point on what is already out there.
There is also this:
The Directory of Open Access Journals
https://doaj.org/
It seems to be a service, not downloadable software.
--
Adrian Klaver
adrian.klaver@aklaver.com
This is interesting, so the keywords are "Data Catalog" ?
DOAJ seems very, very poor. Just try a search there and then repeat it in
PMC (PubMed Central).
I checked the tool: it behaves better with downloaded PDFs than with PDFs
fetched by URL; in the second case the metadata are poor.
It does not work with NIH articles (but this is a general problem, not Tika's).
What I searched on was 'open source article catalog'.
This is down to copyright issues, I'm sure. For PubMed Central see:
https://www.ncbi.nlm.nih.gov/pmc/about/copyright/
for the ifs/ands/buts that restrict what you can do with the information
and stay legal.
--
Adrian Klaver
adrian.klaver@aklaver.com
Maybe, but still:
https://www.ncbi.nlm.nih.gov/pmc/?term=open+access%5Bfilter%5D+PANDAS+IVIG
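A search link like the one above can be put together programmatically. The following is a small Python sketch using only the standard library; the function name pmc_search_url and its open_access_only parameter are hypothetical, and the only thing taken from the thread is the shape of the URL itself (the "open access[filter]" term followed by the search words).

```python
from urllib.parse import urlencode

# Base address of the PMC search page, as it appears in the link above.
PMC_SEARCH = "https://www.ncbi.nlm.nih.gov/pmc/"


def pmc_search_url(*terms: str, open_access_only: bool = True) -> str:
    """Build a PMC search URL for the given terms.

    When open_access_only is set, the "open access[filter]" term is
    prepended, as in the link posted in the thread.
    """
    parts = ["open access[filter]"] if open_access_only else []
    parts.extend(terms)
    # urlencode turns spaces into "+" and brackets into %5B / %5D.
    return PMC_SEARCH + "?" + urlencode({"term": " ".join(parts)})


if __name__ == "__main__":
    print(pmc_search_url("PANDAS", "IVIG"))
```

Whether automated use of such searches is acceptable is a separate question that depends on the site's terms of use.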
Yeah it is nice to have the resources of the NIH behind you. Still I
would point out under Copyright and License information:
"This article is made available via the PMC Open Access Subset for
unrestricted research re-use and secondary analysis in any form or by
any means with acknowledgement of the original source. These permissions
are granted for the duration of the World Health Organization (WHO)
declaration of COVID-19 as a global pandemic."
Further on PMC Open Access Subset:
https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
Again more ifs/ands/buts.
The point being, dealing with articles is a descent into legalese. I am
not saying this is a show stopper, just that it will consume considerable
resources to sort out. I for one applaud your effort, and given what I
have seen you do with the shipping software over the years, I don't see
this project as out of the realm of possibility.
--
Adrian Klaver
adrian.klaver@aklaver.com
Thank you Adrian, there is no money in this project, but the stakes are
much much higher.
I think the key word here that will help you is biocuration. It's an established field involving people with scientific, computational, and linguistic backgrounds who are familiar with the problem space, so I would suggest talking to people working in this area first to get an idea of what's feasible, what's already out there, etc., as they will know this better than the Postgres community.
You can see an example of the sort of annotation that is fully automated at the moment here:
https://monarchinitiative.org/tools/text-annotate
Given the potential impact on human health, some level of manual involvement in annotation is frequently part of the workflow.
Daniel
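As a very rough illustration of the mix of automated annotation and manual review described above, here is a minimal Python sketch. It is far simpler than what real biocuration tools do (plain dictionary lookup, no ontology), and everything in it is an assumption made for illustration: the SYMPTOM_TERMS vocabulary, the Annotation class, and the "reviewed" flag that a human curator would set.

```python
import re
from dataclasses import dataclass
from typing import List

# Hypothetical vocabulary: labels from the original post, variants invented here.
SYMPTOM_TERMS = {
    "tics": ["tic", "tics"],
    "GI pain": ["gi pain", "abdominal pain"],
    "anxiety": ["anxiety", "anxious"],
}


@dataclass
class Annotation:
    label: str              # normalized symptom label
    start: int              # character offsets of the match in the text
    end: int
    reviewed: bool = False  # set to True once a human curator confirms it


def annotate(text: str) -> List[Annotation]:
    """Find known symptom terms in text; matches are whole words only."""
    found: List[Annotation] = []
    for label, variants in SYMPTOM_TERMS.items():
        pattern = r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append(Annotation(label, match.start(), match.end()))
    return sorted(found, key=lambda a: a.start)


if __name__ == "__main__":
    for ann in annotate("Onset of tics and severe anxiety; GI pain reported."):
        print(ann)
```

Every annotation starts out unreviewed, so nothing produced by the automatic step is treated as confirmed until someone has looked at it.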
-----Original Message-----
From: Achilleas Mantzios <achill@matrix.gatewaynet.com>
Sent: 05 June 2021 10:49
To: pgsql-general@lists.postgresql.org
Subject: Ideas for building a system that parses medical research publications/articles [EXT]
--
The Wellcome Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.