Web scraping technologies in an API world

doi:10.1093/bib/bbt026

Please use this identifier to cite or link to this record: https://hdl.handle.net/1822/32460

Full record

DC Field	Value	Language
dc.contributor.author	Glez-Peña, Daniel	por
dc.contributor.author	Lourenço, Anália	por
dc.contributor.author	López-Fernández, Hugo	por
dc.contributor.author	Reboiro-Jato, Miguel	por
dc.contributor.author	Fdez-Riverola, Florentino	por
dc.date.accessioned	2015-01-07T13:42:37Z	-
dc.date.available	2015-01-07T13:42:37Z	-
dc.date.issued	2014	-
dc.identifier.issn	1477-4054	por
dc.identifier.uri	https://hdl.handle.net/1822/32460	-
dc.description.abstract	Web services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in a position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is set on showing how straightforward it is today to set up a data scraping pipeline, with minimal programming effort, and answer a number of practical needs. For exemplification purposes, we introduce a biomedical data extraction scenario where the desired data sources, well-known in clinical microbiology and similar domains, do not offer programmatic interfaces yet. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as a means to cope with gene set enrichment analysis.	por
dc.description.sponsorship	This work was partially funded by (i) the [TIN2009-14057-C03-02] project from the Spanish Ministry of Science and Innovation, the Plan E from the Spanish Government and the European Union from the European Regional Development Fund (ERDF), (ii) the Portugal-Spain cooperation action sponsored by the Foundation of Portuguese Universities [E 48/11] and the Spanish Ministry of Science and Innovation [AIB2010PT-00353] and (iii) the Agrupamento INBIOMED [2012/273] from the DXPCTSUG (Direccion Xeral de Promocion Cientifica e Tecnoloxica do Sistema Universitario de Galicia) from the Galician Government and the European Union from the ERDF ("A way of making Europe"). H. L. F. was supported by a pre-doctoral fellowship from the University of Vigo.	por
dc.language.iso	eng	por
dc.publisher	Oxford University Press	por
dc.rights	openAccess	por
dc.subject	Web scraping	por
dc.subject	Data integration	por
dc.subject	Interoperability	por
dc.subject	Database interfaces	por
dc.title	Web scraping technologies in an API world	por
dc.type	article	-
dc.peerreviewed	yes	por
dc.comments	CEB14738	por
sdum.publicationstatus	published	por
oaire.citationStartPage	788	por
oaire.citationEndPage	797	por
oaire.citationIssue	5	por
oaire.citationConferencePlace	United Kingdom	-
oaire.citationTitle	Briefings in bioinformatics	por
oaire.citationVolume	15	por
dc.date.updated	2015-01-05T20:51:40Z	-
dc.identifier.eissn	1467-5463	-
dc.identifier.doi	10.1093/bib/bbt026	por
dc.identifier.pmid	23632294	por
dc.subject.wos	Science & Technology	por
sdum.journal	Briefings in bioinformatics	por
Appears in collections:	CEB - Publicações em Revistas/Séries Internacionais / Publications in International Journals/Series

Files in this record:

File	Description	Size	Format
document_14738_1.pdf		548.87 kB	Adobe PDF
