Please use this identifier to cite or link to this item: http://hdl.handle.net/1822/32460

Title: Web scraping technologies in an API world
Author(s): Glez-Peña, Daniel
Lourenço, Anália
López-Fernández, Hugo
Reboiro-Jato, Miguel
Fdez-Riverola, Florentino
Keywords: Web scraping
Data integration
Interoperability
Database interfaces
Issue date: 2014
Publisher: Oxford University Press
Journal: Briefings in Bioinformatics
Abstract(s): Web services are the de facto standard in biomedical data integration. However, there are data integration scenarios that cannot be fully covered by Web services. A number of Web databases and tools do not support Web services, and existing Web services do not cover all possible user data demands. As a consequence, Web data scraping, one of the oldest techniques for extracting Web contents, is still in a position to offer a valid and valuable service to a wide range of bioinformatics applications, ranging from simple extraction robots to online meta-servers. This article reviews existing scraping frameworks and tools, identifying their strengths and limitations in terms of extraction capabilities. The main focus is on showing how straightforward it is today to set up a data scraping pipeline, with minimal programming effort, and answer a number of practical needs. For exemplification purposes, we introduce a biomedical data extraction scenario where the desired data sources, well known in clinical microbiology and similar domains, do not yet offer programmatic interfaces. Moreover, we describe the operation of WhichGenes and PathJam, two bioinformatics meta-servers that use scraping as a means to cope with gene set enrichment analysis.
Type: Article
URI: http://hdl.handle.net/1822/32460
DOI: 10.1093/bib/bbt026
ISSN: 1477-4054
e-ISSN: 1467-5463
Peer-Reviewed: yes
Access: Open access
Appears in Collections: CEB - Publicações em Revistas/Séries Internacionais / Publications in International Journals/Series

Files in This Item:
File: document_14738_1.pdf
Size: 548.87 kB
Format: Adobe PDF
