Little helpers: Web scraping

[Image: Several friendly little helper monsters scraping webs. Generated by Adobe Firefly (beta)]

In the first part of our "Little Helpers" series, we show how to download data from websites automatically on a large scale.

Downloading data from the web is sometimes the only way to obtain research sources. When dealing with dozens or hundreds of documents, manual downloading is a time-consuming, mindless and error-prone process. So-called web scraping can help download such resources quickly and automatically. However, many modern websites no longer lend themselves to classic scraping: their pages are rendered with JavaScript or similar techniques, so the HTML delivered by the server is just a scaffold without the actual page content (see the sketch below). One solution is a "remote browser" that calls up pages, clicks on fields, fills them in and submits queries. This makes it possible to write small scripts that, for example, enter a search term and then automatically download the results and store them locally.
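To see the problem in practice, compare what a classic scraper receives: a plain HTTP request to a JavaScript-rendered page returns little more than an empty shell. A minimal sketch, assuming a purely hypothetical search URL:

```python
import requests

# Fetch the raw HTML exactly as a classic scraper would.
# The URL is a hypothetical placeholder for a JavaScript-rendered site.
html = requests.get("https://example.org/search?q=Basel", timeout=30).text

# On such a site this prints a near-empty scaffold of <script> tags;
# the actual search results are only injected by JavaScript after load.
print(html[:500])
```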

This functionality is provided by Selenium WebDriver. "Selenium with Python" lets you control the WebDriver from a script and perform automatic web scraping. For tech-savvy and experimental researchers, RISE has put several examples of this in its GitHub repo.
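As an illustration, here is a minimal sketch of such a script against Selenium 4: it opens a search page, enters a term, waits for the JavaScript-rendered results and saves the linked documents locally. The URL, the field name "q" and the "a.result" selector are hypothetical placeholders and not taken from the RISE examples:

```python
import pathlib

import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Run the browser without a visible window.
options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

try:
    # Open the (hypothetical) search page.
    driver.get("https://example.org/search")

    # Fill in the search field and submit the query.
    driver.find_element(By.NAME, "q").send_keys("Basel", Keys.RETURN)

    # Wait until the JavaScript-rendered result links have appeared.
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "a.result"))
    )

    # Collect the target URLs of all result links.
    links = [
        a.get_attribute("href")
        for a in driver.find_elements(By.CSS_SELECTOR, "a.result")
    ]
finally:
    driver.quit()

# Store each document locally, named after the last URL segment.
out_dir = pathlib.Path("downloads")
out_dir.mkdir(exist_ok=True)
for url in links:
    name = url.rstrip("/").rsplit("/", 1)[-1] or "index.html"
    response = requests.get(url, timeout=30)
    response.raise_for_status()  # stop on HTTP errors
    (out_dir / name).write_bytes(response.content)
```

Since Selenium 4.6, a matching browser driver is downloaded automatically on first use, so installing the selenium and requests packages is usually all the setup required.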

Especially for larger collections, the time savings achievable this way are enormous.

We are happy to advise members of the University of Basel on how to adapt the relevant scripts and embed them in a broader data management strategy. No prior technical knowledge is necessary for a consultation. We look forward to your inquiry.