Tuesday, May 12, 2015

Web scraping with R

CO:STA Presentation by Simon Munzert (Universität Konstanz)

The Internet offers a wealth of opportunities to learn about public opinion and the behavior of political and other actors. Data from social networks, search engines, and web services open up new ways of measuring human behavior and preferences at previously unknown velocity and variety.

Fortunately, the open-source programming language R provides advanced functionality for gathering data from virtually any data source on the Web, whether via classical screen scraping, automated browsing, or tapping APIs. This allows researchers to stay in one programming environment throughout data collection, tidying, analysis, and publication.

The talk gives an overview of the web technologies fundamental to gathering data from internet resources. Further, we will learn about state-of-the-art tools and packages for web scraping with R. If time permits, we will also discuss subtleties of the web scraping workflow, such as how to ensure reproducibility and how to stay friendly on the Web.

Resources:

Barberá, Pablo (2015). Birds of the Same Feather Tweet Together: Bayesian Ideal Point Estimation Using Twitter Data. Political Analysis 23(1): 76-91.

Mellon, Jonathan (2014). Internet Search Data and Issue Salience: The Properties of Google Trends as a Measure of Issue Salience. Journal of Elections, Public Opinion and Parties 24(1): 45-72.

Munzert, Simon, Christian Rubba, Peter Meissner, and Dominic Nyhuis (2015). Automated Data Collection with R. A Practical Guide to Web Scraping and Text Mining. Hoboken, NJ: John Wiley & Sons.