The Best of Both Worlds - Combining Human and Automated Coding
Human coding of large text corpora is an important but very labor-intensive and time-consuming pillar of political science research. The rise and steady improvement of automated classification models therefore looks like an ideal way to ease this burden, allowing researchers to complete large coding tasks in a fraction of the time and effort typically required. However, the reliability, consistency and quality of automated coding remain a concern. Can the computer really take over from humans, and how comparable will the results be? This question is especially important for long-standing human coding projects like the Manifesto Project, which need to ensure reliable results not only at a single point in time but also over time.
In this project, we examine and discuss the full process of integrating an automatic text classification model into the fine-grained coding procedure of the Manifesto Project. We (1) compare the performance of a range of text classification methods on the manifesto classification task, including traditional bag-of-words models, word-embedding feature approaches and transformer-based models. Based on this comparison, we (2) establish a new state-of-the-art approach for manifesto classification: a shared-layer XLM-RoBERTa model that uses the context of the sentence to be classified. In light of the capabilities and remaining shortcomings of the model, we (3) discuss the potential benefits and limitations of automated coding approaches for human coding tasks. We consider three application areas in particular: different forms of quality control, (partially) automatic labeling of whole manifestos, and code suggestions or pre-selections for human coders. Finally, we (4) use these insights to explore and test avenues for applying automated classification models across different types and domains of political text (e.g., manifestos, parliamentary speeches, press releases, tweets), potentially extending the manifesto coding scheme to data it does not yet cover.
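To make the last two application areas more concrete, the following is a minimal sketch of how classifier output might be routed either to automatic labeling or to a shortlist of code suggestions for a human coder. It is an illustration only, not the project's actual procedure: the function names, the placeholder code labels, the shortlist size `k` and the confidence threshold `auto_threshold` are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Turn raw per-code scores into probabilities that sum to 1."""
    highest = max(scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exps = {code: math.exp(s - highest) for code, s in scores.items()}
    total = sum(exps.values())
    return {code: e / total for code, e in exps.items()}


def route_sentence(scores, k=3, auto_threshold=0.9):
    """Decide how one sentence is handled (hypothetical routing rule).

    If the most probable code reaches auto_threshold, the sentence is
    labeled automatically. Otherwise the k most probable codes are
    returned as suggestions for a human coder.
    """
    probs = softmax(scores)
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    top_code, top_prob = ranked[0]
    if top_prob >= auto_threshold:
        return {"mode": "auto", "code": top_code, "suggestions": []}
    return {
        "mode": "human",
        "code": None,
        "suggestions": [code for code, _ in ranked[:k]],
    }


# Placeholder scores standing in for real model output.
confident = {"code_a": 8.0, "code_b": 1.0, "code_c": 0.0}
uncertain = {"code_a": 1.0, "code_b": 0.8, "code_c": 0.1, "code_d": -1.0}

print(route_sentence(confident))
print(route_sentence(uncertain))
```

In such a setup the threshold controls the trade-off between saved coding effort and the risk of uncorrected model errors, so it would need to be calibrated against human-coded data rather than fixed in advance.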