IDLab Refresher Course: Analyzing Text Data in R and Python
Despite the summer holidays, IDLab members and trainees are undergoing advanced training in automated text data analysis
Working in the Lab provides an opportunity to constantly learn new things and improve. This time, IDLab members are updating their knowledge of text analysis in R and Python. The lecturers are Petr Parshakov, Elena Veretennik, Sofia Paklina and Alexey Buzmakov.
The course covers methods of preparation, analysis and visualization of text as data in social (economics & management) research. Within the framework of interactive lessons, the basics of web scraping of texts are presented, tools for preprocessing text data (cleaning, converting to a single format, removing stop words / characters using regular expressions). We will learn to recognize named entities, conduct morphological and semantic analysis of documents. Examples from different datasets (in Russian and English) will present different representations of texts, methods for assessing semantic distance, tools for breaking down texts into thematic groups (clustering, LDA) and subsequent visualization. Using the example of additional training of the BERT class model, the experience of using neural networks for processing textual data, including for the purpose of building predictive models, will be presented.
The first lesson was devoted to web scraping using the example of reviews of Russian banks. Lecturer - Petr Parshakov.