CUSO Competence Workshop “Data analysis in R”
June 8–9, 2018
Université de Neuchâtel



Stefan (Material & GitHub)

Ulrike (CART) or Dropbox link

Dylan (MCA) or website link

Joe (GitHub link)

Workshop description

The workshop is aimed at all PhD/MA students in linguistics, Digital Humanities, and related fields with general and specific questions about how to explore or analyze their data. It is not tied to a specific language, field, or data type.

Unlike many workshops that introduce statistical methods and exemplify them by means of someone else’s data or toy sets, the idea of this interactive workshop is that you bring your own data. Challenge our expeRts with questions on how to process, analyze, visualize, or document your data! The aim is to train your skills in independent analysis through hands-on discussion of your project with our experts as well as your peers.

On DAY 1, our experts give tutorials in their fields, with pointers for further exploration. We switch the workshop mode for DAY 2: after a session of participants’ elevator-pitch presentations (max. 2 minutes each), you consult with our experts and discuss your data in small groups. Our speakers are experienced in data analysis and statistical methods (data processing, annotation, extraction, summary statistics, regression, classification, etc.).

To get the most out of the workshop, please follow the instructions below. If you have questions, do not hesitate to contact me at

Some people have asked if the workshop is suitable for PhD candidates early in their doctorate or MA students. Yes! Do come along, and bring whatever you already have, even if it is only questions, early corpus data, or small samples. Describe your (prospective) project in the pre-workshop survey and we can arrange a consultation session on such issues.


1. Please register through the CUSO website if you are a PhD student at a CUSO-affiliated university.

2. In addition, please complete this pre-workshop survey by June 1. It includes a self-assessment of your R experience and will help us get to know you, your project, and what you would like to get out of the workshop. (If you register later than June 1, please fill out the form ASAP after registering.)


1. Please go through my R basics tutorial on YouTube (8 short clips, YouTube playlist).

2. For the elevator-pitch presentation, please prepare ONE PowerPoint slide to present your project in no more than 120 seconds, addressing (i) your project/question, (ii) data type, and (iii) problems/issues. Keep your slide as accessible as possible; it’s meant to give everyone an overview of what you do and the questions you are dealing with. Send your slide to no later than June 7, 2018. The earlier you send it, the better I can provide feedback.



Salle RS.38
Faculté des lettres et sciences humaines (FLSH)
Espace Louis-Agassiz 1
Université de Neuchâtel

DAY 1
09.30–09.45 Welcome
09.45–11.30 Stefan Hartmann (Bamberg): Data visualization using R
11.45–13.30 Ulrike Schneider (Mainz): CART trees and random forests
13.30–14.30 Lunch
14.30–16.15 Dylan Glynn (Paris): Using FactoMineR / explore and ca in R
16.30–18.15 Joseph Flanagan (Helsinki): Reproducible research with R


DAY 2
09.30–10.15 Elevator-pitch presentations
10.15–10.45 Coffee break
10.45–12.30 Consultation 1
12.30–14.00 Lunch
14.00–15.30 Consultation 2
15.45–17.15 Consultation 3
17.15–17.30 Closing/farewell
17.30 Socializing/farewell drinks

Session abstracts (more to follow)

Stefan Hartmann: Data visualization using R

Visualising data is usually the first step in data analysis. Arguably it is also the most important one. In this course, we will discover how simple scatterplots, line plots, boxplots, and bar charts can be created in R. Using hands-on examples, we will learn what we have to keep in mind when preparing the data, which plot types can be used for which data types, and how plots can be efficiently modified for different purposes (e.g. with and without color, with bigger and smaller axis labels, etc.). In addition, we discuss “best practices” of data visualization and how to avoid pitfalls like overplotting.
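
As a rough illustration of the kind of plot adjustments described above, here is a minimal sketch in base R. It assumes the built-in iris dataset rather than any workshop data, and the colours and label sizes are arbitrary choices, not the settings used in the session.

```r
# Sketch only: built-in iris data, base graphics, arbitrary styling choices.

# Enlarge axis labels for all following plots; keep the old settings.
old_par <- par(cex.lab = 1.4)

# Boxplot of a numeric variable by group, in greyscale
# (a version that still works when printed without colour).
bp <- boxplot(Sepal.Length ~ Species, data = iris,
              col = "grey80",
              xlab = "Species", ylab = "Sepal length (cm)")

# Scatterplot: jitter plus semi-transparent points keep overlapping
# observations visible instead of stacking them on top of each other.
plot(jitter(iris$Sepal.Length), jitter(iris$Sepal.Width),
     pch = 16, col = adjustcolor("steelblue", alpha.f = 0.4),
     xlab = "Sepal length (cm)", ylab = "Sepal width (cm)")

# Restore the previous graphical settings.
par(old_par)
```

Setting cex.lab once through par() changes the label size for every subsequent plot, which is one way to switch between a version for slides and a version for print.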

Ulrike Schneider: CART Workshop

Classification and Regression Trees (CART trees) and random forests are methods in analytical statistics in which datasets are recursively split into subgroups that are more homogeneous than their parent groups. This procedure produces handy visualisations of the data, which can be statistically evaluated. Additional benefits of the procedure are its ability to cope with multinomial outcomes, complex interactions, large numbers of predictors, and skewed distributions (e.g. a few high-frequency types but also many hapaxes). In the workshop, we will use the party package in R to take first steps towards analysing your data with the help of trees/forests. I will show you how to generate a tree and how to interpret it using lots of examples. We will also look at some potential pitfalls when using trees and at when it is useful to ‘upgrade’ to forests (which we will also generate and interpret).
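
To make the idea of splitting into more homogeneous subgroups concrete, here is a minimal base R sketch of a single split. It is not the algorithm implemented in the party package (which selects splits via permutation tests); it uses Gini impurity as a simple stand-in, and the helper names gini and best_split as well as the use of the built-in iris dataset are assumptions made for illustration.

```r
# Sketch only: one impurity-based split on one numeric predictor.
# gini() and best_split() are hypothetical helpers, not party functions.

# Gini impurity of a set of class labels: 1 - sum of squared class shares.
# 0 means the group is perfectly homogeneous.
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

# Try every midpoint between neighbouring observed values of x and keep
# the threshold whose two subgroups have the lowest weighted impurity.
best_split <- function(x, labels) {
  values <- sort(unique(x))
  thresholds <- (head(values, -1) + tail(values, -1)) / 2
  impurity <- sapply(thresholds, function(thr) {
    left <- labels[x <= thr]
    right <- labels[x > thr]
    (length(left) * gini(left) + length(right) * gini(right)) / length(labels)
  })
  list(threshold = thresholds[which.min(impurity)],
       parent_impurity = gini(labels),
       child_impurity = min(impurity))
}

result <- best_split(iris$Petal.Length, iris$Species)
result
```

A tree repeats this search within each resulting subgroup, over all predictors, until a stopping criterion is met. A forest grows many such trees on resampled data and aggregates their predictions.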

Dylan Glynn: Multiple Correspondence Analysis

Multiple Correspondence Analysis (MCA) is an exploratory statistical technique designed to help one identify patterns in complex categorical data. Although this is a reasonably restricted use, for those of us working in corpus linguistics, sociolinguistics or discourse analysis, it is often exactly the kind of tool we need. It works by looking for associations and dissociations across our different variables and visualising those relations in 2-dimensional plots. The analysis is extremely simple to apply. The main difficulties lie in interpreting the resulting visualisations, because of the complexity of the structure the plots are trying to represent.
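
For readers who want to see what happens under the hood, here is a minimal base R sketch that computes an MCA by hand, as a correspondence analysis of the indicator matrix. It is an illustration under stated assumptions: it uses the built-in Titanic table rather than linguistic data, and it does not use FactoMineR, whose output may be scaled or presented differently.

```r
# Sketch only: MCA as a correspondence analysis of the indicator matrix,
# using the built-in Titanic table (not workshop data).

# Expand the frequency table to one row per person, four categorical variables.
tab <- as.data.frame(Titanic)
d <- tab[rep(seq_len(nrow(tab)), tab$Freq), c("Class", "Sex", "Age", "Survived")]

# Indicator (dummy) matrix: one 0/1 column per category of every variable.
Z <- do.call(cbind, lapply(names(d), function(v) {
  m <- sapply(levels(d[[v]]), function(l) as.numeric(d[[v]] == l))
  colnames(m) <- paste(v, levels(d[[v]]), sep = ":")
  m
}))

# Standardised residuals from independence, then singular value decomposition.
P <- Z / sum(Z)
r <- rowSums(P)
cm <- colSums(P)
S <- (P - outer(r, cm)) / sqrt(outer(r, cm))
dec <- svd(S)

# Principal coordinates of the categories on the first two dimensions.
coords <- sweep(dec$v[, 1:2], 2, dec$d[1:2], "*") / sqrt(cm)
rownames(coords) <- colnames(Z)
round(coords, 2)
```

Plotting the two columns of coords against each other gives the kind of 2-dimensional map described above: categories that tend to co-occur in the same observations end up close together, while categories that rarely co-occur are pushed apart.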