Spoken language corpora as open research data: the example of KIParla
CHORD talk in interaction
Freely available corpora of spoken language allow linguists and interaction scholars to reuse existing research data, relieving them of the time-consuming task of recording and transcribing talk and allowing for comparative and diachronic inquiries. It is not easy for researchers, though, to find data that precisely match their needs. Existing oral data may be inadequate to respond to a particular research question or to be analysed with a particular method. Further problems are related to data findability (archives with low visibility or obscure search interfaces), accessibility (inactive or discontinued servers), interoperability (diverse file formats) and comparability (inconsistent methodologies or missing metadata).
In 2016, Caterina Mauri from the University of Bologna faced exactly these problems in her research on the linguistic expression of ad hoc categories (LEADhoC). She analysed existing resources for spoken Italian, concluded that it was necessary to gather new data, and... decided to create her own open access corpus! She involved Massimo Cerruti and Eugenio Goria from the University of Turin and Silvia Ballarè, then a PhD student at the Universities of Bergamo and Pavia, all interested in language contact, variation, and grammar. The team set out to design a resource that would suit their own goals, but also those of a broader group of users. It would contain varied and recent spoken language data with transparent metadata, provide audio files time-aligned with written transcripts and have an intuitive user interface with advanced search functions. This is how the KIParla corpus of spoken Italian was born.
KIParla was presented and discussed at USI Università della Svizzera italiana on April 28th, 2023. The event was organised by a group of linguists working on data-sharing skills in corpus-based research on talk-in-interaction (CHORD-talk-in-interaction), a project funded by a SwissUniversities Open Research Data grant (ORD). The project group invited Caterina Mauri and Silvia Ballarè (University of Bologna) to talk about their experience with KIParla and had Lorenza Mondada (University of Basel, member of CHORD-talk-in-interaction) discuss their presentation from the viewpoint of the organisers. The event was concluded by an open debate. In this article, we come back to several topics addressed by the invited speakers and the discussant; for more detailed information, we refer the reader to the presentation slides and to the video documentation of the event.
A salient feature of KIParla, foregrounded by the invited speakers in their presentation, is its modularity. KIParla’s modules are subcorpora of different sizes, compiled by different researchers and for diverse aims, but all of them use the same technical infrastructure, search engine (NoSketch Engine) and data formats, so that the corpus can be searched across modules. Comparability is ensured by a limited core set of speaker-related metadata (age, origin, profession) and situation-related metadata (place and interaction type) that all modules have in common. The corpus coordinators stressed the advantages of this modular design, which makes it possible to expand the database by incrementally adding modules – also in collaboration with other researchers and institutions. In the near future, the KIParla team plans to integrate data compiled by colleagues from the Free University of Bolzano. Further plans include lemmatisation (the association of word forms with lemmas) and part-of-speech tagging (the grammatical classification of word forms).
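The idea of a shared core of metadata can be illustrated with a minimal Python sketch. It is not KIParla's actual data model: the class `CoreMetadata`, the function `select`, the module names and all sample values are hypothetical. Only the five core categories (age, origin, profession, place, interaction type) come from the presentation; module-specific descriptors are assumed here to live in a free-form `extra` field.

```python
from dataclasses import dataclass, field

# The five core categories named in the presentation; everything else
# in this sketch (names, values, layout) is a hypothetical illustration.
CORE_FIELDS = {"age_range", "origin", "profession", "place", "interaction_type"}


@dataclass
class CoreMetadata:
    module: str                # which subcorpus the record belongs to
    age_range: str             # speaker-related
    origin: str                # speaker-related
    profession: str            # speaker-related
    place: str                 # situation-related
    interaction_type: str      # situation-related
    extra: dict = field(default_factory=dict)  # module-specific descriptors


def select(records, **criteria):
    """Return the records matching all criteria, using core fields only."""
    unknown = set(criteria) - CORE_FIELDS
    if unknown:
        raise KeyError(f"not a core metadata field: {sorted(unknown)}")
    return [
        r for r in records
        if all(getattr(r, key) == value for key, value in criteria.items())
    ]


records = [
    CoreMetadata("module_a", "26-40", "Bologna", "teacher", "Bologna", "lesson",
                 {"course": "linguistics"}),
    CoreMetadata("module_b", "41-60", "Turin", "nurse", "Turin", "interview",
                 {"education": "secondary"}),
    CoreMetadata("module_b", "26-40", "Turin", "student", "Turin", "lesson"),
]

# A query on a core field works across modules, whatever their extras are.
lessons = select(records, interaction_type="lesson")
```

Because `select` accepts only core fields, a query of this kind stays valid when a new module with different extra descriptors is added.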
While building and maintaining the corpus, the KIParla team faced various challenges. Mauri and Ballarè reported, among other difficulties, how the General Data Protection Regulation (GDPR) of the European Union, which came into effect in 2018, had impacted their work. Sharing data from spoken interactions raises privacy concerns, as audio and video recordings contain personal information about the speakers and about the people they talk about. One of the measures the corpus managers had to take to ensure privacy in compliance with the GDPR was to modify their metadata category “age”, which was originally defined as an exact number of years and had to be replaced by broader age ranges. The GDPR moreover requires a clear attribution of legal responsibilities regarding data protection and ownership; to meet this requirement, it was necessary to draft a detailed joint data controller agreement between the participating universities. The guest speakers pointed out that privacy regulations can change quickly, with sometimes significant consequences for linguistic work. Therefore, they advised researchers to keep up to date with data protection laws – ideally with the help of a data protection officer – and to keep the contact information of all informants involved in case they need to be contacted again.
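Replacing exact ages with ranges is a simple coarsening step, sketched below in Python. The band boundaries, the labels and the function name `coarsen_age` are assumptions made for illustration; the speakers did not specify which ranges KIParla uses.

```python
# Hypothetical age bands (inclusive bounds); KIParla's real ranges may differ.
AGE_BANDS = [
    (0, 25, "up to 25"),
    (26, 40, "26-40"),
    (41, 60, "41-60"),
]
OPEN_ENDED_LABEL = "over 60"


def coarsen_age(age: int) -> str:
    """Map an exact age in years to a coarser, less identifying range."""
    if age < 0:
        raise ValueError("age must be non-negative")
    for low, high, label in AGE_BANDS:
        if low <= age <= high:
            return label
    return OPEN_ENDED_LABEL
```

For example, `coarsen_age(34)` returns `"26-40"`. Once only the label is stored, the exact age cannot be recovered from the metadata.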
Lorenza Mondada, in her role as discussant, addressed a series of questions related to corpus design, usage scenarios and metadata and drew attention to challenges for corpora posed by the passage of time.
In the process of building a corpus, choices at all levels are not neutral, but based on theoretical and methodological premises, which might not be shared by future corpus users. How did the coordinators of KIParla deal with this dilemma when they designed their corpus to be reusable? In particular, how did they decide how to transcribe the data? Caterina Mauri explained that many fundamental decisions were influenced by the framework of the project within which KIParla was born. The LEADhoC project referred to the fields of language typology, grammaticalisation research, and sociolinguistics, whereas the specific structural complexity of conversation was not in focus. The KIParla recordings were transcribed orthographically in simplified Jefferson style (named after Gail Jefferson, a widely known innovator of transcription practices). These transcripts were then processed automatically, eliminating information about prosody and overlap – a reduction that facilitates the type of corpus searches performed by the team. It is this reduced version that the user sees first when querying the data, whereas the richer transcript can be accessed in a second step if it is deemed necessary for a more detailed analysis. It is certainly impossible to build a corpus and interface that are perfect in absolute terms, Mauri and Ballarè said; what corpus managers can do, however, is to define the door through which users access the data by anticipating some common usage scenarios.
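To give an idea of what such an automatic reduction might look like, here is a minimal Python sketch. It does not reproduce the KIParla pipeline: the function name `reduce_transcript` is hypothetical, and the set of symbols it removes (overlap brackets, pauses, lengthening colons, latching, pace and volume marks, pitch arrows, intonation punctuation) is an assumed subset of Jefferson-style notation chosen for illustration.

```python
import re


def reduce_transcript(line: str) -> str:
    """Strip an assumed subset of Jefferson-style marks, keeping the words."""
    # Overlap: drop the square brackets but keep the overlapped words.
    line = re.sub(r"[\[\]]", "", line)
    # Pauses: micro-pauses "(.)" and timed pauses such as "(1.2)".
    line = re.sub(r"\(\.\)|\(\d+(?:\.\d+)?\)", " ", line)
    # Prosody: lengthening ":", latching "=", pace "> <", soft "°", pitch arrows.
    line = re.sub(r"[:=<>°↑↓]", "", line)
    # Intonation contours written as punctuation attached to words.
    line = re.sub(r"[.,?]", "", line)
    # Collapse the whitespace left behind by the removals.
    return " ".join(line.split())


reduced = reduce_transcript("[sì:] (.) va be:ne, >dai<")
```

Here `reduced` is `"sì va bene dai"`: a plain word sequence that is easier to match in a corpus query, while the richer original line can be kept alongside it for detailed analysis.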
Another major point of discussion was the relation between corpus structure, usage scenarios, and metadata. One approach to building a linguistic database is to design it in advance with the aim of compiling balanced and representative language data, having in mind a clear vision of future user groups and choosing metadata categories accordingly. Another possibility, Mondada explained, is to adopt a more opportunistic approach, accumulating data that are available in a given institutional context. KIParla started off as a balanced, carefully designed project. Since further additions are planned, will it become less coherent and more opportunistic over time? Not necessarily, according to Mauri and Ballarè, because KIParla’s modular structure was, from the very beginning, conceived in such a way as to allow for incremental growth. Each module may contain data and descriptors that are not fully consistent with those of other modules, but since all modules share a core set of metadata, the internal coherence of the corpus as a whole will be maintained.
In her concluding remarks, Mondada reflected on what she called the different layers of historicity of corpora, that is, the various effects of passing time on such resources. First of all, language and interaction data are historically situated because they were recorded at a particular moment in time. This is not always evident in digital corpora, which tend to create an illusion of uniformity that may flatten out the diachronic evolution of data. Secondly, subcorpora are historically situated as transcribed, structured, and annotated data sets created in different places and at different moments in time. Thirdly, the technical tools employed grow old, which causes dreaded problems related to maintenance and updatability. Finally, institutions, too, are vulnerable to change. The invited speakers and the discussant agreed that researchers seeking to publish a sustainable open access corpus face a plethora of challenges related to all layers of historicity, which call for innovative solutions.
Nina Profazi & Johanna Miecznikowski (Università della Svizzera italiana)
Bosco, C. et al. (2020). KIPoS @ EVALITA2020: Overview of the Task on KIParla Part of Speech Tagging. In EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020: Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian Final Workshop [online]. Torino: Accademia University Press.
Mauri, C., Ballarè, S., Goria, E., Cerruti, M. & Suriano, F. (2019). KIParla corpus: A new resource for spoken Italian. In R. Bernardi, R. Navigli and G. Semeraro (eds.), CEUR Workshop Proceedings, vol. 2481 – 6th Italian Conference on Computational Linguistics, CLiC-it 2019.
Mauri, C. & Goria, E. (2018). Il corpus KIParla: una nuova risorsa per lo studio dell’italiano parlato. In F. Masini and F. Tamburini (eds.), CLUB Working Papers in Linguistics, 2, 96-116. Bologna: CLUB – Circolo Linguistico dell’Università di Bologna.
Mondada, L. (2018). Transcription in linguistics. In L. Litosseliti (ed.), Research Methods in Linguistics. London: Bloomsbury, 85–114.
European Union, General Data Protection Regulation (GDPR)
European Union, Your Europe. Data protection under GDPR
Giordano, R., Alfano, I., Parlaritaliano, observatory devoted to the study of Italian speech
Mauri, C. et al., The linguistic expression of ad hoc categories (LEADhoC)
Miecznikowski, J., Greco, S., Luginbühl, M., Mondada, L., Pekarek, S., Profazi, N. & Rocci, A., Data-sharing skills in corpus-based research on talk-in-interaction (CHORD-talk-in-interaction)
Scuola Normale Superiore, Biblioteca, Corpora della lingua italiana
Sketch Engine, NoSketch Engine (open source version of Sketch Engine)
SwissUniversities, Open Research Data Grants (ORD)