Social interaction is among people. Legal, technical, and ethical explorations of personal information and its removal in talk-in-interaction as data
On April 28th, 2023, linguists and interaction scholars gathered at Università della Svizzera italiana to listen to Caterina Mauri, Silvia Ballarè and Lorenza Mondada, who presented and discussed the KIParla corpus of spoken Italian from an open research data perspective (see our report here). The event, organized by the CHORD-talk-in-interaction project team, concluded with an open debate among all participants. In this article, we pick up some of the issues raised in that debate and develop them, providing background information for a better understanding. Guided by concerns expressed by the workshop participants, we will consider legal, technical, and ethical aspects of processing audio- and video-recorded spoken language as open research data: How are key concepts such as personal information, anonymisation, pseudonymisation and de-identification to be defined? Which methods and techniques are used in interactional linguistics for dealing with data classified as personal? How can research demands, legal requirements and the protection of the research participants’ privacy be harmonised?
On identifiers, pseudonyms, and additional information in the GDPR
Linguists interested in the analysis of spoken interaction work with personal data of the recorded speakers and are inevitably affected by data protection regulations. It therefore comes as no surprise that part of the debate during the workshop directly or indirectly revolved around the General Data Protection Regulation (GDPR) of the European Union. Not only had it impacted the publication of the KIParla corpus in Italy, as reported by the guest speakers; workshop participants from various other countries, Switzerland included, had come across it in their research. Several topics mentioned in the debate – such as the anonymisation of transcripts and audio recordings, informed consent, and consent withdrawal – fall under its scope. It seems worth the effort to briefly enter the legal thicket of European data protection rules before addressing those topics.
The EU’s General Data Protection Regulation came into force in 2018. It governs the processing of personal data relating to EU citizens and residents and sets standards for their protection. Switzerland has its own Federal Act on Data Protection (FADP), which will be replaced in September 2023 by a revised Federal Act on Data Protection that enhances compatibility with European law, especially with the GDPR. We will concentrate on the GDPR here.
The GDPR defines personal data as “any information relating to an identified or identifiable natural person (‘data subject’)” (art. 4 §1). Such information includes direct or strong identifiers such as names, identification numbers or location data, but also “factors specific to the physical, physiological, genetic, mental, economic, cultural or social identity of that natural person” (GDPR art. 4 §1), the so-called indirect or weak identifiers. If data of this kind is to be “processed”, which means used in any way, e.g., for linguistic analysis or storage in a corpus, the GDPR requires that the integrity and right to privacy of the person in question be respected. To protect a person’s privacy, data processors can delete all personal information in a dataset, alter it so thoroughly that the data subject’s identity can no longer be inferred, or ask the data subject for consent to process their personal data (GDPR art. 6 §1a). We will first explore the legal definitions relevant to the removal or alteration of personal information and come back to the issue of informed consent towards the end of this article.
Removing information that makes it possible to identify a person is the goal of what scholars in linguistics and the social sciences commonly call anonymisation. A reading of the GDPR reveals the almost total absence of this term, whereas the regulators frequently refer to pseudonymisation. The two concepts differ in terms of the applicability of European data protection law, and their demarcation in the GDPR hinges on definitional subtleties regarding the restorability of personal information.
Pseudonymisation falls under the scope of the regulation and is defined as “the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information, provided that such additional information is kept separately and is subject to technical and organisational measures to ensure that the personal data are not attributed to an identified or identifiable natural person” (GDPR art. 4 §5). “Additional information” typically refers to separately stored decryption keys or lists of aliases that map pseudonyms (false names) to the original identifiers (true names). The demands on effective pseudonymisation are quite high. The data processor needs to assess whether “means are reasonably likely to be used” (GDPR, recital 26) to identify a person, taking into account “all objective factors, such as the costs of and the amount of time required for identification, […] the available technology at the time of the processing and technological developments” (ibid.). If the assessment yields a negative result, privacy can be considered to be sufficiently protected to process the data.
Anonymisation, on the other hand, is not clearly defined in the GDPR, because anonymous data is in fact excluded from the regulation’s scope. In recital 26, anonymous data is described as “information which does not relate to an identified or identifiable natural person” or as “personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable”. The authors explain that “[t]his regulation does not therefore concern the processing of such anonymous information [...].” Given this definition of the adjective anonymous, the verb to anonymise and the corresponding noun anonymisation (neither of which occur in the GDPR) must be understood to denote a permanent, irreversible removal of all personal information, without the residual risks of restorability that characterise pseudonymisation.
A terminological digression
The interpretation of anonymisation that can be derived from the GDPR’s definition of the underlying adjective appears to be more categorical and restrictive than what scholars in linguistics and the social sciences usually intend by the term. In these fields, the word is indeed often employed as a quite generic umbrella term (see for example Mondada 2005 and Baude and colleagues’ 2006 guide to best practices for oral corpora), even in publications that explicitly refer to European data protection laws, such as the Swiss FORS guides no. 11 and 20 (Stam & Kleiner 2022, Stam & Diaz 2023), the Finnish Social Science Data Archive’s article on Anonymisation and personal data, or papers about anonymisation measures in specific corpora (e.g., Çetinoğlu & Schweitzer 2022). This is also how the word was used during the workshop debate we are concerned with here.
One reason why linguists and social scientists continue to employ anonymisation as a cover term is probably that it captures the intent behind the procedures so named, a perfectly adequate common denominator from the viewpoint of those who perform them. Admittedly, however, the fact that the desired result is often not obtained is good reason to use the term with caution. How should it be replaced, then? The GDPR refers quite often to pseudonymisation, giving it the status of a prototype within the set of procedures that prevent a person’s identification and thereby favouring a broadened interpretation of the term. Pseudonymisation, however, originally denotes a specific way of manipulating text. Its extension to a rather diverse set of techniques is not very intuitive, especially when no verbally encoded information is involved (as in voice distortion, for example), and it is hardly suitable as a hypernym.
To refer generically to the processing of data to protect the data subjects’ privacy, an alternative option is to resort to the more recent term de-identification (Italian de-identificazione, French dé-identification, German De-Identifizierung). The word is absent from the European GDPR but seems to be firmly entrenched in a range of sectors globally, for example healthcare and forensic science, especially where big data are concerned, and has found its way into US legislation (for example the California Consumer Privacy Act). It is used both as a cover term (“the process of removing personally identifiable information from data collected that is stored and used by organizations”, Polonetsky, Tene and Finch 2016, p. 594) and, in some contexts, in a more specific meaning denoting a relatively high “gradation of identifiability” (ibid., p. 595) in which “direct and known indirect identifiers have been either removed or manipulated in a fashion that breaks the linkage between the information and the data subject” (ibid., p. 617), without, however, reaching strict anonymity. It is difficult to predict whether the word de-identification will gain ground in qualitative research. It figures as a hypernym in the FORS guide no. 5 (Kruegel 2019) but is definitely rare in linguistics and interaction studies. In the present article, we will use it several times in its broad sense, without any pretence of proposing a perfectly adequate terminological solution.
Removing personal information from transcripts, voices, and videos
We will put the legal and terminological intricacies aside now and have a closer look at various techniques to remove personal information in interaction corpora, their limitations, and the epistemological challenges they raise (cf. also Mondada 2005, pp. 20-27).
Specialists of spoken interaction work with audio and video recordings as primary data, from which they create written, human- and machine-readable transcripts as secondary data. Techniques for de-identifying written transcripts are by now well established in the linguistic community and include typical pseudonymisation procedures such as replacing personal names with aliases while maintaining the original syllable structure of the word or replacing place names with generic surrogates such as “City A” or “Street B” (Stam & Diaz 2023). Audio and video recordings, however, add an extra layer of complexity, as not only the content of talk, but also the sound of the human voice or the image of a person’s physical appearance can point to that person’s identity. When sharing audio and video recordings in linguistic research, researchers might consider technical options to alter the voice and physical appearance of data subjects (for an overview of available procedures see the survey published in 2016 by Ribaric and colleagues), while preserving a maximum of information that is of scientific value.
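The alias-replacement step for written transcripts can be sketched in a few lines of code. The following Python fragment is a minimal illustration, not a production tool: the identifiers, aliases and transcript line are invented, but the logic mirrors the pseudonymisation procedure described above, including the point that the alias table is, in GDPR terms, the “additional information” that must be stored separately and under access restrictions.

```python
import re

# Illustrative alias table: this mapping is the "additional information"
# of GDPR art. 4 §5 and must be kept apart from the pseudonymised
# transcript, under technical and organisational safeguards.
ALIASES = {
    "Martina": "Paola",       # personal name -> alias
    "Lugano": "City A",       # place name -> generic surrogate
    "Via Nassa": "Street B",  # street name -> generic surrogate
}

def pseudonymise(transcript: str, aliases: dict) -> str:
    """Replace every known identifier in the transcript with its alias."""
    for real, alias in aliases.items():
        # word boundaries keep the replacement from touching partial matches
        transcript = re.sub(rf"\b{re.escape(real)}\b", alias, transcript)
    return transcript

print(pseudonymise("MAR: Martina abita a Lugano in Via Nassa", ALIASES))
```

A real workflow would additionally choose aliases that preserve syllable structure, as mentioned above, and keep the alias table encrypted or offline.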
A powerful approach to make speakers unrecognisable in audio recordings is to distort their voice, i.e., to alter the voice’s timbre. To what extent, however, does this method affect prosodic properties of speech that are essential to linguistic analysis? Are separate microphones for each speaker required to obtain satisfactory results? Is an animated conversation still comprehensible when voices are distorted? Still little is known about the technical requirements, feasibility and epistemological implications of voice distortion when applied to conversational data gathered for linguistic or interactional analysis. The existing phonetic and computational literature on voice transformation and on its de-identifying applications (e.g., Tomashenko et al. 2020, Tavi, Kinnunen, Hautamäki 2022) is not directly concerned with these questions and, within linguistics and the social sciences, hardly any research on the subject has been conducted (with rare exceptions such as Henning Pätzold’s 2005 short review of some voice-changers and their applications in qualitative sociolinguistics). In these fields, voice transformation procedures are, in fact, not much used. The KIParla corpus is no exception: When asked about their approach, Caterina Mauri and Silvia Ballarè explained that they decided against altering voices for practical reasons, even though their university’s legal office had pointed out the legal advantages of this technical measure.
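To make the trade-off concrete, here is a deliberately naive sketch of pitch alteration using only the Python standard library; the function name and factor are invented for illustration, and this is emphatically not an adequate de-identification method. It simply replays the samples at a higher frame rate, which raises the pitch but also compresses the timing, precisely the kind of side effect on pauses, rhythm and intonation that makes voice distortion epistemologically delicate for interaction analysts. Dedicated voice-conversion tools decouple pitch and timbre from tempo, but with effects on prosody that, as noted above, remain underexplored.

```python
import wave

def shift_pitch_naive(in_path: str, out_path: str, factor: float = 1.25) -> None:
    """Crude pitch shift: replay the same samples at a higher frame rate.

    Raising the rate by `factor` raises the perceived pitch, but it also
    shortens the recording, distorting pauses, rhythm and intonation
    contours. Shown only to make the trade-offs tangible.
    """
    with wave.open(in_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(params.nframes)
    with wave.open(out_path, "wb") as dst:
        # getparams() returns a namedtuple, so _replace swaps one field
        dst.setparams(params._replace(framerate=int(params.framerate * factor)))
        dst.writeframes(frames)
```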
As to the de-identification of video recordings, analysts of talk-in-interaction probably use the corresponding techniques more often, at least at the level of single still images or short clips to be included as examples in scientific publications. Scholars face several challenges, though, due to the richness and complexity of visual data. Blurring the faces of interactants, to name a well-known method, is suitable in some cases, but erases multimodal information that is crucial to the analysis of interaction, such as facial expressions and gaze direction. Slightly more information is preserved when outline or negative filters are applied. Besides faces, video images may contain more or less indirect identifiers that should ideally be concealed. To take an imaginary example, if a board game interaction between friends were recorded in one of their homes, the appearance of the rooms might allow viewers who possess “additional information” to recognise the speakers even if their voices had been distorted and video effects had been applied.
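Why blurring erases fine-grained cues can be demonstrated with a toy example. The sketch below implements a plain box blur on a grayscale image represented as nested Python lists; it uses no image library and no face detection (a real pipeline would first locate face regions and blur only those). After averaging, a sharp one-pixel feature is smeared across its whole neighbourhood, which is exactly why details as fine as gaze direction do not survive the operation.

```python
def box_blur(image, radius=1):
    """Average each pixel with its neighbours in a square window,
    clipped at the image borders. Grayscale only, for illustration."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            window = [image[j][i] for j in ys for i in xs]
            out[y][x] = sum(window) / len(window)
    return out

# A single bright pixel (a stand-in for a fine visual cue) ...
sharp = [[0, 0, 0],
         [0, 9, 0],
         [0, 0, 0]]
# ... is spread over the entire neighbourhood after one pass.
print(box_blur(sharp))
```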
The discussions about anonymisation (which we have called de-identification here) that took place during the USI workshop adumbrated various technical issues and made it clear that a soberly realistic attitude should be maintained in this regard. The existence of diverse indirect identifiers, the manifold possibilities of combining them with further information, and the potential value of some of them for linguistic analysis make it extremely difficult to completely eliminate personal information from interactional data. “Anonymisation is a myth”, Lorenza Mondada declared, dampening any exaggerated expectation.
On informed consent and access control
In this complex situation, are there ways for specialists of spoken language to conduct their research and share their data in a legally safe and responsible manner?
Two things are key in this context: the research participants’ right of informational self-determination on the one side and the researchers’ commitment to ethical behaviour on the other. Informational self-determination is a human right and describes “the idea that individuals should be able to decide for themselves when and to what extent their private information can be disclosed to others”, as explained in the FORS guide no. 17 (Diaz 2022, p. 3). It takes priority over data protection laws, which is why regulations such as the GDPR or the FADP permit the processing of personal data when the data subject (here: the research participant) explicitly agrees to it. To ensure that people freely exercise that right when approached to participate in a scientific investigation, researchers need to observe fair, ethical behaviour. Ethics is the branch of philosophy that describes “morally-based principles of conduct that define what is right or wrong to do independently of or beyond our strictly legal obligations” (Diaz 2019, p. 4f.); its principles may be formalised in normative documents (for example guidelines), which however remain distinct from laws in several respects (Dubreuil 2011). The consideration of the right of informational self-determination, in combination with the application of ethical rules of conduct, has led to the use of consent forms in scientific research.
In linguistics, this practice has become increasingly common during the last twenty years and can be considered well established nowadays (for a detailed overview of best practices see Mondada 2005, pp. 5-19). It is part of the researcher’s toolkit both in corpus projects with a large community of data users, such as KIParla, and in smaller endeavours with few readers and hardly any reuse of the data, such as a modest B.A. thesis. Of course, when researchers intend to share their data with a wider audience, consent forms become a crucial element of open research data management. In the workshop, Caterina Mauri and Silvia Ballarè emphasised their importance when trying to harmonise research demands with the strict requirements of the GDPR.
Consent forms are legal documents in which the researcher, on the one hand, explains the investigation’s goals and procedures as well as the risks (and benefits) of participating in it and, on the other hand, asks the research participant to explicitly agree that their personal data be used for the purposes previously explained. The GDPR as well as the FADP prohibit the processing of personal data without the data subject’s consent, which is defined as the “freely given, specific, informed and unambiguous indication of the data subject’s wishes by which he or she, by a statement or by a clear affirmative action, signifies agreement to the processing of personal data relating to him or her” (GDPR, art. 4 §11). The GDPR furthermore grants the right to withdraw previously given consent (art. 7 §3), a right that should be stated in the consent form. Beyond their legal value and the assurance of the participants’ right of informational self-determination, consent forms are often viewed more generally as a safeguard for ethical research. One important ethical principle requires researchers to protect participants from harm, i.e., physical and emotional pain, but also psychological stress and personal embarrassment or humiliation (see Diaz 2019, p. 14) – including harm that could derive from unauthorised identity disclosure. Alongside its other functions, the consent form is a means for researchers to express their commitment to observe this ethical principle.
In an open data scenario of linguistic research, researchers must clearly state in consent forms which methods will be applied to remove personal information, with whom data will be shared and which measures will be taken to prevent potentially harmful access. This is what allows participants to express their “specific, informed and unambiguous” agreement to the use of their data, possibly accepting certain risks of being identified by some data users.
Who the data users are, and by which motives they are driven, is certainly a crucial question for research participants. A common way to create a trustworthy research environment is therefore to exercise some control over the group of users that is granted access to the data. Arguably, the more problematic anonymisation is, the more important access control becomes. Where anonymisation cannot be guaranteed, it may be wise to combine and calibrate de-identifying measures, the participants’ informed consent – including the acceptance of imperfect de-identification – and access control, to find a balance between the protection of data, their openness, and their usability in scientific research.
“[C]alibrating between the three elements means that if you apply more for one element, then you may be able to apply less for another. To take an example, if you determine that your data cannot easily be anonymised, or that you do not have sufficient resources to do so, then you might do less anonymisation but apply stricter controls on access” (Stam & Kleiner, 2022, p. 10).
When building the KIParla corpus, Caterina Mauri’s team proceeded along these lines: all research participants signed an informed consent form that allows for the processing of their personal data, the data was de-identified as far as possible, and access was restricted, differentiating between transcripts and audio recordings. De-identified transcripts are freely accessible via the KIParla website. In contrast, the corresponding audio recordings, in which direct verbal identifiers have been replaced by noise while maintaining the participants’ original voices, can be accessed only by registered users who have an academic affiliation. If participants withdraw their consent to the processing of their data in KIParla, the audio files containing their voices need to be taken down from the server, whereas the transcripts, which have been stripped of direct and known indirect identifiers, are allowed to remain online.
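KIParla’s tiered access model lends itself to a compact illustration. The sketch below is our own schematic rendering, not the project’s actual implementation: the resource names and the two user attributes are invented, but the policy mirrors the description above, with de-identified transcripts open to everyone and audio restricted to registered users with an academic affiliation.

```python
from dataclasses import dataclass

@dataclass
class User:
    registered: bool = False
    academic_affiliation: bool = False

def may_access(resource: str, user: User) -> bool:
    """Tiered access policy: open transcripts, restricted audio."""
    if resource == "transcript":   # de-identified, freely accessible
        return True
    if resource == "audio":        # original voices, restricted tier
        return user.registered and user.academic_affiliation
    return False                   # anything else: deny by default

assert may_access("transcript", User())
assert not may_access("audio", User(registered=True))
assert may_access("audio", User(registered=True, academic_affiliation=True))
```

Denying by default is the conservative choice here: any resource not explicitly classified stays inaccessible until a policy decision has been made for it.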
The technical metaphor of calibration suggests that some craftsmanship is needed to share multimodal interactional data. And indeed, linguists and analysts of interaction need to be aware of relevant laws, know how to approach research participants and de-identify data, be able to fine-tune and control modalities of data access, and skilfully combine various measures to create resources that are both scientifically useful and acceptable for all parties. The increasing institutional demands for open access to knowledge and scientific data, such as the Swiss National Open Research Data Strategy, are currently changing some of the parameters that scholars must take into account in their research design and partly call into question “calibration” practices that have been established in the past. The meeting with those responsible for KIParla has stimulated a fruitful debate on this subject, which CHORD-talk-in-interaction intends to pursue in further encounters in the next academic year.
Nina Profazi & Johanna Miecznikowski (Università della Svizzera italiana)
Papers and guides
Baude, O. et al. (2006). Corpus oraux, guide des bonnes pratiques 2006, Presses universitaires d'Orléans.
Çetinoğlu, Ö, & Schweitzer, A. (2022). Anonymising the SAGT speech corpus and treebank. Paper presented at the Thirteenth Language Resources and Evaluation Conference, 5557–5564.
Dubreuil, B. (2011). Réguler l’éthique par le droit. Klesis – Revue Philosophique, 21, 78-111.
Diaz, P. (2019). Ethics in the era of open research data: some points of reference. FORS Guide No. 03, Version 1.0. Lausanne: Swiss Centre of Expertise in the Social Sciences FORS. doi:10.24449/FG-2019-00003
Diaz, P. (2022). Data protection: legal considerations for research in Switzerland. FORS Guide No. 17, Version 1.0. Lausanne: Swiss Centre of Expertise in the Social Sciences FORS. doi:10.24449/FG-2022-00017
Finnish Social Science Data Archive (FSD). (n.d.). Anonymisation and Personal Data. https://www.fsd.tuni.fi/en/services/data-management-guidelines/anonymisation-and-identifiers/
Kruegel, S. (2019). The informed consent as legal and ethical basis of research data production. FORS Guide No. 05, Version 1.0. Lausanne: Swiss Centre of Expertise in the Social Sciences FORS. doi:10.24449/FG-2019-00005
Mondada, L. (2005). Constitution de corpus de parole-en-interaction et respect de la vie privée des enquêtés : Une démarche réflexive. Rapport sur le projet « pour une archive des langues parlées en interaction. statuts juridiques, formats et standards, représentativité » financé par le programme société de l’Information / archivage et patrimoine documentaire. http://icar.cnrs.fr/projets/corvis/PDF/Mondada05_ethiqueTerrain.pdf
Pätzold, H. (2005). Secondary analysis of audio data: Technical procedures for virtual anonymization and pseudonymization. Forum Qualitative Sozialforschung / Forum: Qualitative Social Research, 6(1). doi:10.17169/fqs-6.1.512
Polonetsky, J., Tene, O., & Finch, K. (2016). Shades of gray: Seeing the full spectrum of practical data de-identification. Santa Clara Law Review, 56(3), 593-629.
Ribaric, S., Ariyaeeinia, A., Pavesic, N. (2016). De-identification for privacy protection in multimedia content: A survey. Signal Processing: Image Communication, 47, 131-151.
Stam, A., & Diaz, P. (2023). Qualitative data anonymisation: theoretical and practical considerations for anonymising interview transcripts. FORS Guide No. 20, Version 1.0. Lausanne: Swiss Centre of Expertise in the Social Sciences FORS. doi:10.24449/FG-2023-00020
Stam, A., & Kleiner, B. (2022). Data anonymisation: legal, ethical, and strategic considerations. FORS Guide No. 11, Version 1.1. Lausanne: Swiss Centre of Expertise in the Social Sciences FORS. doi:10.24449/FG-2020-00011
Tavi, L., Kinnunen, T., & González Hautamäki, R. (2022). Improving speaker de-identification with functional data analysis of f0 trajectories. Speech Communication, 140, 1-10.
Tomashenko, N., Srivastava, B. M. L., Wang, X., Vincent, E., Nautsch, A., Yamagishi, J., Evans, N., Patino, J., Bonastre, J.-F., Noé, P.-G., et al. (2020). Introducing the VoicePrivacy initiative. In Proceedings of INTERSPEECH, Shanghai, China, 1693–1697.
Federal Act on Data Protection (FADP) of 19 June 1992.
Federal Act on Data Protection (FADP) of 25 September 2020.
Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation) (2016). https://eur-lex.europa.eu/eli/reg/2016/679/2016-05-04
California Consumer Privacy Act (CCPA) of 2018.