Text as research data

Text is a very common form of research data and remains one of our primary ways of conveying information to one another as human beings. While text is perhaps most frequently used in research within the humanities, a classic text genre, the interview, is widely employed in social science, medical, economic, artistic, and IT-related research.

On this page, you will find information that applies to text-based research. It begins with four case studies of text-based research and research-related activities. These case studies are based on interviews with researchers and staff at a number of institutions with significant experience of working with text. The case studies are presented both as written text and in video format. The content is the same in both formats. Note that the case studies are in Swedish. Following the case studies is an overview of metadata requirements for text-based research. At the bottom of the page, you will find additional information about file formats, data types, and software, complementing what is described in the case studies.

Case studies

The four case studies cover research where text is the fundamental research data. They come from different disciplines and involve various kinds of studies, with diverse aims and methods. As a result, the research data may appear to differ greatly between the case studies, and their curation places different demands on researchers and research data support staff.

Case 1: Language technology and text processing

The goal of research in language technology is to simplify and improve communication between humans and computers, as well as between people. Language technology encompasses processes and software that allow you to dictate text to your computer without using a keyboard, or that enable your text to be automatically proofread by the computer. It also makes it possible to hold a conversation with someone speaking another language, thanks to automatic translation tools. Language technology involves developing tools for, for example, machine translation, speech recognition, text analysis, speech synthesis, and electronic dictionaries.

Researchers working in language technology often have extensive programming skills. Those who do not have such skills may take advantage of the resources made available through, for example, Nationella språkbanken (the Language Bank of Sweden).

The method in this particular case study involves transforming human-readable text into more machine-friendly formats, such as tables or XML. Since this often involves large amounts of text, computers are trained to perform this task automatically. OCR (Optical Character Recognition) software can convert handwritten text into machine-readable form, thereby digitizing the material. Once the texts have been digitized and annotated, they can be subjected to advanced analysis, such as generating concordance tables.

Concordance table.
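
To illustrate the principle, here is a minimal Python sketch of how a simple concordance (keyword-in-context) table can be generated from plain text. The tokenization rule and window size are illustrative assumptions, not a description of the tools used in the case study.

    import re

    def concordance(text, keyword, window=5):
        # Return keyword-in-context rows: (left context, keyword, right context).
        tokens = re.findall(r"\w+", text.lower())  # naive tokenization (assumption)
        rows = []
        for i, token in enumerate(tokens):
            if token == keyword:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                rows.append((left, token, right))
        return rows

    sample = "The count rose at dusk, and the count was not seen by day."
    for left, kw, right in concordance(sample, "count"):
        print(f"{left:>25} | {kw} | {right}")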

Curation

Studies in language technology often depend on specialized research software. Such software is usually developed at universities or research institutes, rather than by large software companies. As the purpose of the software is research, functionality is often prioritized over user-friendliness. Sometimes the source code is shared via platforms such as GitHub, but not always, and even if the software is available now, there is no guarantee it will remain so in the future. Ideally, therefore, the data should be delivered together with the tool, or at least with instructions for building a tool capable of handling the data.

The data formats used are often open formats, and generally not particularly difficult to interpret. Researchers frequently use established standards, such as the Text Encoding Initiative, ISO-639 for languages (see below), or Universal Dependencies for marking syntactic relations.
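
As an illustration of what such standardized annotation can look like in practice, the following minimal Python sketch reads a single sentence in the CoNLL-U format used by Universal Dependencies. The sentence and its annotation are simplified for the example.

    # One sentence in (simplified) CoNLL-U format. The tab-separated columns are:
    # ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC.
    conllu = (
        "1\tThe\tthe\tDET\t_\t_\t2\tdet\t_\t_\n"
        "2\tdog\tdog\tNOUN\t_\t_\t3\tnsubj\t_\t_\n"
        "3\tbarks\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    )

    for line in conllu.splitlines():
        idx, form, lemma, upos, _, _, head, deprel, _, _ = line.split("\t")
        print(f"{form}: {upos}, attaches to token {head} as {deprel}")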

Annotation is crucial information to preserve. Documentation of how the annotation was carried out is essential for assessing the quality of the data and their suitability for different purposes. It is important to specify which standards have been followed, but also which techniques have been used to create the annotation.

Case 2: Comparing African languages

Comparative linguistics focuses on identifying how similar or different languages are. The research may be based on printed texts or on linguistic material collected through interviews with speakers of different languages. The focus may be on individual words and their pronunciation, or on grammatical constructions within the language.

The method used in this case study involves collecting information by asking informants to translate a number of selected example sentences into their own language, and to explain why they choose a particular word or word order in a given context. This is called “elicitation,” meaning that the researcher prompts the informants to produce specific content.

The material may consist of thousands of sentences, which must then be annotated so that the researcher can more easily identify interesting aspects of the data. The sentences are transcribed into text, and each word is annotated with information about morphology, semantics, and so on. In the completed file, the researcher can then filter sentences using complex combinations of criteria by importing the Excel file into the program OpenRefine (see below).
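
The case study uses OpenRefine for this, but the principle can be sketched in a few lines of Python/pandas. The file name and column names below are hypothetical.

    import pandas as pd

    # Hypothetical annotated material: one row per word, with a sentence ID
    # and columns for morphological and semantic annotation.
    df = pd.read_excel("elicited_sentences.xlsx")

    # Example filter: plural nouns annotated with the semantic role "agent".
    hits = df[(df["pos"] == "NOUN")
              & (df["number"] == "plural")
              & (df["role"] == "agent")]
    print(hits["sentence_id"].unique())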

Curation

Since the data are often table-based, it is easy to store everything in TSV format. Excel files can also contain metadata, entered using the plug-in Colectica for Excel. This information can then be preserved using Colectica for Excel’s function for creating a PDF codebook.

There may also be metadata stored outside the data file itself. Such information may concern the informants and describe, for example, where they grew up, their socioeconomic background, or their parents’ native language. This type of information is important in determining whether a speaker is a reliable source for the language, which may be crucial if two speakers express the same meaning in different ways. It is also valuable to include details about how and when the data collection was carried out, how the speakers were selected, and so on.

Case 3: Distant reading

Distant reading involves analyzing large volumes of text, whether historical or literary sources. Its opposite is close reading, which involves reading a book or a work in detail and carrying out an in-depth analysis of the text. With distant reading, the researcher steps back from the text and looks for patterns across large bodies of literature at once, for example, comparing different authors, genres, or time periods. Distant reading may include, for example:

  • text mining – the process of searching large text collections for the frequency of individual words and how these words have changed over time. It can also be used to study various correlations, such as the historical contexts in which concepts have been used (as reflected in the analyzed texts), which words tend to occur near each other, and so on.
  • topic modelling – a term used for analyses of the thematic structure of texts, based on the study of which concepts are used, in what contexts they occur, and how they relate to one another (a minimal code sketch follows after this list).
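
As a minimal sketch of topic modelling, the following Python example uses scikit-learn (a library chosen here purely for illustration; the case studies do not prescribe one) to fit a two-topic model on three placeholder documents.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Placeholder documents; real distant reading would use full texts.
    docs = [
        "the castle stood dark against the mountains",
        "letters and diaries record the long journey north",
        "the harbour town traded in spices and silk",
    ]

    vec = CountVectorizer(stop_words="english")
    counts = vec.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    # Print the three most heavily weighted words per topic.
    words = vec.get_feature_names_out()
    for topic in lda.components_:
        print([words[i] for i in topic.argsort()[-3:][::-1]])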

Distant reading with Voyant

In this case study, the researcher uses the web service Voyant. Voyant is free software available online. All that is needed to get started is one or more texts to examine.

The case study focuses on Bram Stoker’s novel Dracula from 1897. The main interface of the web service provides several tools simultaneously. Many of these tools are based on word frequencies, and since the most common words are not particularly useful for most analyses, there is a so-called stop word list where words to be ignored can be specified. Since the language here is English, the stop word list includes words such as a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, and so on.

Screenshot from Voyant Tools, showing a text segment from Bram Stoker's novel Dracula.

In the top right corner, you see the Trends tool. This tool shows how common certain words are in different parts of the text. On the x-axis, the text is divided into ten equally long sections, while the y-axis displays word frequency. Both van Helsing and Lucy are mentioned frequently up to the seventh tenth of the text. There is then a dip, after which van Helsing and Mina increase in frequency, but not Lucy. The word “vampire” is not mentioned very often at all, and when it does appear, it is almost exclusively in the second half of the text.
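
The underlying calculation in a tool like Trends is straightforward. As a rough illustration (not Voyant's actual implementation), this Python sketch divides a text into ten equal segments and counts how often a given word occurs in each.

    import re

    def trend(text, word, segments=10):
        # Occurrences of `word` in each of `segments` equal slices of the text.
        tokens = re.findall(r"\w+", text.lower())
        size = max(1, len(tokens) // segments)  # remainder tokens are ignored
        return [tokens[i * size:(i + 1) * size].count(word)
                for i in range(segments)]

    # With the full text of the novel in a file (hypothetical file name):
    # text = open("dracula.txt", encoding="utf-8").read()
    # print(trend(text, "vampire"))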

Below Trends there is a concordance view showing the contexts in which the word “time” appears. It is one of the most frequent words in the text. To the left, there is a so-called word cloud. The word cloud is essentially a more visually engaging way of displaying a frequency table of the most common words. The larger the word appears in the cloud, the more often it occurs in the text.

Curation

The Voyant web service ignores boldface, typeface, and other formatting. It is therefore practical to save the texts as plain .txt files. When working with published literature, it is not always necessary or legally possible to make the full texts available; instead, it may be sufficient to specify exactly which edition has been used, so that others can replicate the analysis.

The stop word list must be preserved, as it directly influences the visualizations. It is also worth considering whether some of the settings in Voyant’s different tools should be saved, in order to ensure that the original analysis can be replicated. Researchers using Voyant often do so in an exploratory rather than strictly analytical way and may not always think of saving every single setting separately. To preserve this information, the built-in export function can be used, which makes it possible to export, among other things, a URL that leads back to the exact same view. However, the long-term stability of such URLs is uncertain, and they may not be reliable from a data archiving perspective.

There are many other tools that can be used for distant reading, such as Python and R. Some researchers even develop their own software or visualizations. Important questions to ask in connection with the curation of this type of data include, for example:

  • How was the material compiled?
  • What has been excluded?
  • How was it digitized?
  • What search parameters were used, and what do they not cover?
  • Can the corpus become outdated, and if so, how can it be updated?

Case 4: Student texts

This case study concerns texts written by students as part of their national exams. Such studies may be conducted by researchers investigating, for example, at what ages children acquire different aspects of language, how social and economic contexts influence children’s language learning, or what challenges immigrant children face compared to those who have grown up in Sweden. Researchers may also use this type of material in pedagogy and educational development, for instance when evaluating previous exams in order to design improved exams in the future. Since the research objectives vary, so do the methods, but the data may nonetheless look fairly similar.

Student essays are usually handwritten, and the first step is to digitize them. This can either be done manually, by transcribing the text, or with OCR (Optical Character Recognition), which converts the text into a digital file.

Three photos of student texts: a handwritten essay and the digitized text, plus the researcher’s exam codes.

Researchers then add metadata directly to the text file, such as exam codes (indicating exactly which examination the students in the document took), an identification number for each individual student, and so on. If the dataset is small, researchers can analyze it manually, but if there are large volumes of texts, it becomes necessary to develop a tool to assist the analysis. Metadata about the students is also collected in an Excel file, which may include whether the student attended a public or private school, their native language, and some additional information.

Curation

This type of material can be curated relatively easily. The data are stored in text files, and the metadata in an Excel file. To be on the safe side, it is advisable to save everything as plain text, that is, TXT files for the essay texts and CSV or TSV for the tables.
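
Such a conversion can be done directly in Excel or scripted; a minimal Python/pandas sketch (with hypothetical file names) might look like this:

    import pandas as pd

    # Convert the student metadata spreadsheet to tab-separated plain text.
    meta = pd.read_excel("student_metadata.xlsx")
    meta.to_csv("student_metadata.tsv", sep="\t", index=False)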

However, the coded information in the text files and Excel spreadsheets should be reviewed carefully. The grading system is one example of something that may change in the future, meaning that additional information may be required to explain how the system worked at the time the student essays were written. In the Excel file, Colectica for Excel can be used to document what all the codes represent. Alternatively, all the codes with explanations can be compiled in a separate text file and delivered together with the data files.

Metadata for text-based research

Text can be used in many different ways and for many different purposes. As a result, the issue of metadata can be quite complex. In the interviews conducted for the case studies, researchers mentioned a range of different types of objects that they considered relevant in terms of metadata. Depending on how researchers use text in their work, relevant elements may be found at a low level – such as words, sentences, or characters – at a high level – such as pages, texts, or files – or at an even higher level – such as archives and accessions. It may also be relevant to include information about the individuals who have contributed information and material to the text research, that is, the informants.

Metadata at the lower levels tends to be embedded in the files themselves, for example, annotations of parts of speech, sentence structure, and the like. This is not metadata that you want to extract and store separately from the text data. Higher-level object types can sometimes be described outside the data file itself, which can make the data easier to overview and, in some cases, more searchable. For example, metadata about letters in a collection of text data can add value for the user when information such as sender, year, and recipient is included in a catalogue entry, while information about the fact that the letters contain nouns, verbs, and prepositions may not add much value.

Information that can be valuable for the reuse of data may include, for example:
  • subject/research area
  • geographical information
  • provenance of the text
  • time series
  • accession number
  • metadata about individuals
  • metadata on digitization
  • technical metadata
  • process metadata
  • file structure
  • archives
  • sources of excerpts
  • type of record and recorder
  • text coordinates
  • missing pages, etc.

It is useful for the metadata to specify which object types are relevant for the data being described. However, different object types require different forms of description, and the wide-ranging uses of text data make it impossible to recommend a single standard. In addition, many researchers already work according to standards that are relevant to their particular study, in which case it is sensible to make use of the fact that there already exists information that is structured in a standardized way.

SND’s data description form is largely based on DDI, but researchers should also ensure that the standardized information they have compiled is preserved in its original format, and preferably state in the dataset description that the data follow a particular standard, or possibly several different standards.


Examples of metadata standards for text-based research

TEI Header

TEI Header is part of a much larger standard, TEI (Text Encoding Initiative), which is a data format rather than a metadata standard. The TEI Header contains metadata associated with data files in TEI format. A TEI Header consists of a bibliographic description of a file – that is, information that can be used to catalogue or reference the file. It also includes details about how the information in the file has been encoded, for example, whether a text based on a historical manuscript has been standardized in spelling, and so on. It is also possible to embed metadata in other formats within the TEI Header; for example, Dublin Core metadata, which is compatible with most systems and software. Finally, there is a revision history that documents how different versions of the file have changed in relation to one another.
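
A heavily simplified example of what a TEI Header can contain is shown below, read with a short Python snippet. Real headers are much richer, and the TEI namespace is omitted here for readability; the content of the example is invented.

    import xml.etree.ElementTree as ET

    tei_header = """
    <teiHeader>
      <fileDesc>
        <titleStmt>
          <title>Dracula: a machine-readable transcription</title>
          <author>Bram Stoker</author>
        </titleStmt>
        <sourceDesc>
          <p>Transcribed from the 1897 first edition.</p>
        </sourceDesc>
      </fileDesc>
      <encodingDesc>
        <p>Spelling has been normalized; line breaks are not preserved.</p>
      </encodingDesc>
    </teiHeader>
    """

    root = ET.fromstring(tei_header)
    print(root.findtext("fileDesc/titleStmt/title"))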

META-SHARE

META-SHARE is a highly detailed standard for describing resources used in language technology, such as speech synthesis or text-to-speech software. These resources may be text-based, but META-SHARE can also be applied to audio, video, software, and other formats. One of META-SHARE’s aims is to enable a computer to determine independently whether a resource described in META-SHARE can be used for a particular purpose. For that reason, META-SHARE contains detailed information about the technical properties of the files being described.

CMDI

CMDI is also designed for language technology resources. It is based on the principle that different organizations should be able to continue using their existing standards while mapping them to a large set of concepts defined in CMDI. In this way, the great diversity of standards – which to some extent is necessary – can be made compatible with one another. This is a promising idea, but the task is highly complex and cumbersome, and therefore CMDI is not easy to work with.

ISO-639

ISO-639 is a set of lists of language names linked to standardized two- or three-letter codes. The codes are used in a wide range of contexts, but there are certain problems, at least from a linguistic perspective. For example, it can be difficult to distinguish between different dialects or historical stages of a given language. Moreover, a great many small or very small languages in the world lack an ISO code. ISO-639 therefore does not cover all the needs one might have, and this remains an unsolved problem, despite various attempts to address it.
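
In practice the codes are simple lookups. A tiny hand-picked sample in Python (the real standard covers thousands of languages):

    # Two-letter (ISO 639-1) and three-letter (ISO 639-3) codes for a few languages.
    ISO_639 = {
        "Swedish": ("sv", "swe"),
        "Swahili": ("sw", "swa"),
        "Zulu": ("zu", "zul"),
    }

    for name, (two, three) in ISO_639.items():
        print(f"{name}: 639-1 = {two}, 639-3 = {three}")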

Data types and data formats

Researchers working with text as research data use a wide range of software, depending on the nature of their research. Some programs are simple and widely known, such as Word and Excel, while others may be custom-built and require considerable technical expertise. Many researchers also work across several different software tools.

OCR software converts analogue text into digital text.

There are many different programs of this kind, some specialized in handwriting but most designed for printed text. The output from OCR software can vary; it may be plain text (TXT or similar), but it may also be in an XML format that contains additional information beyond the letters themselves. For example, the output may include information about the position of a character on the page, which allows for a more accurate reproduction of the form of the original text, not just its content.
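
Schematically, such position-aware OCR output can look like the following made-up XML fragment, here read with Python. Real formats such as ALTO are considerably more detailed; the element and attribute names below are invented for the example.

    import xml.etree.ElementTree as ET

    # Made-up OCR output: each recognized word with its position on the page.
    ocr_xml = """
    <page width="2480" height="3508">
      <word x="310" y="402" w="95" h="28">Dear</word>
      <word x="420" y="402" w="130" h="28">Mother</word>
    </page>
    """

    for w in ET.fromstring(ocr_xml).iter("word"):
        print(w.text, "at x =", w.attrib["x"], ", y =", w.attrib["y"])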

Examples of software used in text-based research

Corpus Workbench (CWB) is a set of tools for working with large text corpora, meaning those containing many millions of words. These tools are often text-based, without a graphical interface, so basic coding skills are required. Once researchers are familiar with the system, they can gain many new insights into their texts.

AntConc is a concordance tool – essentially an advanced search engine for text. For example, it can be used to investigate the contexts in which a word occurs, or how often it appears together with particular other words. Regular expressions can also be used, which makes very advanced searches possible.

MySQL is a system for managing relational databases. There are many other database systems, but MySQL is widely used. Like Excel, databases are well suited for text where different elements belong to clearly defined categories, such as lists of people, occupations, addresses, and different types of relationships between individuals. With such a database, it might be possible, for instance, to examine how common it is for academics in location A to have a certain type of relationship with people in location B compared to factory workers.
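
As a rough illustration of the kind of relational query described above, here is a sketch using Python's built-in sqlite3 module as a stand-in for MySQL, with entirely made-up data.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT,
                             occupation TEXT, location TEXT);
        CREATE TABLE relation (from_id INTEGER, to_id INTEGER, kind TEXT);
        INSERT INTO person VALUES
            (1, 'Anna', 'academic', 'Uppsala'),
            (2, 'Berit', 'factory worker', 'Uppsala'),
            (3, 'Carl', 'merchant', 'Gothenburg');
        INSERT INTO relation VALUES (1, 3, 'correspondence'), (2, 3, 'kinship');
    """)

    # Correspondence between academics in Uppsala and people in Gothenburg.
    rows = con.execute("""
        SELECT p1.name, p2.name
        FROM relation r
        JOIN person p1 ON p1.id = r.from_id
        JOIN person p2 ON p2.id = r.to_id
        WHERE r.kind = 'correspondence'
          AND p1.occupation = 'academic' AND p1.location = 'Uppsala'
          AND p2.location = 'Gothenburg'
    """).fetchall()
    print(rows)  # [('Anna', 'Carl')]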

OpenRefine is a tool for working with table-based text data. Its core functions can be used to clean up messy datasets – for example, standardizing different date formats or splitting first and last names into separate columns. OpenRefine is most often used together with other software to prepare data for a particular stage of analysis, but it can also be used for advanced search and sorting tasks, for example with regular expressions.
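
OpenRefine itself is a graphical tool, but the kinds of clean-up operations mentioned above can be sketched in Python/pandas (with made-up data; the mixed-format date parsing assumes pandas 2.0 or later):

    import pandas as pd

    df = pd.DataFrame({
        "name": ["Lindgren, Astrid", "Lagerlöf, Selma"],
        "date": ["1907-11-14", "20 Nov 1858"],
    })

    # Standardize mixed date formats to ISO 8601 (format="mixed" needs pandas >= 2.0).
    df["date"] = pd.to_datetime(df["date"], format="mixed").dt.strftime("%Y-%m-%d")

    # Split "Last, First" names into separate columns.
    df[["last_name", "first_name"]] = df["name"].str.split(", ", expand=True)
    print(df)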

Voyant is a web service offering a set of simple tools for text analysis. For example, it can generate word clouds, produce word frequency lists to highlight common and uncommon words in a text, and more.

Oxygen is a program for working with XML files. In language technology, XML-based text formats are commonly used, and researchers and research engineers sometimes work directly in XML files. Oxygen makes it much easier to handle XML formats compared with regular text editors.

File formats

The variety of software also brings with it a variety of file formats. Each format places different demands on those who curate and preserve the files, and the level of complexity can vary greatly. Text-based file formats can often be preserved largely as they are, perhaps with a small amount of metadata explaining how they are structured, which abbreviations are used, and so on. For text files that conform to an open and well-documented standard, even this may not be necessary.

More complex formats, such as database formats and GIS formats, present different challenges. Databases are very practical for those with access to the software needed to use them, but without such software they can be difficult to work with. Moreover, database software is frequently updated, and older versions are not always backwards-compatible. It may therefore be worthwhile to consider exporting the database to a text-based format for safety reasons. If this is done, a database schema should also be created to show how the database can be reconstructed from the text files if required.
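
A minimal sketch of such an export, again using Python's sqlite3 module as a stand-in for the database system (the database file name is hypothetical):

    import csv
    import sqlite3

    con = sqlite3.connect("research.db")  # hypothetical database file

    # Save the schema (the CREATE statements) as plain text ...
    with open("schema.sql", "w", encoding="utf-8") as f:
        for (stmt,) in con.execute(
                "SELECT sql FROM sqlite_master WHERE sql IS NOT NULL"):
            f.write(stmt + ";\n")

    # ... and dump each table to a TSV file.
    tables = [t for (t,) in con.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        cur = con.execute(f"SELECT * FROM {table}")
        with open(f"{table}.tsv", "w", newline="", encoding="utf-8") as f:
            writer = csv.writer(f, delimiter="\t")
            writer.writerow([col[0] for col in cur.description])
            writer.writerows(cur)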


Text document

Preferred formats:
  • ASCII (.txt)
  • MS Word (.docx)
  • OpenDocument Text (.odt)
  • PDF/A (.pdf)
  • Unicode (.txt)

Accepted formats:
  • MS Word (.doc)
  • PDF (.pdf)
  • Rich Text Format (.rtf)

Markup language

Preferred formats:
  • HTML (.html)
  • JSON (.json)
  • XML (.xml)

Accepted formats:
  • SGML (.sgml)
  • Markdown (.md)