Brandeis Design and Innovation

Cleaning Text Data

Unless you’re using a tool like Constellate or Scopus, most text analysis tools will require you to upload your own dataset to analyze. Before any analysis can take place, you’ll need to clean (i.e. prepare) your dataset. Before you can clean, you’ll need to have some idea about your analytical methodology (e.g., what tools you will use). In other words, text analysis is an iterative process, so it’s a good idea to do some research before diving in.

Preparing a dataset usually involves:

  • Correcting any errors or inconsistencies in spelling that could skew results
  • Using a file format your tool accepts (e.g. xlsx or pdf). Accepted formats depend on the software you are using, so verify them first. It’s probably going to be a tabular format.
  • Creating rows/columns so that the tool can identify and compare different information
  • Transcribing scanned documents. Text analysis software generally cannot read handwriting or images of text.

FYI: Optical character recognition (OCR) is a machine learning technique that lets computers recognize text in images, including some handwriting, and transcribe it for you. This is an advanced-level technique that requires coding experience. If you’re interested, Brandeis does offer Image Recognition workshops.

“Cleaning” means prepping the dataset: removing extraneous rows, columns, or information; deleting problematic icons and symbols; reformatting dates and times; and adding column headings. If you need help, refer to this guide: Transforming Humanities and Social Sciences Data into a Spreadsheet.
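To make those steps concrete, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption for illustration: the two sample rows, the extra empty column, the Date column, and the helper names `clean_text`, `clean_date`, and `clean_rows` are all hypothetical, and your own data will almost certainly need different rules.

```python
import csv
import io
import re
from datetime import datetime

# Hypothetical raw export: no header row, an empty extra column,
# stray symbols, and dates written in two different formats.
RAW = """1,Climate change is REAL ★★,,03/14/2020
2,so tired of this heat… #climatechange,,2020-07-02
"""

def clean_text(text):
    """Drop characters that are not letters, digits, whitespace, or basic punctuation."""
    text = re.sub(r"[^\w\s.,!?'#@-]", "", text)
    return re.sub(r"\s+", " ", text).strip()

def clean_date(value):
    """Rewrite a date as YYYY-MM-DD, trying each format we expect to see."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(value, fmt).date().isoformat()
        except ValueError:
            continue
    return value  # leave unrecognized dates untouched for manual review

def clean_rows(raw):
    rows = [["ID", "Tweet", "Date"]]  # add column headings
    for record in csv.reader(io.StringIO(raw)):
        tweet_id, tweet, _unused, date = record  # drop the extraneous column
        rows.append([tweet_id, clean_text(tweet), clean_date(date)])
    return rows

cleaned = clean_rows(RAW)
for row in cleaned:
    print(row)
```

The character filter here is deliberately blunt. If symbols such as emoji matter to your research question, you would want to keep them instead of deleting them.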

Not sure where to start? Try uploading your dataset to whatever text analysis tool you chose and look at the results. Your tool might “analyze” text you’re not interested in (e.g. page numbers or words like “chapter”). If it does, that tells you what to remove before you upload again.
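If page numbers and chapter headings do show up in your results, one way to strip them is sketched below in Python. This is only an illustration under stated assumptions: the sample text and the function name `strip_noise` are made up, and the sketch assumes the unwanted text sits on lines of its own (a line that is just a number, “Page 13”, or “Chapter 2”).

```python
import re

# Hypothetical text copied out of a scanned book.
SAMPLE = """CHAPTER 1
The sea was calm that morning.
12
Chapter 2
By noon the wind had changed.
Page 13
"""

def strip_noise(text):
    """Remove lines that contain only a page number or a chapter heading."""
    kept = []
    for line in text.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"(page\s+)?\d+", stripped, flags=re.IGNORECASE):
            continue  # a line that is only a page number
        if re.fullmatch(r"chapter\s+\w+", stripped, flags=re.IGNORECASE):
            continue  # a chapter heading such as "Chapter 2"
        if stripped:
            kept.append(stripped)
    return "\n".join(kept)

print(strip_noise(SAMPLE))
```

Because the patterns only match whole lines, a sentence that happens to mention a chapter or a number is left alone.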

If you intend to compare texts, you’ll need to strategize how to distinguish them. Should you use one Excel file in which each row is a different text? Or should each text be saved in a separate file? It all depends on what tool you choose, and you might need to experiment a bit.

Consider this example scenario:

I searched for tweets about “climate change” that were published from Waltham, MA in 2020. I found 100 tweets. I wanted to know: how many of these tweets were using negative-sounding language? 
I created an Excel file. There were two columns: Tweet and ID. Every row recorded a tweet and assigned a unique ID number.

I decided to use the platform Voyant. When I uploaded my spreadsheet, I discovered a problem: Voyant doesn’t let you compare between authors. Even though my spreadsheet keeps each author’s tweet in its own row, Voyant just takes all of the text in the Excel file and lumps it together.

If I want Voyant to identify and compare across different authors, I would need to save every tweet (i.e. every row) in a separate Excel file and upload them all. That’s pretty clunky; the platform doesn’t make it easy to read the results.
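Splitting a spreadsheet by hand is tedious, so a short script can help. The sketch below, in Python, rests on several assumptions: that the spreadsheet has been exported as CSV with the ID and Tweet columns described above, that the sample rows are invented, and that it writes plain-text files instead of separate Excel files (so check that your tool accepts plain text before relying on it). The function name `split_rows` and the `tweet_<ID>.txt` naming scheme are hypothetical.

```python
import csv
import io
import tempfile
from pathlib import Path

# Hypothetical CSV export of the spreadsheet (columns: ID, Tweet).
SHEET = """ID,Tweet
1,Climate change is real
2,Another hot summer in Waltham
"""

def split_rows(csv_text, out_dir):
    """Write each row's tweet to its own text file, named after the row's ID."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        path = out_dir / f"tweet_{row['ID']}.txt"
        path.write_text(row["Tweet"], encoding="utf-8")
        written.append(path)
    return written

# Demonstrate in a temporary folder so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    files = split_rows(SHEET, tmp)
    names = [p.name for p in files]
    contents = [p.read_text(encoding="utf-8") for p in files]

print(names)
```

For a real dataset you would read your own exported file and pass a permanent folder instead of a temporary one.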

So, I tried a different tool: Orange. Orange supports comparing trends between texts.

I know my analysis will use R, Stata, or Atlas.ti

If you know you’re going to be working with one of these programs, please reach out to the Data Services Team.

Learn how to choose an analytical tool/interface for your text dataset.