Brandeis Design and Innovation

Scraping Text Data

What is scraping?

Scraping is an automated process for grabbing text and other data from websites. You can write code (often in the Python programming language) that commands your computer to scroll through a website and extract certain information – for example, scrolling through Twitter and copying Tweets about “climate change” into an Excel spreadsheet, along with pertinent information like publication date, author name, and author location.
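To make the idea concrete, here is a minimal sketch of a scraper written in Python with the requests and BeautifulSoup libraries. The URL and CSS selectors are hypothetical placeholders – every website structures its HTML differently, so you would adjust them to the site you’re studying.

# Minimal scraper sketch. The URL and CSS selectors below are
# hypothetical placeholders; inspect your target site's HTML and adjust.
import csv
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/reviews"  # hypothetical page of reviews

response = requests.get(URL, timeout=10)
response.raise_for_status()  # stop early if the page didn't load
soup = BeautifulSoup(response.text, "html.parser")

rows = []
for item in soup.select("div.review"):  # hypothetical selector
    rows.append({
        "author": item.select_one(".author").get_text(strip=True),
        "date": item.select_one(".date").get_text(strip=True),
        "text": item.select_one(".body").get_text(strip=True),
    })

# Save the extracted records in a spreadsheet-friendly CSV file.
with open("reviews.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["author", "date", "text"])
    writer.writeheader()
    writer.writerows(rows)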

Learning Python takes time, and building a customized scraper is not easy. These are great skills to have, and you can learn them for free at Brandeis. Weigh the various options against your project timeframe and long-term goals. If you decide that Python is not for you, you have some alternatives:

  1. Copy + Paste aka Old Reliable

Depending on how many Tweets/reviews you need, consider manually copying and pasting the data into an Excel spreadsheet. 

  2. ChatGPT

ChatGPT is a useful programming shortcut (assuming you’re being responsible and this resource falls within the parameters of your project). There’s definitely a learning curve, but you might be able to ask ChatGPT to write scraping code for you.

  3. Streaming Tools

Certain online platforms pull directly from Twitter and create visualizations of the results.

  4. TweetSets

TweetSets is a great option for beginners interested in using Twitter data, and it legally shares Tweets via their IDs. The website is managed by George Washington University Libraries, and no coding experience is necessary. Because these are pre-made datasets, you will be limited to Tweets on certain subjects and from certain dates.

Choose a topic of interest (e.g. “climate change”, “2016 election”, etc.) and, from there, filter the results by particular keywords, dates, etc. To adhere to Twitter’s policy on Content Redistribution to Third Parties, TweetSets provides you with a file of Tweet IDs rather than the Tweets themselves. Using DocNow’s Tweet Hydrator, you can “hydrate” these IDs – that is, use them to retrieve the full Tweet records from Twitter. The resulting file lists usernames, Tweet contents, hashtags, dates, etc.
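Hydrator tools typically save the recovered Tweets as JSON, one record per line. As a rough sketch of the next step – assuming field names from Twitter’s classic v1.1 payload, which you should check against your hydrator’s actual output – here is how you might flatten that output into a spreadsheet-friendly CSV:

# Sketch: flatten hydrated Tweets (JSONL, one JSON object per line)
# into a CSV. Field names assume Twitter's classic v1.1 payload;
# adjust them to whatever your hydrator actually emits.
import csv
import json

with open("hydrated_tweets.jsonl", encoding="utf-8") as src, \
     open("tweets.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    writer.writerow(["username", "date", "text", "hashtags"])
    for line in src:
        tweet = json.loads(line)
        hashtags = " ".join(h["text"] for h in tweet["entities"]["hashtags"])
        writer.writerow([
            tweet["user"]["screen_name"],
            tweet["created_at"],
            tweet.get("full_text", tweet.get("text", "")),
            hashtags,
        ])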

Here is an excellent tutorial created by the Programming Historian.

Building your own scraper

First, you’ll need to know how to code; Python is the most common language for scrapers. Brandeis offers free training options, including asynchronous tutorials, synchronous workshops, and full courses. Contact Ford Fishman for more information.

In terms of building the actual scraper, GitHub is a great resource for finding open-source code written by other programmers. Check out this example created by programmers at Microsoft.
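Most open-source scrapers you’ll find follow a similar skeleton: loop over pages of results, pause between requests, and identify yourself with a User-Agent header. The sketch below illustrates that pattern – the URL scheme and selector are hypothetical, so treat it as a template to recognize rather than a finished tool.

# Sketch of the loop structure common to many scrapers: page through
# results politely, with a delay and a descriptive User-Agent.
# The URL pattern and CSS selector are hypothetical.
import time
import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "research-scraper (contact: you@brandeis.edu)"}
texts = []

for page in range(1, 6):  # first five pages; pagination scheme varies by site
    url = f"https://example.com/posts?page={page}"
    response = requests.get(url, headers=HEADERS, timeout=10)
    if response.status_code != 200:
        break  # stop rather than hammer a failing server
    soup = BeautifulSoup(response.text, "html.parser")
    texts.extend(p.get_text(strip=True) for p in soup.select("p.post-text"))
    time.sleep(2)  # pause between requests to respect the site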

Scraping legally and ethically

It’s your responsibility to research and understand website policies. You are currently reading a non-affiliated informational resource for academics and students interested in Digital Scholarship. Protect yourself and your project – do your research and reach out to whatever website or company you are working with.

As of Spring 2023, you can use Twitter data for text analysis and share your results online or in a written publication (e.g. write about, or create a graphic describing, different trends).

You cannot distribute scraped Tweets. In other words, don’t post the Excel database of Tweets on a research website or similar platform for others to see and download. This rule exists to protect user privacy. Read Twitter’s policy on Content Redistribution to Third Parties to fully understand these rules, as well as potential workarounds if you’re an academic researcher or educator.

You can distribute a dataset that only lists Tweet IDs. Tweet IDs are unique identifiers tied to usernames, Tweets, and direct messages. With a list of Tweet IDs and a hydrator tool (e.g. DocNow’s Tweet Hydrator), another person can reproduce your dataset. This is legal and a good option for individuals who want their raw data (i.e. their Excel database) accessible for others to review and reuse.
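Producing a shareable ID list (sometimes called “dehydrating” your dataset) can be as simple as exporting the ID column from your spreadsheet. A minimal sketch, assuming your CSV has a column named tweet_id (match the name to your own file):

# Sketch: "dehydrate" a dataset by extracting only the Tweet ID column
# so the ID list can be shared publicly. The column name "tweet_id"
# is an assumption; match it to your own spreadsheet.
import csv

with open("tweets.csv", encoding="utf-8") as src, \
     open("tweet_ids.txt", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        dst.write(row["tweet_id"] + "\n")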

When you rely on Tweet ID datasets posted online, your options are limited: your specific key terms or timeframe may not have been captured by TweetSets or similar resources. Webscrapers are tools (most often built in Python) that automatically search a website for your key term(s) and store the results in a format you specify. They can accommodate complex search strings – integrating hashtags, excluding certain phrases, or isolating Tweets from specific regions or within a particular timeframe.
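For illustration, here is what composing such a search string might look like in Python. The operators mirror Twitter’s advanced-search syntax (exact phrase, hashtag, exclusion, date range); verify them against whichever scraping tool you adopt, since each tool supports a slightly different set.

# Sketch: composing a complex search string. The operators follow
# Twitter's advanced-search syntax; confirm support in your tool.
query = " ".join([
    '"climate change"',   # exact phrase
    "#IPCC",              # must include this hashtag
    "-#sarcasm",          # exclude Tweets tagged #sarcasm
    "since:2022-01-01",   # start of timeframe
    "until:2022-12-31",   # end of timeframe
])
print(query)
# "climate change" #IPCC -#sarcasm since:2022-01-01 until:2022-12-31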