
This notebook shows how to find the most common words in a
dataset. It covers the following steps:

  • Using the tdm_client to create a Pandas DataFrame
  • Filtering based on a pre-processed ID list
  • Filtering based on a stop words list
  • Using a Counter() object to get the most common words

Difficulty: Intermediate

Completion time: 60 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: JSON Lines (.jsonl)
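
For readers new to the format: a JSON Lines file stores one JSON object per line. The sketch below reads such data with the standard library only. The field names (`id`, `wordCount`) and the in-memory sample are illustrative assumptions, not the dataset's guaranteed schema, and in this lesson tdm_client handles the actual download and reading.

```python
import io
import json

# Hypothetical two-record sample standing in for a .jsonl file on disk.
sample = io.StringIO(
    '{"id": "doc-1", "wordCount": 6}\n'
    '{"id": "doc-2", "wordCount": 7}\n'
)


def read_jsonl(handle):
    """Parse one JSON object per non-empty line."""
    records = []
    for line in handle:
        line = line.strip()
        if line:  # skip blank lines
            records.append(json.loads(line))
    return records


records = read_jsonl(sample)
print(len(records), "documents read")
```

With a real file, the same function can be called on `open(path, encoding="utf-8")`, where `path` is wherever the dataset was saved.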

Libraries Used:

  • tdm_client to collect, unzip, and read our dataset
  • NLTK to help clean up our dataset
  • Counter from collections to help sum up our word frequencies

Research Pipeline:

  1. Build a dataset
  2. Create a "Pre-Processing CSV" with Exploring Metadata (Optional)
  3. Create a "Custom Stopwords List" with Creating a Stopwords List (Optional)
  4. Complete the word frequencies analysis with this notebook
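
As a rough preview of the analysis in step 4, the sketch below applies an ID filter and a stop words filter, then tallies the remaining words with a Counter() object. Everything in it is a stand-in: the three small documents, the `unigramCount` field name, the ID list, and the stop words are assumptions made for illustration. In the notebook itself the ID list would come from the pre-processing CSV and the stop words from NLTK or a custom list.

```python
from collections import Counter

# Hypothetical documents, each with an ID and per-word counts.
documents = [
    {"id": "doc-1", "unigramCount": {"The": 3, "whale": 2, "sea": 1}},
    {"id": "doc-2", "unigramCount": {"the": 4, "Whale": 1, "ship": 2}},
    {"id": "doc-3", "unigramCount": {"and": 5, "harbor": 1}},
]

# Assumed inputs: IDs kept after pre-processing, and a tiny stop words list.
keep_ids = {"doc-1", "doc-2"}
stop_words = {"the", "and"}


def count_words(docs, keep_ids, stop_words):
    """Sum word counts across the documents whose IDs are in keep_ids."""
    totals = Counter()
    for doc in docs:
        if doc["id"] not in keep_ids:
            continue  # dropped by the ID filter
        for token, count in doc["unigramCount"].items():
            word = token.lower()  # merge "Whale" and "whale"
            if not word.isalpha() or word in stop_words:
                continue  # skip punctuation, numbers, and stop words
            totals[word] += count
    return totals


word_frequency = count_words(documents, keep_ids, stop_words)
for word, count in word_frequency.most_common(3):
    print(f"{word}: {count}")
```

Lowercasing before counting is a design choice: it merges capitalized and lowercase forms of a word, at the cost of losing the distinction for proper nouns.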