
This notebook shows how to discover the significant words in a corpus. The method used for finding significant terms is TF-IDF (term frequency–inverse document frequency). The following processes are described:

  • An educational overview of TF-IDF, including how it is calculated
  • Using the tdm_client to retrieve a dataset
  • Filtering based on a pre-processed ID list
  • Filtering based on a stop words list
  • Cleaning the tokens in the dataset
  • Creating a gensim dictionary
  • Creating a gensim bag of words corpus
  • Computing the most significant words in your corpus using gensim's implementation of TF-IDF
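Before working through the notebook, it may help to see how TF-IDF is calculated. The sketch below uses the classic formulation (relative term frequency times the natural log of inverse document frequency) on a hypothetical toy corpus; gensim's implementation differs in detail (for example, log base and vector normalization), but the intuition is the same:

```python
import math

# Toy corpus: each document is a list of tokens (hypothetical example data)
documents = [
    ["the", "quick", "brown", "fox"],
    ["the", "lazy", "dog"],
    ["the", "brown", "dog", "sleeps"],
]

def tf_idf(term, doc, docs):
    """Classic TF-IDF: relative term frequency times log inverse document frequency."""
    tf = doc.count(term) / len(doc)             # how often the term appears in this doc
    df = sum(1 for d in docs if term in d)      # how many docs contain the term
    idf = math.log(len(docs) / df)              # rarer terms get a larger boost
    return tf * idf

# "the" appears in every document, so its IDF (and therefore TF-IDF) is zero
print(tf_idf("the", documents[0], documents))   # 0.0

# "fox" appears in only one document, so it scores much higher there
print(tf_idf("fox", documents[0], documents))
```

Common words that appear in every document score zero, which is exactly why TF-IDF surfaces distinctive terms rather than frequent ones.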

Use Case: For Learners (Detailed explanation, not ideal for researchers)

Difficulty: Intermediate

Completion time: 60 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: JSON Lines (.jsonl)

Libraries Used:

  • pandas to load a preprocessing list
  • csv to load a custom stopwords list
  • gensim to perform the TF-IDF calculations
  • NLTK to create a stopwords list (if no list is supplied)
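As a preview of how the csv and NLTK pieces fit together, here is a minimal sketch of loading a custom stopwords list and filtering tokens with it. The filename `stop_words.csv` and the sample rows are assumptions for illustration; the file is written here only so the example is self-contained:

```python
import csv

# Create a small stand-in stopwords file (in practice this would come from
# the "Creating a Stopwords List" notebook; the filename is an assumption)
with open("stop_words.csv", "w", newline="") as f:
    csv.writer(f).writerows([["the"], ["and"], ["of"]])

# Load the custom stopwords list, one word per row
with open("stop_words.csv", newline="") as f:
    stop_words = {row[0] for row in csv.reader(f) if row}

# If no custom list is supplied, NLTK's English stopwords can be used instead
# (requires the "stopwords" corpus to be downloaded):
#   from nltk.corpus import stopwords
#   stop_words = set(stopwords.words("english"))

# Filter stopwords out of a tokenized document
tokens = ["the", "brown", "fox", "and", "the", "dog"]
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
```

Removing stopwords before computing TF-IDF keeps high-frequency function words from crowding out meaningful terms.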

Research Pipeline:

  1. Build a dataset
  2. Create a "Pre-Processing CSV" with Exploring Metadata (Optional)
  3. Create a "Custom Stopwords List" with Creating a Stopwords List (Optional)
  4. Complete the TF-IDF analysis with this notebook