
This notebook takes as input:

  • Plain text files (.txt) in a zipped folder called 'texts' in the data folder
  • Metadata CSV file called 'metadata.csv' in the data folder (optional)

and outputs a single JSONL file containing the unigram, bigram, and trigram counts, the full text, and the metadata for each document.
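The input step above can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the helper name `read_zipped_texts` is hypothetical, and a small in-memory zip stands in for the 'texts' archive in the data folder.

```python
import io
import zipfile

# Hypothetical helper: read every .txt file inside a zipped folder
# into a dict mapping filename -> document text.
def read_zipped_texts(zip_bytes):
    docs = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if name.endswith(".txt"):
                docs[name] = zf.read(name).decode("utf-8")
    return docs

# Build a tiny in-memory zip to demonstrate; in the notebook this would
# be the 'texts' archive in the data folder.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("texts/doc1.txt", "A small sample document.")

docs = read_zipped_texts(buf.getvalue())
print(docs["texts/doc1.txt"])
```

The optional metadata.csv can then be loaded with `pandas.read_csv` and joined to these documents by filename.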

It allows researchers to create a dataset compatible with other notebooks on this platform. Note that the NLTK tokenization method in this notebook differs slightly from how documents are tokenized in the Constellate dataset builder. If you want to combine your output with an existing dataset from the builder, use the Tokenizing Text Files notebook instead.
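To show what the unigram/bigram/trigram step produces, here is a minimal sketch. The notebook itself uses NLTK's tokenizer; a plain lowercase-and-split tokenizer stands in here so the example needs no NLTK data downloads, which is exactly the kind of small difference the note above warns about.

```python
# Simple stand-in tokenizer; the notebook uses NLTK's word_tokenize,
# which handles punctuation and contractions differently.
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the quick brown fox".lower().split()
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
print(bigrams)  # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
```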

Use Case: For Researchers (Mostly code without explanation, not ideal for learners)

Difficulty: Advanced

Completion time: 10-15 minutes

Knowledge Required:

Knowledge Recommended:

Data Format: .txt, .csv, .jsonl

Libraries Used:

  • os
  • json
  • gzip
  • collections
  • nltk (including nltk.corpus)
  • pandas
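The gzip and json libraries above handle the output format: one JSON object per line in a compressed JSONL file. A minimal round-trip sketch, in which the record's field names and the output path are assumptions for illustration, not the notebook's exact schema:

```python
import gzip
import json
import os
import tempfile

# Hypothetical record for one document; field names are assumptions.
record = {
    "id": "doc1.txt",
    "fullText": "A small sample document.",
    "unigramCount": {"a": 1, "small": 1, "sample": 1, "document": 1},
}

# Write one JSON object per line to a gzip-compressed JSONL file.
path = os.path.join(tempfile.gettempdir(), "my_data.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(json.dumps(record) + "\n")

# Read it back to confirm the round trip.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["id"])
```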

Research Pipeline:

  1. Scan documents
  2. OCR files
  3. Clean up texts
  4. Tokenize text files (this notebook)