CSV vs. JSON Lines Files
The dataset builder creates two files:
- A CSV file containing only metadata
- A JSON Lines file containing metadata and the textual data
The textual data includes:
- unigram counts
- bigram counts
- trigram counts
- full-text (when available)
The metadata may include:
Column Name | Description |
---|---|
id | a unique item ID (In JSTOR, this is a stable URL) |
title | the title for the document |
subTitle | the subtitle for the document |
docType | the type of document (for example, article or book) |
publicationYear | the year of publication |
provider | the source or provider of the dataset |
collection | collection information as identified by the source |
doi | the digital object identifier |
datePublished | the publication date in yyyy-mm-dd format |
url | a URL for the item and/or the item's metadata |
creator | the author or authors of the item |
pageStart | the first page number of the print version |
pageEnd | the last page number of the print version |
pageCount | the number of print pages in the item |
wordCount | the number of words in the item |
pagination | the page sequence in the print version |
language | the language or languages of the item (eng is the ISO 639 code for English) |
publisher | the publisher for the item |
placeOfPublication | the city of the publisher |
abstract | the abstract description for the document |
isPartOf | the larger work that holds this title (for example, a journal title) |
hasPartTitle | the title of sub-items |
identifier | the set of identifiers connected with the document (doi, issn, isbn, oclc, etc.) |
tdmCategory | the inferred category of the content based on machine learning |
sourceCategory | the category according to the provider |
sequence | the article or chapter sequence |
issueNumber | the issue number for a journal publication |
volumeNumber | the volume number for a journal publication |
outputFormat | what data is available (unigrams, bigrams, trigrams, and/or full-text) |
For more detail, see the current version of the schema.
All of the textual data and metadata are available in the JSON Lines files, but we have chosen to offer the metadata CSV for two primary reasons:
- The JSON Lines data is a little more complex to parse since it is nested. It cannot be easily represented in tabular form in a tool like Pandas or Excel.
- The JSON Lines data can be very large. Each file contains all of the metadata plus unigram counts, bigram counts, trigram counts, and full-text (when available). Manipulating all that data takes significant computational resources. Even a modest dataset (~5000 files) can be over 1 GB in size uncompressed.
We are still refining the structure of the dataset file. We anticipate adding additional “features” (such as named entity recognition) in the future. Please reach out to Ted Lawless Ted.Lawless@ithaka.org if you have comments or suggestions.
Data Structure
CSV File
The CSV file is a comma-delimited, tabular structure that can easily be viewed in Excel or Pandas.
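For example, here is a minimal sketch of loading the metadata CSV with pandas; the filename is hypothetical, and the column names come from the table above:

```python
import pandas as pd

# Load the metadata CSV into a DataFrame for tabular exploration.
# "my_dataset.csv" is a hypothetical filename; substitute your own.
metadata = pd.read_csv("my_dataset.csv")

# Peek at a few of the columns described in the table above.
print(metadata[["id", "title", "publicationYear"]].head())
```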
JSON Lines File
The JSON Lines file (file extension ".jsonl") is served in a compressed gzip format (.gz). The data for each document in the corpus is written on a single line. (If there are 1,245 documents in the corpus, the JSON Lines file will be 1,245 lines long.) Each line is a single JSON object made up of key/value pairs that map a key concept to a matching value.
The basic structure looks like:
"Key": Value
Instead of attempting to decode the structure of a single large line, we can paste a single line into a JSON editor. The screenshot below was created using JSON Editor Online. The JSON editor reveals the file structure by breaking it down into a set of nested hierarchies, similar to XML. These hierarchies can also be collapsed using arrows in a separate viewer pane within JSON Editor Online.
A single line from a JSON Lines dataset expressed as a nested hierarchy using JSON Editor Online
The editor makes it easier for human readers to discern a portion of the metadata for the text. In the data above, we can see:
- The title is "Shakespeare and the Middling Sort" ("title": "Shakespeare and the Middling Sort")
- The author is "Theodore B. Leinwand" ("creators": ["Theodore B. Leinwand"])
- The text is a journal article ("docType": "article")
- The journal is Shakespeare Quarterly ("isPartOf": "Shakespeare Quarterly")
- Identifiers such as ISSN, OCLC, and DOI
- The pageCount and wordCount values
If you examine the rest of the file, you'll discover additional metadata such as the publication date, DOI, page numbers, ISSN, and more.
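Assuming the JSON keys match the metadata fields listed above, the same values can also be read programmatically. A sketch with a hypothetical filename:

```python
import gzip
import json

# Parse the first document from the dataset; the filename is hypothetical.
with gzip.open("my_dataset.jsonl.gz", "rt", encoding="utf-8") as f:
    document = json.loads(next(f))

# Pretty-print the full object to see the nested structure in the terminal.
print(json.dumps(document, indent=2))

# Pull out the individual metadata fields named in the list above.
print(document["title"])      # e.g. "Shakespeare and the Middling Sort"
print(document["creators"])   # e.g. ["Theodore B. Leinwand"]
print(document["isPartOf"])   # e.g. "Shakespeare Quarterly"
print(document["pageCount"], document["wordCount"])
```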
The most significant data for text analysis is usually the "unigramCount" section where the frequency of each word is recorded. In this context, the word "unigram" describes a single word construction like the word "chicken." There are also bigrams (e.g. "chicken stock"), trigrams ("homemade chicken stock"), and n-grams of any length. Depending on the licensing for the content, there may also be full-text available.
On each line, a key on the left is matched to a value representing its frequency on the right
The texts have been minimally pre-processed, so casing will affect n-gram counts. Each word is treated as a string, and since JavaScript and Python strings are case-sensitive, "Tiger" is considered a different word than "tiger". Counting all the occurrences of the word "tiger" would therefore require combining the counts of both strings; these methods are covered in the notebooks.
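As a minimal sketch, assuming the unigram counts live under the "unigramCount" key and using a hypothetical filename:

```python
import gzip
import json

# Parse one document (see the reading sketch above; filename hypothetical).
with gzip.open("my_dataset.jsonl.gz", "rt", encoding="utf-8") as f:
    document = json.loads(next(f))

# Sum the counts of every case variant of "tiger" in this document.
unigrams = document.get("unigramCount", {})
tiger_total = sum(
    count for word, count in unigrams.items()
    if word.lower() == "tiger"
)
print(tiger_total)  # "tiger", "Tiger", "TIGER", etc., combined
```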
How does the dataset format compare with JSTOR Data for Research (DfR)?
While the content delivered to users is very similar, the format of the dataset differs.
The biggest impact on users already comfortable with DfR datasets is the change in dataset delivery. DfR delivers datasets to end users in a ZIP file with the following structure:
- metadata
- ngram1
- ngram2
- ngram3
Each of those directories contains one file per article or book chapter in the dataset. For example, the metadata directory contains one XML file for each document in the dataset, and the ngram1 directory contains one CSV file for each document, where each row holds one of the words from the document in the first column and the number of times it occurred in the second column.
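As a hedged illustration, one of those per-document ngram1 files could be loaded like this; the path is hypothetical, and we assume the file has no header row:

```python
import pandas as pd

# A DfR ngram1 file is a per-document CSV: a word in the first column,
# its count in the second. "ngram1/12345.csv" is a hypothetical path.
counts = pd.read_csv("ngram1/12345.csv", header=None, names=["word", "count"])
print(counts.head())
```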
In addition to the new JSON format described above, this platform does not do any preemptive cleanup on the data it delivers, whereas DfR removes stopwords and lowercases all the words in the dataset. Our philosophy is that researchers will know best how to clean up their own data. In addition, our focus is on learning and teaching, and most data in the world is wild, woolly, and dirty, so it is best if users learn to do their own cleanup.
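A researcher who does want DfR-style normalization can apply it themselves. A minimal sketch, using a hypothetical filename and a toy stopword list (real projects usually use a curated one):

```python
import gzip
import json
from collections import Counter

# Parse one document (see the reading sketch above; filename hypothetical).
with gzip.open("my_dataset.jsonl.gz", "rt", encoding="utf-8") as f:
    document = json.loads(next(f))

# Illustrative stopword list only.
stopwords = {"the", "and", "of", "a", "in"}

# Lowercase every unigram and drop stopwords, merging counts as we go.
cleaned = Counter()
for word, count in document.get("unigramCount", {}).items():
    word = word.lower()
    if word not in stopwords:
        cleaned[word] += count

print(cleaned.most_common(10))
```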
How does the dataset format compare with HathiTrust?
They are very similar, and that's not an accident. We worked closely with HathiTrust to develop our data format. Ultimately, we decided to expand on the HathiTrust Extracted Features format to support key features our users sought (for example, the ability to analyze texts at the level of individual journal articles instead of at the issue-level). We are excited by the possibility of including content from the HathiTrust Digital Library in the future.