JSTOR

JSTOR is a digital library for the intellectually curious. We help everyone discover, share, and connect valuable ideas.

Content Type: journal articles, book chapters, research reports
Size: over 14 million documents
Document Publication Date Distribution:

Distribution of JSTOR Content in Constellate by Year from 1700-Current

Metadata Quality: High
Text Accuracy: High
Download Availability:

Collection                            Metadata  Unigrams  Bigrams  Trigrams  Full-Text
Early Journal Content (EJC)           yes       yes       yes      yes       yes
Archive Collections (outside of EJC)  yes       yes       yes      yes       no
Open Access Books                     yes       yes       yes      yes       yes
Research Reports                      yes       yes       yes      yes       yes
SAOA                                  yes       yes       yes      yes       no

Important Considerations: Most of JSTOR’s journal titles are subject to the “moving wall,” an agreement with each journal’s publisher that sets how many years JSTOR’s coverage lags behind the current year. For example, in 2020, a journal with a 5-year moving wall has content in JSTOR only through 2015.
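The moving-wall arithmetic can be sketched in a few lines. This is only an illustration of the simplified rule described above; the function name `latest_available_year` is hypothetical and is not part of any JSTOR or Constellate API.

```python
def latest_available_year(current_year: int, moving_wall_years: int) -> int:
    """Return the most recent publication year covered under a moving wall.

    Assumes the simplified rule from the text: coverage lags the current
    year by the length of the moving wall.
    """
    if moving_wall_years < 0:
        raise ValueError("moving_wall_years must be non-negative")
    return current_year - moving_wall_years


# The example from the text: in 2020, a 5-year moving wall gives coverage through 2015.
print(latest_available_year(2020, 5))
```

When building a dataset, a check like this may help explain why recent years appear sparse for a given journal.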

Portico

Portico works with libraries and publishers to preserve scholarly content.

Content Type: journal articles, book chapters, full books
Size: over 15 million documents
Document Publication Date Distribution:

Distribution of Portico Content in Constellate by Year from 1700-Current

Metadata Quality: Variable
Text Accuracy: High
Download Availability:

Content   Metadata  Unigrams  Bigrams  Trigrams  Full-Text
Journals  yes       yes       yes      yes       no
Books     yes       yes       yes      yes       no

Important Considerations:

  • Full text from Portico is not currently available for download.
  • Some Portico books are available for text analysis as chapters; others are only available as full books.
  • Only specific Portico publishers participate.
  • Portico’s content is not affected by a moving wall. In general, Portico preserves content within a month or two of publication.

Chronicling America

Chronicling America provides historic newspaper pages from 1789 to 1963.

Content Type: newspaper issues
Size: over 2 million documents
Document Publication Date Distribution:

Distribution of Chronicling America Content in Constellate by Year from 1700-Current

Metadata Quality: High
Text Accuracy: Variable
Download Availability:

Content     Metadata  Unigrams  Bigrams  Trigrams  Full-Text
Newspapers  yes       yes       yes      yes       yes

Important Considerations:

  • The OCR quality in Chronicling America is highly variable, and that variability is not necessarily tied to age (e.g., we have identified very old issues with good OCR and more recent issues with poor OCR).
  • This is a freely available collection. If you want all of the content locally in the format we deliver, you can download it from Constellate in batches of up to 25,000 documents (or 50,000 if you are at a participating institution). Reach out to us if you hit a limit so we can discuss how best to meet your needs. You can also download all of it in its original format directly from the Library of Congress.
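To get a rough sense of what downloading the whole collection in batches involves, the sketch below estimates how many batches are needed. It assumes a round figure of 2,000,000 documents (the collection is described only as "over 2 million") and that the 25,000 and 50,000 limits count documents; the function names are hypothetical and are not part of any Constellate API.

```python
import math


def batches_needed(total_documents: int, batch_limit: int = 25_000) -> int:
    """Number of downloads needed to cover total_documents at batch_limit each."""
    if batch_limit <= 0:
        raise ValueError("batch_limit must be positive")
    return math.ceil(total_documents / batch_limit)


def batch_ranges(total_documents: int, batch_limit: int = 25_000):
    """Half-open (start, end) index ranges, one per batch."""
    return [
        (start, min(start + batch_limit, total_documents))
        for start in range(0, total_documents, batch_limit)
    ]


# Assumed round figure; the real collection size is larger.
ASSUMED_TOTAL = 2_000_000

print(batches_needed(ASSUMED_TOTAL))          # standard limit
print(batches_needed(ASSUMED_TOTAL, 50_000))  # participating-institution limit
```

Under these assumptions the full collection takes dozens of separate downloads, which is why contacting the team or using the Library of Congress source may be more practical for a complete local copy.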

DocSouth

Documenting the American South (DocSouth) is a digital publishing initiative that provides Internet access to texts, images, and audio files related to Southern history, literature, and culture.

Content Type: documents, books
Size: ~600 documents
Document Publication Date Distribution:

Distribution of DocSouth Content in Constellate by Year from 1700-Current (note: this is the date of production, not the date of publication)
Metadata Quality: High
Text Accuracy: High
Download Availability:

Content    Metadata  Unigrams  Bigrams  Trigrams  Full-Text
Documents  yes       yes       yes      yes       yes
Books      yes       yes       yes      yes       yes

Important Considerations:

  • DocSouth has made four collections available for text mining: The Church in the Southern Black Community, First-Person Narratives of the American South, Library of Southern Literature, and North American Slave Narratives.
  • This is a freely available collection. If you want all of the content locally in the format we deliver, you can download it from this platform in batches of up to 25,000 documents (or 50,000 if you are at a participating institution). Reach out to us if you hit a limit so we can discuss how best to meet your needs. You can also download these four collections in their original format directly from Documenting the American South.

Your Own Institutional Content

We have received requests from institutions to load their own content into the platform and make it available either to just their constituents or to anyone using the platform. This is a feature we are considering building into our advanced tier of service. If this is of interest to you, please contact us; we would love to brainstorm with institutions about this possibility.

Alternatively, if you are an individual with a collection of content you hold locally, you may load it into our Analytics Lab and work with it on its own or side by side with datasets built on our platform. Please note that Analytics Lab instances are ephemeral and do not persist (e.g., if you load content and then walk away for 10 minutes, your session will have ended, and you will need to start a new session and re-upload your content).

Suggestions

We would like to continue to increase the variety and amount of content available for analysis. If you have specific requests, please let us know at tdm@ithaka.org.