Constellate provides a number of dataset options:

  1. Datasets in CSV of bibliographic metadata only. Individuals can download datasets of up to 25,000 items (or 50,000 if your institution participates in our beta program) from the Constellate dataset builder. We cap creation at 10 of these datasets per user per day, but you are welcome to come back several days in a row to assemble a larger dataset on your own.
  2. Datasets in JSON of bibliographic metadata, unigrams (the unique set of words in the texts and their frequency of occurrence), bigrams (the two-word phrases in the texts and their frequency of occurrence), and trigrams (the three-word phrases in the texts and their frequency of occurrence). The self-service datasets are limited to 25,000 items (or 50,000 if your institution participates in our beta program). We cap creation at 10 of these datasets per user per day, but you are welcome to come back several days in a row to assemble a larger dataset on your own.
  3. Datasets in JSON of full text for any open content in Constellate. This includes content from Reveal Digital, Chronicling America, Documenting the American South, early journal content from JSTOR (pre-1924), and open access books from JSTOR. It does not include any content from Portico. The full text is delivered automatically through the Constellate interface when you download a dataset that contains open items. (Note that the full text makes these datasets very large in terms of file size.)
  4. Datasets in CSV of sentences. In this scenario, you build a dataset in Constellate and give the Constellate team the dataset ID or search query, along with a term or regular expression to match. We then build and deliver a dataset of the sentences within your dataset that contain that term or match that regular expression. We can include sentences from any document from any data source, so long as the document is longer than 10 sentences and the matching sentences make up 10% or less of the total sentences in the document. (For example, you could build a dataset of performing arts content, and we would provide you with all the sentences from that dataset that contain the word “experimental”.)
  5. Datasets in JSON of full text for rights-restricted content. For JSTOR content only, you may sign a researcher agreement that allows us to provide you with the full text of rights-restricted content. If you need a dataset of the full text of JSTOR content, please fill out the Data for Research request form.
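
The eligibility rule for sentence datasets in option 4 can be sketched in a few lines of Python. This is only an illustration of the two stated conditions (more than 10 sentences, and matches at 10% or less of the document), not the Constellate team's actual pipeline. The function name `matching_sentences`, its parameters, and the assumption that each document has already been split into a list of sentences are all hypothetical.

```python
import re

def matching_sentences(sentences, pattern, min_sentences=10, max_match_percent=10):
    """Return the sentences matching `pattern`, or [] if the document is ineligible.

    Hypothetical helper: `sentences` is one document, already split into sentences.
    """
    # Condition 1: the document must be longer than 10 sentences.
    if len(sentences) <= min_sentences:
        return []

    regex = re.compile(pattern)
    matches = [s for s in sentences if regex.search(s)]

    # Condition 2: matching sentences must be 10% or less of all sentences.
    # Integer arithmetic avoids floating-point rounding at the boundary.
    if len(matches) * 100 > max_match_percent * len(sentences):
        return []

    return matches


# A 20-sentence document with 2 matches (exactly 10%) is eligible.
document = ["An experimental play opened."] * 2 + ["The season continued."] * 18
print(matching_sentences(document, r"experimental"))
```

Under these assumptions, a document with exactly 10 sentences, or one where 3 of 20 sentences match, would contribute no sentences to the delivered dataset.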

You may learn more about the format of the datasets you can download straight from Constellate.
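
To make the unigram, bigram, and trigram counts described in option 2 concrete, here is a minimal sketch that counts n-grams in a short string. The function `ngram_counts` is hypothetical, and the lowercasing and whitespace tokenization are simplifying assumptions; the counts in a real Constellate dataset may be produced with different tokenization rules.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count the n-word phrases in `text` (assumes lowercase, whitespace tokens)."""
    tokens = text.lower().split()
    # Slide a window of n tokens across the text and join each window into a phrase.
    grams = (" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return Counter(grams)


sample = "the cat saw the cat"
unigrams = ngram_counts(sample, 1)   # words and their frequencies
bigrams = ngram_counts(sample, 2)    # two-word phrases and their frequencies
trigrams = ngram_counts(sample, 3)   # three-word phrases and their frequencies

print(unigrams["the"], bigrams["the cat"], trigrams["the cat saw"])
```

In this sample, "the" and "the cat" each occur twice, while every trigram occurs once.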

If you have thoughts on other useful datasets we could provide, please let us know.