This introduction explains the main kinds of text analysis methods for a business and data science audience. What are they? Why would you use them? How long will it take to apply them? (The methods presented here are among the best known, but the list is certainly not exhaustive.) Afterward, you'll be better prepared to decide how much or how little text analysis may be useful for your work.
There are five main questions that text analysis can help answer:
- What are these texts about?
- How are these texts connected?
- What emotions (or affects) are found within these texts?
- What names are used in these texts?
- Which of these texts are most similar?
Question 1: What are these texts about?
- Word Frequency (Beginner)
Counting how often each word occurs in a given text. This includes Bag of Words and is the basis of TF-IDF. Example: "What words are most common in customer support tickets?"
- Collocation (Beginner)
Examining which words occur close to one another. Example: "When people mention our premium product, what do they say about the packaging?"
- Topic Analysis (or Topic Modeling) (Intermediate)
Discovering the topics within a group of texts. Example: "What are the most frequent topics discussed in five years of email from our advertising department?"
- TF-IDF (Intermediate)
Finding the words that are significant to one text compared with the rest of the collection. Example: "Given a decade of board reports, are there seasonal issues that crop up in summer vs. winter?"
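To make the beginner methods above concrete, here is a minimal sketch in Python using only the standard library. The support tickets are invented for illustration, and the TF-IDF formula is the plain textbook version (term frequency times the log of inverse document frequency); libraries typically use smoothed variants, so treat this as a sketch rather than a reference implementation.

```python
import math
import re
from collections import Counter

# Hypothetical support tickets, invented for illustration.
tickets = [
    "The app crashes when I upload a photo",
    "Photo upload is slow and the app freezes",
    "I cannot reset my password",
]

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z']+", text.lower())

docs = [tokenize(t) for t in tickets]

# Word frequency: count every word across all tickets.
word_counts = Counter(word for doc in docs for word in doc)

# Collocation, in its simplest form: count adjacent word pairs (bigrams).
bigram_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))

def tf_idf(doc, docs):
    """Score each word in doc: term frequency times log inverse document frequency."""
    counts = Counter(doc)
    scores = {}
    for word, count in counts.items():
        tf = count / len(doc)                   # share of this document's words
        df = sum(1 for d in docs if word in d)  # number of documents containing the word
        scores[word] = tf * math.log(len(docs) / df)
    return scores

print(word_counts.most_common(3))
print(bigram_counts.most_common(2))
print(sorted(tf_idf(docs[2], docs).items(), key=lambda kv: -kv[1])[:3])
```

Words that appear in only one ticket (such as "password") score higher under TF-IDF than words shared across tickets (such as "i"), which is what makes the method useful for spotting what is distinctive about a text.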
Question 2: How are these texts connected?
- Concordance (Beginner)
Where is this word or phrase used in these documents? Example: "Show me every email where someone mentions our least visible product."
- Network Analysis (Advanced)
How are the authors of these texts connected? Example: "Given email data, how often does marketing connect with engineering?"
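Both methods above can be sketched in a few lines of Python. The emails, team names, and product name below are invented for illustration. The network part is reduced to its simplest form, counting messages between pairs of teams; a real network analysis would go on to measure things like centrality, usually with a dedicated graph library.

```python
import re
from collections import Counter

# Hypothetical emails, invented for illustration: (sender team, recipient team, body).
emails = [
    ("marketing", "engineering", "Can we add the Widget Mini to the spring launch page?"),
    ("engineering", "marketing", "Yes, the Widget Mini specs are ready for review."),
    ("marketing", "sales", "Sales deck attached for the quarterly meeting."),
]

def concordance(texts, phrase, width=20):
    """Return each occurrence of phrase with up to `width` characters of context per side."""
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    hits = []
    for text in texts:
        for match in pattern.finditer(text):
            start = max(match.start() - width, 0)
            end = min(match.end() + width, len(text))
            hits.append(text[start:end])
    return hits

# Concordance: every place the phrase appears, shown in context.
bodies = [body for _, _, body in emails]
for line in concordance(bodies, "widget mini"):
    print(line)

# Network analysis, simplified: count messages between each pair of teams,
# ignoring who sent and who received.
edges = Counter(frozenset((sender, recipient)) for sender, recipient, _ in emails)
print(edges[frozenset(("marketing", "engineering"))])
```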
Question 3: What emotions (or affects) are found within these texts?
- Sentiment Analysis (Intermediate)
Does the author use positive or negative language? Example: "How do our customers feel about our new product line?"
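One simple way to approach this is to count words from a list of positive and negative terms. The sketch below assumes a tiny hand-made lexicon and invented reviews, purely for illustration; real lexicons hold thousands of scored words, and many sentiment tools use trained models instead.

```python
import re

# Tiny hypothetical lexicon, invented for illustration.
POSITIVE = {"love", "great", "fast", "easy", "helpful"}
NEGATIVE = {"hate", "slow", "broken", "confusing", "expensive"}

def sentiment_score(text):
    """Return (positive hits - negative hits) / total words, between -1 and 1."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    positive = sum(word in POSITIVE for word in words)
    negative = sum(word in NEGATIVE for word in words)
    return (positive - negative) / len(words)

# Hypothetical customer reviews.
reviews = [
    "I love the new line, setup was fast and easy",
    "The app is slow and the pricing is confusing",
]
for review in reviews:
    print(f"{sentiment_score(review):+.2f}  {review}")
```

A word-counting approach like this misses negation ("not great") and sarcasm, which is part of why sentiment analysis is rated Intermediate rather than Beginner.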
Question 4: What names are used in these texts?
- Named Entity Recognition (Intermediate)
List every instance of a given kind of entity (such as people, places, or organizations) in these texts. Example: "What are all of the geographic locations mentioned by our users?"
- Removing Sensitive Information (Intermediate)
Remove sensitive or personally identifiable information (PII) from data for archiving. Examples: "Since our founder is retiring, we want to preserve his business emails." "We want to save user data without linking it to their identities."
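For PII that follows a predictable format, pattern matching is often a reasonable first pass. The sketch below uses two simplified, assumed patterns that catch common email addresses and US-style phone numbers only; the message is invented for illustration.

```python
import re

# Simplified, assumed patterns: common email addresses and US-style phone numbers only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")

def redact(text):
    """Replace email addresses and phone numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

# Hypothetical message.
message = "Contact Dana at dana@example.com or 555-867-5309 about the invoice."
print(redact(message))
```

Note that the name "Dana" is left in place. Names, places, and organizations do not follow a fixed pattern, so finding them is the job of Named Entity Recognition, which usually relies on a trained model rather than hand-written rules.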
Question 5: Which of these texts are most similar?
- Clustering (Advanced)
Which texts are the most similar? Example: "How does our help documentation compare with that of our competitors?"
- Supervised Machine Learning (Advanced)
Are there other texts similar to this one? Examples: "Given these examples of accessible content, can we identify where our content is not accessible?" "Given user search data, can we predict user search terms?"
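Both methods above depend on some way of measuring how alike two texts are. The sketch below shows one common measure, cosine similarity over word counts, and uses it to find the most similar pair. It is not a clustering algorithm or a trained classifier itself, only the kind of building block they rest on. The page names and snippets are invented for illustration.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Hypothetical help-page snippets, invented for illustration.
pages = {
    "ours_reset": "How to reset your password from the login page",
    "theirs_reset": "Reset your password using the login page link",
    "ours_billing": "Update your billing address and payment method",
}

def vectorize(text):
    """Turn text into a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors; 1.0 means identical word proportions."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

vectors = {name: vectorize(text) for name, text in pages.items()}

# Score every pair of pages, then pick the pair with the highest similarity.
scores = {
    (x, y): cosine_similarity(vectors[x], vectors[y])
    for x, y in combinations(vectors, 2)
}
most_similar = max(scores, key=scores.get)
print(most_similar, round(scores[most_similar], 2))
```

Here the two password-reset pages come out as the closest pair because they share most of their words, while the billing page shares almost none with either.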