(Related posts: An intro to topic models for text analysis, Making sense of topic models, Overcoming the limitations of topic models with a semi-supervised approach, Interpreting and validating topic models, and Are topic models reliable or useful?)
As I’ve explained in a series of previous posts on this blog, topic models are an exciting type of algorithm that offers researchers the ability to rapidly identify (and potentially measure) the themes hidden within large amounts of text. In a 2018 Pew Research Center study, for example, my colleagues and I attempted to use topic models to measure the prevalence of notable themes in a set of open-ended survey responses about the things that provide meaning in Americans’ lives.
In my previous post in this series, I described how we interpreted the topics in our 2018 study by giving them each a label, such as “being in good health.” We then wanted to compare the models’ decisions with the labels that our own researchers assigned to the survey responses. But before we could do that comparison, we first had to make sure that our labels were clear enough for real-life people to interpret and assign in a consistent way.
For each topic, two researchers read through a random sample of 100 responses and flagged those that matched the label we had given the topic. We then used an inter-rater reliability metric — called Cohen’s Kappa — to compare our researchers’ labels and determine whether they agreed with one another at an acceptable rate. Seven of our 31 labels resulted in poor reliability (Kappas under 0.7), and we had to set them aside. But the majority of the topics showed promising reliability scores.
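For readers who want to see what that check looks like in code, here’s a minimal sketch of computing Cohen’s Kappa for two coders’ yes/no flags. The flags are made up, and the sketch uses scikit-learn’s cohen_kappa_score rather than whatever implementation the study actually relied on.

```python
# Toy illustration of Cohen's Kappa for two coders' binary topic flags.
# The flags below are hypothetical; in the study, each coder flagged a
# 100-response sample for whether each response matched a topic's label.
from sklearn.metrics import cohen_kappa_score

coder_a = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]  # 1 = response matches the label
coder_b = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]

kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's Kappa: {kappa:.2f}")  # Kappas of 0.7 or above were treated as acceptable
```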
We weren’t quite done yet, though. To make sure that our high levels of agreement for these topics weren’t just the product of lucky samples, we also used a statistical test to confirm that all of our reliability estimates met or exceeded our minimum reliability score at the 95% confidence level. Not all of our topics passed this test, sadly.
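The post doesn’t spell out which statistical test we used, but to give a sense of how such a check might look, here is one hypothetical version based on bootstrapping: resample the coded responses, recompute Kappa each time, and see whether the one-sided 95% lower bound clears the 0.7 minimum. All of the data below are simulated.

```python
# Hypothetical bootstrap check that an observed Kappa exceeds the 0.7 minimum
# at the 95% confidence level (one of several possible tests, not necessarily
# the one used in the study).
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Simulated paired codes from two researchers on a 100-response sample
coder_a = rng.binomial(1, 0.3, size=100)
coder_b = np.where(rng.random(100) < 0.95, coder_a, 1 - coder_a)  # ~95% raw agreement

boot_kappas = []
for _ in range(5000):
    idx = rng.integers(0, 100, size=100)  # resample responses with replacement
    boot_kappas.append(cohen_kappa_score(coder_a[idx], coder_b[idx]))

observed = cohen_kappa_score(coder_a, coder_b)
lower_bound = np.percentile(boot_kappas, 5)  # one-sided 95% lower bound
print(f"Observed Kappa: {observed:.2f}, 95% lower bound: {lower_bound:.2f}")
print("Passes" if lower_bound >= 0.7 else "Needs a larger sample")
```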
Reaching this level of confidence depended on three different factors: the size of the sample, how frequently the topic appeared and how well the researchers agreed with each other. Despite our high levels of agreement, some of our topics simply appeared too infrequently in our small random samples, requiring us to expand the samples and code more responses to establish adequate confidence. Whether or not we could salvage these topics depended on how many more responses we would have to label and how much time that would take us.
Using each topic’s initial random sample of 100 responses, we were able to get a rough baseline of the topic’s prevalence in our data and compute a preliminary Kappa for the topic. We then used these values to calculate how much larger the sample would have to be to confirm the Kappa we were observing with 95% confidence.
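As a rough illustration of that kind of calculation, the sketch below plugs a hypothetical prevalence and preliminary Kappa into a common large-sample approximation for Kappa’s standard error and solves for the sample size needed to clear the 0.7 minimum with one-sided 95% confidence. Both the input numbers and the formula are illustrative; the study’s actual calculation may have differed.

```python
# Back-of-the-envelope sample size estimate for confirming Kappa >= 0.70, using
# a standard large-sample approximation:
#   SE(kappa) ~ sqrt(p_o * (1 - p_o)) / ((1 - p_e) * sqrt(n))
# All input values are hypothetical.
import math

kappa_hat = 0.80    # preliminary Kappa from the initial 100-response sample
prevalence = 0.05   # rough share of responses containing the topic
kappa_min = 0.70    # minimum acceptable reliability
z = 1.645           # one-sided 95% confidence

# For two coders flagging a binary topic at roughly the same rate:
p_e = prevalence**2 + (1 - prevalence)**2   # expected chance agreement
p_o = kappa_hat * (1 - p_e) + p_e           # implied observed agreement

se_per_sqrt_n = math.sqrt(p_o * (1 - p_o)) / (1 - p_e)
n_required = math.ceil((z * se_per_sqrt_n / (kappa_hat - kappa_min)) ** 2)
print(f"Approximate responses needed: {n_required}")  # grows quickly as prevalence falls
```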
For some topics, we only needed to code a few additional responses to reach our desired confidence, so we simply expanded the existing random samples with more responses and were able to confirm that the researchers reliably agreed with each other. But for a handful of rarer topics, it looked like we would have to code thousands more cases or somehow improve our already-high agreement, neither of which was realistic.
Enter keyword oversampling
To overcome this problem, we used a technique called keyword oversampling. Instead of coding a larger random sample for topics that appeared in only a small fraction of responses, we drew new samples that were disproportionately composed of documents containing keywords we suspected were related to the topic label we were trying to code: 50% of each new sample contained one or more of the keywords, and 50% did not.
In this case, coming up with lists of keywords was easy, as we had already developed a set of anchor words for each topic. We could assume that around half of the responses in these new samples would correctly contain our target topics and that our level of reliability was likely to be similar to our initial random samples. Using these assumptions, we could roughly estimate the number of responses that we needed to label in each of the new keyword-based samples to verify our reliability with confidence. This time, because we expected to have far more positive cases in this new set of responses, we didn’t have to code nearly as many documents as we would have with a completely random sample.
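To make the sampling step concrete, here’s a small sketch of how a keyword-oversampled coding sample could be drawn. The responses, anchor words and sample sizes are all made up, not the study’s actual data.

```python
# Hypothetical keyword oversampling: draw a coding sample half from responses
# that contain at least one anchor word for the topic, half from those that don't.
import random

random.seed(42)

responses = [
    "Staying healthy and exercising keeps me going",
    "My family is everything to me",
    "Recovering from my illness changed my outlook",
    "I love my job and my coworkers",
    "Good health lets me enjoy my grandkids",
    "Travel and new experiences give me purpose",
]
anchor_words = {"health", "healthy", "illness", "sick"}  # hypothetical anchor words

def has_anchor_word(text):
    return bool(set(text.lower().split()) & anchor_words)

with_kw = [r for r in responses if has_anchor_word(r)]
without_kw = [r for r in responses if not has_anchor_word(r)]

n_per_stratum = 2  # coding budget per stratum (tiny here; much larger in practice)
sample = random.sample(with_kw, n_per_stratum) + random.sample(without_kw, n_per_stratum)
random.shuffle(sample)  # mix the strata so coders can't tell which is which
print(sample)
```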
After coding the new samples and computing Cohen’s Kappas using sampling weights that properly penalized false negatives, we were able to confirm with 95% confidence that we had clear and interpretable labels for an additional seven topics that we might otherwise have had to set aside.
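The accompanying notebook linked at the end of this post shows the full weighting procedure; the sketch below only illustrates the general idea with made-up numbers, weighting each coded response by the inverse of its stratum’s sampling rate so that responses drawn from the much larger “no keyword” stratum count for more.

```python
# Hypothetical weighted Cohen's Kappa on a keyword-oversampled sample: each
# coded response gets an inverse-probability weight based on its stratum.
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(1)

# Hypothetical population stratum sizes and per-stratum coded sample sizes
pop_with_kw, pop_without_kw = 500, 9500   # responses with / without anchor words
n_with_kw, n_without_kw = 150, 150        # responses coded from each stratum

# Simulated paired codes from the two researchers
coder_a = np.concatenate([rng.binomial(1, 0.6, n_with_kw),
                          rng.binomial(1, 0.02, n_without_kw)])
coder_b = np.where(rng.random(n_with_kw + n_without_kw) < 0.95, coder_a, 1 - coder_a)

# Inverse-probability sampling weights by stratum: the rarely sampled
# "no keyword" stratum carries much larger weights per response
weights = np.concatenate([np.full(n_with_kw, pop_with_kw / n_with_kw),
                          np.full(n_without_kw, pop_without_kw / n_without_kw)])

kappa = cohen_kappa_score(coder_a, coder_b, sample_weight=weights)
print(f"Weighted Cohen's Kappa: {kappa:.2f}")
```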
The benefits of keyword oversampling
Keyword oversampling can be a powerful way to analyze uncommon subsets of text data.
In several past Pew Research Center projects in which we analyzed congressional press releases and Facebook posts, we were interested in measuring expressions of political support and disagreement using supervised classification models. Doing this successfully required us to collect and label a large set of training data. But since partisan statements are just one of many different types of statements that legislators make, we were concerned that we would need to label an unrealistically large random sample to ensure that it contained enough relevant examples to train an accurate model.
Instead of sampling randomly, we brainstormed a list of keywords that we expected to be related to statements of support and disagreement — words like “applaud” and “disapprove,” and the names of common political figures like “Obama” and “Trump.” We then extracted a training dataset with a disproportionately greater number of posts containing these terms.
Even though we used sampling weights to account for the oversample by proportionally reducing the importance of posts containing the keywords, the additional positive cases in our sample improved the performance of our models substantially. Without this sampling technique, training the classifiers would have required much larger (and more expensive) random samples.
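As a rough sketch of that setup, a scikit-learn classifier can take per-document weights at training time. The posts, labels, keywords and weight values below are invented for illustration; in practice the weights would reflect the actual sampling probabilities rather than an arbitrary constant.

```python
# Hypothetical training setup: keyword-oversampled posts are down-weighted so
# the classifier isn't dominated by the oversampled stratum.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

posts = [
    "I applaud the president's decision on this bill",
    "I strongly disapprove of this reckless legislation",
    "Our office will be closed for the federal holiday",
    "Apply now for a summer internship in our district office",
]
labels = [1, 1, 0, 0]  # 1 = expresses support or disagreement (hypothetical labels)
keywords = {"applaud", "disapprove", "obama", "trump"}

# Posts sampled because they contain a keyword get a smaller weight
# (0.5 is arbitrary here; real weights would come from the sampling design)
sample_weights = [0.5 if set(p.lower().split()) & keywords else 1.0 for p in posts]

X = TfidfVectorizer().fit_transform(posts)
clf = LogisticRegression().fit(X, labels, sample_weight=sample_weights)
print(clf.predict(X))
```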
In the case of our open-ended survey responses, keyword oversampling allowed us to verify our inter-rater reliability for seven rarer topics that we might otherwise have been forced to abandon, leaving us with a total of 24 topics for which we had clear, agreed-upon definitions. In the process, we also produced an expert baseline for each topic that could be compared to the output of the topic models. This allowed us to test whether or not the models could actually be used to measure the topics in a way that agreed with our own interpretations. In the handful of cases where our researchers’ labels didn’t align with one another, we discussed the disagreements and resolved them, leaving us with a human-verified ground truth for each of our topic labels.
After comparing these baselines to the labels assigned by the topic models, we were pleased to find that the models could reliably emulate our own decisions for most of the topics. However, we were also surprised to find that an even simpler method outperformed the models in nearly every case. In my next and final post in this series, I’ll reveal what we discovered and discuss what we learned about the usefulness of topic models more generally.
To see an example of keyword oversampling in action, you can check out this Jupyter notebook on GitHub.