Our recent report on podcast guests involved a fairly daunting research challenge: identifying all the guests who appeared on roughly 24,000 podcast episodes in 2022, based solely on the episode descriptions. We’re always on the lookout for more efficient ways of doing our work, so we decided to see if the newest generation of large language models, or LLMs – the technology behind popular tools like ChatGPT – could help us.

Related: Most Top-Ranked Podcasts Bring On Guests

In this Q&A, we talk with the researchers who worked on the analysis – computational social scientists Galen Stocking, Meltem Odabaş and Sam Bestvater – about how they approached this task, how it worked, and what they learned about using the new generation of LLMs for research purposes.

How would you typically approach a research project like this?

Galen Stocking, senior computational social scientist

Stocking: With a large data source like this, our first step is typically to ask whether we can identify what we’re looking for in an automated way. Until recently, on a project like this, that might mean using a script to search the episode descriptions for a list of known names or trying to train a specialized machine learning model to identify patterns in the text that signify names.

These are tried-and-true approaches that we’ve used before, but they would not have worked well in this case. The episode descriptions contain tens of thousands of names – not just hosts and guests, but also the names of people who are in the news or who are discussed but don’t appear on the show. The names also aren’t spelled or formatted consistently.

Instead, we likely would have had to tackle this problem by training human coders to read each of the 24,000 episode descriptions and note which guests are mentioned.

But doing this with humans is a lot harder than it sounds. The job is boring and repetitive, so it’s easy for people to lose focus. And when people lose focus, they can miss exactly the information we need them to catch.

What’s more, there are a lot of episode descriptions to read and code. If we assume it takes two minutes for someone to read each episode description, it would take a team of five workers about a month to go through the entire list, working eight hours a day, five days a week, without any breaks.
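The arithmetic behind that estimate can be checked directly, using only the figures given above:

```python
# Rough workload estimate for hand-coding, using the figures in the text.
EPISODES = 24_000
MINUTES_PER_EPISODE = 2
CODERS = 5
HOURS_PER_DAY = 8
DAYS_PER_WEEK = 5

total_hours = EPISODES * MINUTES_PER_EPISODE / 60     # 800 coder-hours
days_needed = total_hours / (CODERS * HOURS_PER_DAY)  # 20 working days
weeks_needed = days_needed / DAYS_PER_WEEK            # 4 weeks -- about a month

print(total_hours, days_needed, weeks_needed)  # 800.0 20.0 4.0
```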

How do large language models help you solve those problems?

Samuel Bestvater, computational social scientist

Bestvater: You might know about large language models from using OpenAI’s popular ChatGPT chatbot. These models can generate natural-sounding text and carry on conversations because they are trained on large amounts of data to build an advanced internal representation of human language.

Those same traits also make these models good at processing and understanding text that’s already been written – like podcast episode descriptions. And we don’t have to chat with them one line at a time. We can write code to interact directly and in an automated way with the underlying models themselves.

In this case, we gave the model a “prompt” that described the data we wanted it to examine, the specific information we wanted it to retrieve, and the rules we wanted it to follow in doing so. Then we used an automated script to show it each episode description and ask it to retrieve any guest names. You can find the exact language we used in the methodology of our report.

We’ve used similar models and their precursors before. For instance, we used them to help us identify tweets expressing support or opposition to the Black Lives Matter movement and compare the language House Freedom Caucus members and other Republicans use on Twitter, now known as X. We’ve found that one of their biggest advantages is shortening the timeline for doing boring or rote classification work – exactly what we needed for this project.

Did you test this new tool? What were your concerns going into the project?

Meltem Odabaş, computational social scientist

Odabaş: Anytime we use a new tool, we test it extensively and don’t simply trust that it works as advertised. In this case, we had a few baseline metrics it needed to hit. We wanted to make sure it could perform the basic task of identifying when guests were mentioned in podcast descriptions. And we also wanted to make sure it wasn’t “hallucinating” – making up guests who weren’t there.

That process started with the prompt we designed to instruct the model on what we wanted it to do. These instructions restricted the source of its answers so that it only told us what was in the episode descriptions and didn’t draw from any other sources. And they told the model not to guess when it wasn’t sure whether the episode had guests.

We also tested the model’s output on more than 2,000 randomly selected episode descriptions that our researchers had also categorized. By comparing those results, we could confirm that it performed about as well as our own trained researchers did – our threshold for using any sort of assistive technology in our published work. You can also find all those details in the report methodology.
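At its simplest, that kind of validation is an agreement check between model labels and human labels. The labels and numbers below are illustrative toy data, not Pew’s actual benchmarks, which are detailed in the report methodology:

```python
# Hedged sketch: compare model output to human-coded labels on a
# validation sample and report the share of cases where they agree.

def agreement(human: list[str], model: list[str]) -> float:
    """Share of items where the model label matches the human label."""
    assert len(human) == len(model), "samples must be the same length"
    matches = sum(h == m for h, m in zip(human, model))
    return matches / len(human)

# Toy validation sample (illustrative only).
human_labels = ["guest", "no_guest", "guest", "guest"]
model_labels = ["guest", "no_guest", "no_guest", "guest"]
print(f"Agreement: {agreement(human_labels, model_labels):.2f}")  # Agreement: 0.75
```

In practice, researchers typically look beyond raw agreement to measures that account for chance agreement and to error rates for each category.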

What happened when you coded the full set of podcast episode descriptions? How did the tool stack up to how you might have done things in the past?

Stocking: The biggest takeaway for us is that the process was very fast – especially compared to doing it by hand. The model took just a few days to churn through all 24,000 descriptions. That’s a big improvement over multiple people working for weeks on end.

A lot of the issues we ran into were ones we would see with any classification project like this. Sometimes the model thought that people with similar names were the same person, or incorrectly classified show hosts as guests. But those are exactly the types of errors that human coders might make, too, and our validation and quality control processes ensured that we could spot those problems and fix them.
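One of the error types mentioned above – hosts misclassified as guests – lends itself to a simple automated quality-control pass. The host and guest names below are made up for illustration:

```python
# Hedged sketch: flag extracted "guests" who are actually the show's
# known hosts, so a human reviewer can correct those cases.

def flag_host_errors(guests: list[str], hosts: set[str]) -> list[str]:
    """Return any extracted guest names that match a known host name."""
    return [g for g in guests if g in hosts]

hosts = {"Alex Rivera"}                    # illustrative host roster
extracted = ["Alex Rivera", "Jordan Lee"]  # illustrative model output
print(flag_host_errors(extracted, hosts))  # ['Alex Rivera']
```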

Still, we discovered some challenges that were unique to this tool. For instance, our model’s content moderation guardrails sometimes rejected episode descriptions if they mentioned concepts like crime or sex, even in fairly generic terms. It took some time to figure out what was happening and how to work around it.

All told, we found the tool to be extremely helpful for this particular project. But it isn’t something you can just turn loose and expect to see good results. It needs guidance, oversight and guardrails. We regularly check each other’s work when we number-check our reports, and that same sense of oversight applies to tools like this.

Aaron Smith is director of Data Labs at Pew Research Center.