Every week tens of millions of Americans listen as their religious leaders provide teaching, comfort and guidance from the pulpit. But what are they hearing?

Today, Pew Research Center published “The Digital Pulpit,” its analysis of a broad swath of sermons delivered in U.S. churches during an eight-week period in 2019. Years in the making, the project employs advanced – and often specially built – computational tools to identify, transcribe and analyze nearly 50,000 sermons that U.S. churches livestreamed or shared on their websites.

We spoke with Dennis Quinn, the computational social scientist on the Center’s Data Labs team who spearheaded the project, about how it came together and the special challenges that arise when religion meets big data. The interview has been edited and condensed for clarity.

This project has been a long time in the making. How did the idea for it come about?

Dennis Quinn, Computational Social Scientist, Pew Research Center

I was interested in big data when I came to work on the Center’s Religious Restrictions project. So I asked Alan Cooperman, our director of religion research, if he had any ideas that might benefit from a big data approach, and he immediately brought up the idea of analyzing sermons.

The fundamental question was whether this was even feasible. For instance, is there a way to get any comprehensive list of churches with their websites? That led us to Google Maps, which we used to develop a database of churches. Was there a way for computers to identify sermons on churches’ websites? That led us to develop the machine learning technology that we used to identify the pages where congregations share their sermons.
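
The Center hasn’t published the code behind that machine learning step, but as a rough illustration of the general approach – classifying a page’s text as sermon-related or not – here is a minimal sketch using TF-IDF features and logistic regression. The training examples, model choice and threshold are all hypothetical, not the Center’s actual pipeline:

```python
# Minimal sketch of a sermon-page classifier. The Center's real model and
# features are not public; this only illustrates the general technique
# (TF-IDF text features fed to a linear classifier).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical training data: page text labeled 1 (sermon page) or 0 (not).
pages = [
    "Sunday sermon audio: Walking in Faith, Pastor J. Smith, 45 min",
    "Directions and parking information for our campus",
]
labels = [1, 0]

model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(pages, labels)

# Score a new page; pages above some threshold would be queued for collection.
print(model.predict_proba(["Listen to this week's message from our pastor"])[0][1])
```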

Over the almost two and a half years you spent on the project, what were some of the biggest challenges you faced?

A project like this spans everything from the minutiae of database design and computer code to big issues of policy and direction. I had to really be careful that a technical decision I was asking an engineer to make in August 2017 wouldn’t somehow come around and cause some unintended consequence for others down the line in 2019.

For instance, at the time we searched for congregations on Google Maps, we could not choose a single inclusive term for all types of congregations, so we used the term “church.” There was an alternative term, “place of worship,” but it was no longer supported in the program. That’s an example of a single line of code, written in the fall of 2017, that had huge implications for how we described our data in the fall of 2019, when we were writing the report. The “butterfly effect” of a big data project is staggering – the small, technical decisions that you make up front have colossal implications for the later direction of the project.
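
For readers curious what that kind of line looks like in practice, here is an illustrative query against the legacy Google Places Nearby Search API, which filters results by a place type such as church. The coordinates, radius and key are placeholders, and this is not the Center’s actual collection code:

```python
# Illustrative only: querying the (legacy) Google Places Nearby Search API
# for places typed "church". The endpoint and parameter names are real;
# the API key, coordinates and radius below are placeholders.
import requests

API_KEY = "YOUR_API_KEY"  # placeholder
url = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"
params = {
    "location": "38.9072,-77.0369",  # example: Washington, D.C.
    "radius": 5000,                  # search radius in meters
    "type": "church",                # the single inclusive term Quinn describes
    "key": API_KEY,
}
resp = requests.get(url, params=params, timeout=30)
for place in resp.json().get("results", []):
    print(place.get("name"), place.get("vicinity"))
```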

What were some of the privacy concerns that arose in the course of the project, and how did you address them?

Sermons often include people’s private religious moments, which they experienced in a real and often profound way. The churches did, of course, choose to share them online, so we felt it was appropriate to collect and analyze them, but we had to make sure that we were stewarding those data in a respectful way.

In a more technical sense, there were plenty of times that a website would have a password protecting the sermons, or they would be visible but stored in a way that made them hard to get to, and we decided that we weren’t going to touch those. If there was any effort at all on the part of the congregation to prevent any sort of automated collection, we made no effort to get past it. We also set limits on how fast the scraping program could move between pages, to ensure we didn’t overburden the congregational websites. And as an added privacy precaution, we decided not to list the names or locations of specific congregations or make any of the sermon texts available.
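
The report’s methodology doesn’t include the scraper itself, but the practices Quinn describes – skipping anything behind a login and pacing requests – are standard in polite web crawling, along with honoring robots.txt (a common courtesy, not something the report specifies). A minimal sketch under those assumptions, with an illustrative delay value, might look like this:

```python
# Sketch of a "polite" crawler: respects robots.txt, skips pages that
# require authentication, and pauses between requests. The delay value
# and function name are illustrative, not the Center's actual settings.
import time
import urllib.robotparser
from urllib.parse import urljoin, urlparse

import requests

DELAY_SECONDS = 5  # assumed pacing; the report doesn't publish the real rate


def fetch_politely(urls):
    robots = {}
    for url in urls:
        root = "{0.scheme}://{0.netloc}".format(urlparse(url))
        if root not in robots:
            rp = urllib.robotparser.RobotFileParser(urljoin(root, "/robots.txt"))
            try:
                rp.read()
            except OSError:
                rp = None  # robots.txt unreachable; proceed cautiously
            robots[root] = rp
        rp = robots[root]
        if rp is not None and not rp.can_fetch("*", url):
            continue  # the site asked crawlers to stay out
        resp = requests.get(url, timeout=30)
        if resp.status_code in (401, 403):
            continue  # password-protected or forbidden: don't try to get past it
        yield url, resp.text
        time.sleep(DELAY_SECONDS)  # don't overburden the congregation's server
```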

Some readers might not know how common or uncommon it is for congregations to share their sermons online. Can you talk a bit about this?

The scraper found sermons on about 6,000 out of the roughly 38,000 congregational websites we examined. Bear in mind that since those congregations have websites on Google Maps, they’re already more online than some. But if you think about it from a pastor’s perspective, a sermon is the fruit of your labor, so it’s not necessarily outlandish that you’d want it to be heard by the broader world – or, for that matter, available to congregants who can’t make it to church.

Why did you decide to build your own dataset of sermons, rather than using a ready-made database like a sermon aggregator?

When we at the Center are deciding whether to launch a new research project, we ask ourselves if it’s something we can do in a meaningful and rigorous way that harnesses our technical abilities and resources. Of course, the data we did collect are still not representative of all U.S. sermons – these are still sermons that congregations with websites chose to share online – but by collecting them from actual congregations, we at least know that they can tell us what a real swath of real churchgoers really heard during an eight-week period in 2019. We felt that if we were going to try to build a limited but insightful window into American religious services, we were going to do so in the best way possible. It was a “go big or go home” kind of moment.

Was there anything about the actual results that you found surprising?

The thing I was consistently in awe of was the sheer volume of information that we were working with, and not just in megabytes. The median sermon in the dataset is about 5,500 words, which is the length of a good-sized magazine article. I calculated that that’s about 80% longer than Federalist Paper Number 10. That’s a lot of information – we have 50,000 of these – and there are people out there internalizing this much information about the world around them on a weekly basis. And, in a technical sense, the fact that we were working with the equivalent of 50,000 magazine features was frankly intimidating.
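
As a back-of-envelope check on that comparison: Federalist No. 10 is commonly put at roughly 3,000 words (an outside figure, not from the report), which squares with the “about 80% longer” claim:

```python
# Rough check of the "about 80% longer" comparison.
median_sermon_words = 5_500  # from the report
federalist_10_words = 3_000  # assumed; commonly cited approximate length
print(f"{median_sermon_words / federalist_10_words - 1:.0%} longer")  # ~83%
```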

Given the inherent limitations in your approach, what should readers of the report bear in mind as they read it?

Readers should approach the findings with those limitations in mind. The congregations that shared these sermons are by definition technology-enabled. They’re also larger and more urban – and of course, these are the sermons they chose to share.

Still, you can see parallels to what we know about American sermons from other sources. For instance, there are plenty of conceptual reasons you might expect discussion of the Old Testament to drop around Easter Sunday. Well, that’s exactly what happened in the data. The National Congregations Study, which is a representative survey of U.S. religious congregations, asks each congregation how long their most recent sermon lasted. They find that the median congregation reports 30 minutes. Well, we find that our median sermon runs 37 minutes. Considering that these are two entirely different ways of answering the same question, that’s not that far off.

If you could do a follow-up to this study, how would you build on it?

It would be great to get a better sense of the real humans on both sides of the altar – the pastors and the congregants. So if we were going to do this bigger and better, it would be a value-add to know more about the opinions of the pastors, the contents of the sermons, and how they affect the opinions of the congregants.

Note: See full report and methodology.

Drew DeSilver is a senior writer at Pew Research Center.