This analysis examines a complete set of Facebook posts and tweets created on any account managed by any member of the U.S. Senate and House of Representatives between Jan. 1, 2015, and May 31, 2020. Researchers used the Facebook Graph API, the CrowdTangle API and the Twitter API to download the posts. The resulting dataset contains nearly 1.5 million Facebook posts from 712 different members of Congress who used a total of 1,389 Facebook accounts, and over 3.3 million tweets from 711 different members of Congress who used a total of 1,362 Twitter accounts.
This analysis includes all text in the downloaded posts, including image captions and emojis. Photo and video posts were not included in this analysis unless the post also contained meaningful text, such as a caption; text contained entirely within an image was not included in the analysis. Posts by nonvoting representatives were excluded, as were any posts produced by politicians before or after their official terms in Congress.
To facilitate a more complete over-time analysis, posts created during congressional recesses were included, and terms of office (which typically begin and end in the first week of January) were adjusted by a few days to start and end at the beginning and end of each year, respectively. For example, posts by members of Congress who served full terms in the 115th Congress are included in the analysis if they were created between Jan. 1, 2017, and Dec. 31, 2018 (inclusive), even though the official term began on Jan. 3, 2017, and ended on Jan. 3, 2019. The few independent members of Congress who do not officially belong to the Democratic or Republican parties are treated as members of the party with which they caucused for the majority of the time period analyzed in the report (i.e., Bernie Sanders is considered a Democrat, and Justin Amash is considered a Republican).
Identifying social media accounts
The first step in the analysis was to identify each member’s official and unofficial Facebook pages and Twitter profiles. Most members of Congress maintain multiple social media accounts on each platform, consisting of both an “official” account as well as one or more campaign or personal accounts. Official accounts are used to communicate information as part of the member’s representational or legislative capacity, and Senate and House members may draw upon official staff resources appropriated by Congress when releasing content via these accounts. Personal and campaign accounts may not draw on these government resources under official House and Senate guidelines.
Researchers started with an existing dataset of official and unofficial accounts for members of the 114th, 115th and 116th Congresses, and expanded it with supplementary data from the open-source @unitedstates project. Researchers then manually checked for additional accounts by conducting searches and checking the House and Senate pages of members who did not have at least two accounts on each platform. Every account was then manually reviewed and checked for accuracy. A handful of institutional accounts that periodically change ownership (e.g., committee chair accounts, @gopleader, etc.) were excluded from this analysis in favor of focusing on accounts that have been consistently owned and maintained by a single member of Congress. In total, researchers identified 1,423 Facebook accounts and 1,450 Twitter accounts belonging to 715 different members of Congress. (In all, 34 Facebook pages and 88 Twitter accounts did not ultimately produce any eligible posts during the study period. These largely consisted of inactive or private personal accounts; all but three members of Congress were active on Facebook, and all but four were active on Twitter, on at least one account at some point during the study.)
Cleaning Facebook posts
Researchers originally began collecting Facebook data in 2015 using the Facebook Graph API. However, the Facebook Graph API introduced new restrictions in mid-2018 that limited researchers’ ability to collect posts from public Facebook pages. CrowdTangle, a public insights tool owned by Facebook, introduced an API that provides researchers with comparable data collection access. Researchers used the new system to continue ongoing data collection efforts while also re-collecting existing posts from the beginning of 2015 in order to evaluate the coverage of both APIs and to look for any discrepancies. In most cases, data from these APIs appear to be virtually identical, but researchers discovered a number of idiosyncrasies that had to be addressed before any analysis could be conducted. These include changes in unique identifiers, auto-generated post text, and duplicate posts.
Correcting Facebook identifiers
Like most social media platforms, Facebook generates a unique identifier for each page and post on its platform. Page identifiers take the form of a long string of numbers, and post identifiers appear to follow one of two different patterns (see the sketch after this list):
- A standard two-part pattern, composed of a prefix that corresponds to the unique ID of the authoring Facebook page and a suffix that identifies the post (e.g., 12345_67890)
- A new pattern introduced in the CrowdTangle API that does not contain a prefix (e.g., 67890:0:9)
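For illustration, these two conventions can be distinguished with simple regular expressions. The sketch below is a hypothetical example rather than the Center's actual code; the function and pattern names are placeholders.

```python
import re

# Standard two-part pattern: page-ID prefix, underscore, post-ID suffix (e.g., 12345_67890)
STANDARD_PATTERN = re.compile(r"^\d+_\d+$")
# Alternative CrowdTangle pattern with no page-ID prefix (e.g., 67890:0:9)
CROWDTANGLE_PATTERN = re.compile(r"^\d+(:\d+)+$")

def classify_post_id(post_id: str) -> str:
    """Return which identifier convention a post ID appears to follow."""
    if STANDARD_PATTERN.match(post_id):
        return "standard"
    if CROWDTANGLE_PATTERN.match(post_id):
        return "crowdtangle"
    return "unknown"

print(classify_post_id("12345_67890"))  # standard
print(classify_post_id("67890:0:9"))    # crowdtangle
```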
While account usernames may occasionally change, page identifiers normally act as a permanent reference to a particular page. However, Center researchers have observed occasions in which unique page identifiers unexpectedly changed. Most of these changes appear to have occurred after the end of an election season, when a number of politicians change the titles of their Facebook pages – removing suffixes such as “for Congress” or adding honorifics like “Senator” to their name. These changes to pages’ unique identifiers also affected the prefixes of their posts’ unique identifiers. Researchers developed an extensive series of scripts to scan for identifier changes and other kinds of mismatches, which could then be reviewed and corrected manually. Five pages in total appear to have changed their unique identifier.
Correcting Facebook post attributions
In addition to returning posts authored by the account owners themselves (“original content”), both the original Facebook Graph API and the CrowdTangle API also occasionally returned content that was posted to a politician’s Facebook page by a visitor (“guest content”), depending on the page’s privacy settings and how actively its owner curates their page. Using the original Graph API, determining whether a post was original or guest content was straightforward because the unique page identifier for a post’s author was contained in its metadata. However, the new CrowdTangle API does not provide this information.
To overcome this challenge, researchers examined a sample of posts and formulated a set of rules to determine a post’s author based on patterns found in its unique identifier (see the sketch after this list):
- If a post was collected from the original Facebook Graph API, it is original (i.e. non-guest) content if the included “from_id” field matches the page’s unique identifier.
- If a post was collected from CrowdTangle:
- If a post’s unique identifier follows the standard two-part pattern, then it is original content if the prefix matches the page’s unique identifier.
- If a post’s unique identifier follows the alternative CrowdTangle pattern, then it is original content if the link provided by the API does NOT include the text “fbid=”.
- Otherwise, the post is guest content.
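The rules above can be expressed as a short decision function. The following is a minimal sketch under assumed field names (source, post_id, from_id, link); it is not the Center's actual code.

```python
def is_original_content(post: dict, page_id: str) -> bool:
    """Return True if a post appears to be authored by the page owner (original content)."""
    if post["source"] == "graph_api":
        # Graph API posts include the author's unique identifier directly.
        return post.get("from_id") == page_id
    # CrowdTangle posts: fall back to the identifier patterns.
    post_id = post["post_id"]
    if "_" in post_id:
        # Standard two-part pattern: original content if the prefix matches the page ID.
        return post_id.split("_")[0] == page_id
    if ":" in post_id:
        # Alternative CrowdTangle pattern: original content if the link lacks "fbid=".
        return "fbid=" not in post.get("link", "")
    return False  # otherwise, treat the post as guest content
```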
After applying the above rules to the full database, researchers drew another sample of posts to check their accuracy, consisting of:
- From each of five pages that had been observed to have multiple unique identifiers:
- Up to 10 random posts that were determined to be guest content, had a prefix that matched the page’s current identifier or one of its historical ones, and followed the standard two-part pattern.
- Up to 10 random posts that were determined to be original content, had a prefix that did not match the page’s current identifier or one of its historical ones, and followed the standard two-part pattern.
- Across all of the remaining pages:
- 100 random posts that were determined to be guest content that followed the standard two-part pattern.
- 100 random posts that were determined to be guest content that followed the alternative CrowdTangle pattern.
- 100 random posts that were determined to be original content that followed the standard two-part pattern.
- 100 random posts that were determined to be original content that followed the alternative CrowdTangle pattern.
Researchers examined each of the 572 posts in the resulting sample and determined that every single post was correctly attributed.
Removing Facebook post annotations
Facebook posts collected from either API can contain text data in five different fields of a post: its story, message, caption, title and description. To represent the full content of each post, researchers concatenated these values into a single piece of text. Depending on the type of post (a photo, status update, link share, event, etc.), each individual field can contain different information: an actual message, a photo caption, a description of an event, a snippet from a news article or the name of a photo album.
One or more of a post’s text fields also frequently contain auto-generated annotations such as “Senator Smith posted 2 photos,” “Senator Smith updated their status” or “Senator Smith was live.” In some cases, posts are composed entirely of such annotations, such as when a politician adds a photo to an album without providing a caption.
While examining a sample of posts that had been collected from both the original Graph API as well as the CrowdTangle API, researchers noticed that the presence and/or location of these annotations sometimes varied between the original version of the post and the new CrowdTangle version (e.g., text in the “story” and “caption” field were swapped). Some CrowdTangle posts also appeared to contain more auto-generated annotations than versions obtained via the Graph API.
Normally, researchers could address these discrepancies by simply choosing to preserve the newer CrowdTangle version of each post, accepting that the post’s text content might shuffle or expand slightly. However, researchers had previously observed the presence of duplicate posts from the original API, and suspected that duplicates were also likely to be present in new CrowdTangle posts – especially among those that had been collected using both methods. Because some of the techniques used to identify potential duplicates rely on comparing the text of different posts, and differences in auto-generated annotations made those comparisons difficult, researchers had to find a way to remove the annotations prior to deduplication.
Researchers iteratively drew samples of posts covering every post category (photos, status updates, etc.) and every combination of null and non-null text values across all five text fields, and developed a series of regular expressions to capture every observed annotation. This was repeated until no more annotation patterns could be found. Researchers then applied the 13 resulting patterns to remove annotations across the entire database.
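As an illustration of this step, the sketch below concatenates a post's text fields and strips a few annotation patterns. The regular expressions shown are hypothetical stand-ins; the 13 patterns actually used are not reproduced here.

```python
import re

# Hypothetical examples of auto-generated annotations (not the actual 13 patterns).
ANNOTATION_PATTERNS = [
    re.compile(r"^.*\bposted \d+ photos?\.?\s*$", re.MULTILINE),
    re.compile(r"^.*\bupdated (his|her|their) status\.?\s*$", re.MULTILINE),
    re.compile(r"^.*\bwas live\.?\s*$", re.MULTILINE),
]

def full_text(post: dict) -> str:
    """Concatenate the five text fields into a single document."""
    fields = ("story", "message", "caption", "title", "description")
    return "\n".join(post.get(f) or "" for f in fields).strip()

def strip_annotations(text: str) -> str:
    """Remove annotations; a post that becomes empty was composed entirely of them."""
    for pattern in ANNOTATION_PATTERNS:
        text = pattern.sub("", text)
    return text.strip()
```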
In contrast to prior Center reports on congressional social media use that focused exclusively on Facebook, this new analysis includes data from Twitter, which does not produce any comparable annotations. Accordingly, researchers decided to remove annotations from Facebook posts not only for the deduplication process, but also for the entirety of this analysis. Exclusion of this auto-generated content not only allows for a direct comparison between the two platforms, but also allows researchers to focus on politicians’ substantive messaging. Facebook photos, videos, and other posts that contained a substantive message or caption are still included, just without additional annotations; 29,979 Facebook posts that were composed entirely of annotations were excluded from the analysis.
De-duplicating Facebook posts
After removing auto-generated annotations, researchers scanned the entire database for potential duplicates by identifying any pair of posts that met any of the following criteria (the text-similarity check is sketched in code after the list):
- The posts were posted at the exact same time (to the second).
- The posts had identical links to an internal Facebook URL, and the posts were created within 48 hours of each other.
- The posts had identical unique identifier suffixes but one followed an alternative convention that appears in CrowdTangle data (e.g. 12345_67890 vs. 67890:0:9).
- The posts were created within 48 hours of each other and their text was at least 60% similar (based on both TF-IDF cosine similarity and Levenshtein ratios).
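The text-similarity criterion can be sketched as follows, assuming a pair of posts already known to fall within the 48-hour window. The rapidfuzz library stands in here for the Levenshtein ratio, and the requirement that both metrics exceed the threshold is an assumption; this is not the Center's actual code.

```python
from rapidfuzz import fuzz                                   # fuzz.ratio is scaled 0-100
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def texts_are_similar(text_a: str, text_b: str, threshold: float = 0.60) -> bool:
    """Flag a candidate duplicate pair if both similarity measures reach the threshold."""
    tfidf = TfidfVectorizer().fit_transform([text_a, text_b])
    cosine = cosine_similarity(tfidf[0], tfidf[1])[0, 0]
    levenshtein = fuzz.ratio(text_a, text_b) / 100.0
    return cosine >= threshold and levenshtein >= threshold
```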
A sample of 100 potential duplicate pairs was examined by two coders each, who viewed each post directly on the Facebook website to determine whether the posts were duplicates. There were 19 pairs in which one or more posts were no longer viewable because a page or post had been deleted; for the remaining 81 pairs, the coders were in perfect agreement about whether the posts were duplicates.
A new sample of 1,000 posts was then extracted and divided amongst the coders, who repeated the exercise. Of these posts, 873 were fully viewable, 47% of which were duplicate posts and 53% of which were false positives. This sample was then used to train an XGBoost classification model with the following features:
- Whether or not any of the following fields were identical: the Facebook ID, creation time, type, status type, link, alternative link, title, story, message, caption, description, source, likes, shares, total comments, document text, prefix of the alternative link, and all fields that capture the total number of each kind of reaction (e.g., “haha”);
- Raw difference in likes, comments, shares;
- Ratio of the difference between likes, comments, shares;
- Fuzzy ratio, partial fuzzy ratio, minimum length, maximum length, and ratio between the longest and shortest value for: title, story, message, caption, description, source, document text, link, alternative link, alternative link prefix, Facebook ID, picture link;
- Whether the timestamps had the same: day, hour, minute, second;
- Difference in timestamps in seconds;
- Whether the difference in seconds was exactly divisible by 60;
- Number of overlapping characters in the Facebook IDs;
- Whether each character position in the Facebook IDs matched;
- Number of overlapping characters in the Facebook ID suffix;
- Whether the posts were both photos, based on type and status type.
The model achieved an average of 0.98 precision and 0.97 recall using 5-fold cross-validation, and 1.0 precision and 0.97 recall on the held-out set of 81 pairs that were initially coded by human judges. Cohen’s kappa was 0.97. This model was then applied to the full database. Of the posts examined in this analysis, 31,883 were determined to be duplicates and were removed (2% of all posts that remained after removing auto-generated annotations).
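The training and evaluation workflow can be sketched roughly as below. The hyperparameters and the synthetic stand-in data are illustrative assumptions, not the Center's actual configuration.

```python
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the hand-coded sample of pairwise features and labels.
X, y = make_classification(n_samples=873, n_features=60, random_state=0)

model = xgb.XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
scores = cross_validate(model, X, y, cv=5, scoring=["precision", "recall"])
print("precision:", scores["test_precision"].mean())
print("recall:", scores["test_recall"].mean())

# After validation, the fitted model is applied to every candidate pair in the database.
model.fit(X, y)
```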
Additional data cleaning and filtering
Engagement analysis
Engagement with posts on Facebook can come in the form of likes, comments, shares (when another user reshares or “forwards” a politician’s post) and a variety of other emoticon reactions (e.g., angry, happy, sad). In the same way, politicians can engage with posts produced by other Facebook users. When a politician shares a post from another account, Facebook effectively creates a new copy of the post with its own engagement, allowing researchers to track how many users share the politician’s copy of the post.
Engagement on Twitter functions much the same way; tweets can receive “favorites” (the equivalent of a “like” on Facebook) and can be retweeted by others (the equivalent of a “share” on Facebook). However, when a user retweets a tweet that was produced by another account, the user’s copy of the tweet does not distinguish between favorites and retweets for the original tweet versus those for the retweet. Researchers therefore cannot attribute the number of favorites and retweets that the tweet received to the retweeter. Accordingly, all analysis in this report that examines the number of times politicians’ tweets were favorited or reshared by others is restricted to the 76% of tweets in the dataset that were originally authored by the politicians themselves.
Engagement keywords
To identify keywords associated with boosts in engagement, the text of each document was converted into a set of features representing words and phrases. To accomplish this, each document was passed through a series of pre-processing functions. First, researchers removed 3,059 “stop words” that included common English words, names and abbreviations for states and months, numerical terms like “first,” and a handful of generic terms common on social media platforms like “Facebook” and “retweet.” The text of each post was then lowercased, and URLs and links were removed using a regular expression. Common contractions were expanded into their constituent words, punctuation was removed and the text was tokenized on the remaining white space. Finally, words were lemmatized (reduced to their semantic root form) and filtered to those containing three or more characters. Terms were then grouped into one-, two- and three-word phrases.
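A condensed sketch of this preprocessing pipeline is shown below. The stop-word list, contraction map and regular expressions are simplified stand-ins for the actual resources used.

```python
import re
from nltk.stem import WordNetLemmatizer   # requires nltk.download("wordnet")
from nltk.util import ngrams

STOP_WORDS = {"the", "and", "first", "facebook", "retweet"}   # 3,059 terms in the real list
CONTRACTIONS = {"don't": "do not", "we're": "we are"}         # simplified contraction map
URL_RE = re.compile(r"https?://\S+")
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> list:
    """Lowercase, strip URLs and punctuation, expand contractions, lemmatize and filter."""
    text = URL_RE.sub(" ", text.lower())
    for contraction, expansion in CONTRACTIONS.items():
        text = text.replace(contraction, expansion)
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = [lemmatizer.lemmatize(t) for t in text.split()]
    return [t for t in tokens if len(t) >= 3 and t not in STOP_WORDS]

def phrases(tokens: list) -> list:
    """Group tokens into one-, two- and three-word phrases."""
    return [" ".join(g) for n in (1, 2, 3) for g in ngrams(tokens, n)]
```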
Then, for each year, party, platform and term-size combination, researchers trained two Stochastic Gradient Descent (SGD) L2-penalized ridge regression models: one to predict the logged number of favorites or reactions a post received, and another to predict the logged number of shares or retweets. Each model attempted to predict these values using two sets of features: binary flags (“dummy variables”) for each politician, and binary flags indicating whether or not each post mentioned any keyword or phrase that was used by at least 100 politicians and in at least 0.1% of the posts. After each model was trained, researchers predicted the favorites/reactions and shares/retweets for each word or phrase flag and each politician, and calculated the keyword’s predicted effect for the median politician. These effects were then compared against the predicted engagement for a post from the median politician that did not mention any of the words or phrases included in the model, represented as a percentage difference. After combining all of the model predictions for all one-, two- and three-word phrases from each year, party and platform combination, researchers then identified terms that were associated with at least a 10% boost in both favorites/reactions and shares/retweets on both platforms and were used at least 1,000 times in posts from a specific party in a given year. Finally, researchers averaged the predicted boosts for each keyword across platforms and metrics (favorites, reactions, shares and retweets) to select the top keywords for each party and year. The resulting selection of keywords represents those that were associated with notably higher engagement on both platforms.
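One of these models might be sketched as follows, with placeholder arguments for the politician dummies, keyword flags and engagement counts; the hyperparameters are assumptions rather than the values actually used.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

def keyword_boost(X_politicians, X_keywords, engagement, keyword_index, median_politician_row):
    """Estimate a keyword's predicted engagement boost (%) for the median politician."""
    X = np.hstack([X_politicians, X_keywords])
    model = SGDRegressor(penalty="l2", max_iter=1000).fit(X, np.log1p(engagement))

    # Baseline: a post by the median politician mentioning none of the modeled terms.
    baseline = np.hstack([median_politician_row, np.zeros(X_keywords.shape[1])])
    with_keyword = baseline.copy()
    with_keyword[X_politicians.shape[1] + keyword_index] = 1   # flag the keyword of interest

    pred_base, pred_kw = np.expm1(model.predict(np.vstack([baseline, with_keyword])))
    return (pred_kw - pred_base) / pred_base * 100
```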
Event detection
To identify periods of unusually high engagement, researchers computed the median politician’s average daily favorites/reactions and shares/retweets for each platform and party combination, and computed day-over-day percentage changes. Events were then defined as starting on any day in which the median legislator’s average favorites, reactions, shares and retweets increased by at least 10% on both platforms, and continuing for as long as both engagement metrics on each platform continued to increase day-over-day. In other words, events are defined as periods of increasing engagement on both platforms, starting with days in which engagement on both platforms jumped at least 10% relative to the day prior. Researchers then computed the overall percentage increase in each platform metric for the event by comparing the final day of the event (when engagement was at its peak) to the day prior to its start.
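The event-detection rule can be approximated with the following sketch, where `daily` is assumed to be a date-indexed DataFrame holding the median politician's average daily value for each platform metric.

```python
import pandas as pd

def detect_events(daily: pd.DataFrame, jump: float = 0.10) -> list:
    """Return (start, peak) date pairs for periods of rising engagement on all metrics."""
    pct = daily.pct_change()   # day-over-day percentage change for every metric
    events, day = [], 0
    while day < len(pct):
        if (pct.iloc[day] >= jump).all():                          # every metric jumps >= 10%
            start = day
            while day + 1 < len(pct) and (pct.iloc[day + 1] > 0).all():
                day += 1                                           # extend while all keep rising
            events.append((daily.index[start], daily.index[day]))
        day += 1
    return events
```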
Researchers then labeled each event by first identifying keywords that were distinctive of the event period relative to the seven days prior to the event (using pointwise mutual information), and then searching for historical news headlines that were topically related to those keywords.
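A rough sketch of the keyword-scoring step: pointwise mutual information compares how often a word appears during the event window with how often it appears overall (event window plus the seven prior days). The tokenized inputs are assumed to come from the preprocessing described earlier.

```python
import math
from collections import Counter

def pmi_scores(event_tokens: list, prior_tokens: list) -> dict:
    """Score words by how distinctive they are of the event period (higher = more distinctive)."""
    event_counts, prior_counts = Counter(event_tokens), Counter(prior_tokens)
    total = sum(event_counts.values()) + sum(prior_counts.values())
    p_event = sum(event_counts.values()) / total
    scores = {}
    for word, count in event_counts.items():
        p_word = (count + prior_counts[word]) / total   # P(word) across both windows
        p_joint = count / total                         # P(word, event window)
        scores[word] = math.log(p_joint / (p_word * p_event))
    return scores
```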
Follower/subscriber trends
Researchers also collected data about the number of followers each Twitter profile had and the number of followers – also called subscribers or “page likes” – that each Facebook page had over time. This information has been updated regularly since 2015, but not always every day. In some cases, researchers did not identify and start tracking an account until well after its creation, or did not capture information for an account in its final days before deletion. Follower counts for each account are therefore only available for the period between when researchers first collected data on the account and the last time data was collected prior to an account being deleted. For dates within these periods, missing information is filled in using linear interpolation, to provide a close estimate of the number of followers each account had at each point in time. This process produces reliable estimates overall, even if estimates for individual accounts at particular points in time may be approximate.
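For example, a gap in an account's follower series might be filled roughly as follows using pandas; the dates and counts shown are hypothetical.

```python
import pandas as pd

observed = pd.Series({"2016-03-01": 10_000, "2016-03-05": 10_400, "2016-03-10": 11_000})
observed.index = pd.to_datetime(observed.index)

daily = observed.resample("D").mean()           # one row per day; unobserved days become NaN
estimated = daily.interpolate(method="time")    # linear interpolation between observations
print(estimated)
```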
Most of the missing data is concentrated at the beginning of the study period in 2015, when researchers were still building the database and identifying accounts. Accordingly, follower data prior to March 2016 are not reported in this analysis. In each year since 2016, researchers managed to collect follower data for at least 97% of the Facebook accounts and 90% of the Twitter profiles included in this study; coverage on both platforms was 100% in 2020. For 97% of the Facebook accounts and 96% of the Twitter accounts included in this study, follower data was successfully captured within seven days of the account’s most recent (or final) post or the legislator’s end of term (whichever came first).
Missing data
For reasons that appear to be related to both API and data parsing errors, a small number of tweets and Facebook posts from 2015, 2016 and 2017 are missing from the database. The missing data does not appear to be systematic, but is rather spread across hundreds of accounts. Text content was missing from 1,185 tweets, and these tweets were excluded from the analysis. Like and comment counts were missing from 16 Facebook posts, and share counts were missing from 2,008 posts; these posts are excluded in analyses of their respective engagement statistics, but are otherwise included (their text was not missing).