The Center for Public Integrity is a nonprofit newsroom that investigates betrayals of public trust.
As deadly Ebola raged in Africa and threatened the United States, the Centers for Disease Control and Prevention pinpointed a problem: The agency had many sources of data on the disease but no easy way to combine them, analyze them on a single platform and share the information with partners. It was using several spreadsheets and applications for this work — a process that was “manual, labor-intensive, time-consuming,” according to the agency’s request for proposals to solve the problem. It spent millions building a new platform.
But at the beginning of the coronavirus pandemic, the CDC still struggled to integrate and share data. The system it had built during the Ebola crisis wasn’t up to the task. An effort to modernize all of the agency’s data collection and analysis was ongoing: One CDC official told a congressional committee in March that if the agency had modern data infrastructure, it would have detected the coronavirus “much, much sooner” and would have contained it “further and more effectively.”
By April, with coronavirus cases spiking in the U.S. and officials scrambling to wrangle information about the pandemic, the CDC had a proof-of-concept for a new system to pull together all of its various data streams. But it was having trouble figuring out how to securely add users outside the agency, as well as get the funding and political backing needed to expand it, according to two sources with close knowledge of the situation.
So the CDC turned to outsiders for help. Information technology experts at the federal Department of Health and Human Services took control of the project. Five days later, they had a working platform, dubbed HHS Protect, with the ability to combine, search and map scores of datasets on deaths, symptoms, tests, ventilators, masks, local ordinances and more.
The new, multimillion-dollar data warehouse has continued to grow since then; it holds more than 200 datasets containing billions of pieces of information from both public and private sources. And now, aided by artificial intelligence, it is shaping the way the federal government addresses the pandemic, even as it remains a source of contention between quarreling health agencies and a target for transparency advocates who say it’s too secretive.
The Center for Public Integrity is the first to reveal details about how the platform came to be and how it is now being used. Among other things, it helps the White House and federal agencies distribute scarce treatment drugs and supplies, line up patients for vaccine clinical trials, and dole out advice to state and local leaders. Federal officials are starting to use a $20 million artificial intelligence system to mine the mountain of data the platform contains.
People familiar with HHS Protect say it could be the largest advance in public health surveillance in the United States in decades. But until now it has been mostly known as a key example of President Trump’s willingness to sideline CDC scientists: In July, his administration suddenly required hospitals to send information on bed occupancy to the new system instead of the CDC.
The Trump administration has added to the anxiety surrounding HHS Protect by keeping it wrapped in secrecy, refusing to publicly share many of the insights it generates.
“I want to be optimistic that everything that is happening here is actually a net improvement,” said Nick Hart, CEO of the Data Coalition, a nonprofit that advocates for open government data. “The onus is really on HHS to explain what’s happening and be as transparent as possible… It’s difficult to assess whether it really is headed in the right direction.”
A long history of data frustration
To hear some tell it, the reason behind the CDC’s long struggle to upgrade its data systems can be found in its name: the Centers — plural — for Disease Control and Prevention. Twelve centers, to be exact, and a jumble of other offices, each with its own expertise and limited funding: the National Center for Immunization and Respiratory Diseases, for example, or the Center for Preparedness and Response. Scientists at each myopically focus on their own needs and strain to work together on expensive projects that would benefit all, such as upgrading shared data systems, experts familiar with the CDC said. A 2019 report from the Council of State and Territorial Epidemiologists found that the agency had more than 100 stand-alone, disease-specific tracking systems, few of them able to talk to each other, let alone incorporate outside data that could help responders stanch outbreaks.
“CDC has been doing things a certain way for decades,” said a person familiar with the creation of HHS Protect who was not authorized to speak on the record. “Sometimes epidemiologists are not technologists.”
The U.S. government knew for more than a decade it needed a comprehensive system to collect, analyze and share data in real time if a pandemic reached America’s shores. The 2006 Pandemic and All-Hazards Preparedness Act directed federal health officials to build such a system; in 2010 the Government Accountability Office found that they hadn’t. A 2013 version of the law required the same thing; in 2017 the GAO found again that it hadn’t happened. Congress passed another law in 2019 calling for the system yet again. In 2020 the coronavirus struck.
“We’ve had no shortage of events that have demonstrated the importance of bringing together both healthcare and public health information in a usable, deeply accessible platform,” said Dr. Dan Hanfling, a vice president at In-Q-Tel, a nonprofit with ties to the CIA that invests in technology helpful to the government. “We’ve missed the mark.”
In fighting a pandemic, the nation struggles with data at every turn: from collecting information about what’s happening on the ground, to analyzing it, to sharing it, to sending information back to the front lines. The CDC still relies on underfunded state health departments using antiquated equipment — even fax machines — to gather some types of information. The agency for years has also had ongoing, formal efforts to upgrade its data processes.
“There’ve been a lot of false starts in this area,” said Dr. Tom Frieden, the head of the CDC during the Obama administration. Frieden blamed money already spent on existing systems and local governments unwilling to make changes, among other reasons. “We had decades of underinvestment in public health at the national, state and local levels, and that includes information systems.”
The CDC attempted to fix at least some of those problems — joining, analyzing and sharing data from disparate sources — with the system it built during Ebola, known as DCIPHER. The system saved the agency thousands of hours of staff time as it responded to a salmonella outbreak and lung injuries from vaping. But it couldn’t keep up with the coronavirus. It was stored on CDC servers instead of the cloud and couldn’t handle the flood of extra data and users needed to fight COVID-19, according to two sources with knowledge of the situation.
So CDC officials handed the proof-of-concept for a new system to the chief information officer of HHS, Jose Arrieta. The CDC was having trouble figuring out how to approve and ensure the identities of new users from outside the agency, such as the White House Coronavirus Task Force, and give them appropriate permissions to view data, according to two sources with close knowledge of the situation. Arrieta and his team solved the technical problems, stitching together eight pieces of commercial software to build the platform and pulling in data from both private and public sources, including the CDC.
“Our goal was to create the best view of what’s occurring in the United States as it relates to COVID-19,” said Arrieta, a career civil servant who has worked for both Republicans and Democrats, speaking for the first time since his sudden departure from HHS in August. He said, and a friend confirmed, that he left his job primarily to spend more time with his young children after months of round-the-clock work. “It changes public health forever.”
HHS Protect now helps federal agencies distribute testing supplies and the scarce COVID-19 treatment drug remdesivir, identify coronavirus patients for vaccine clinical trials, write secret White House Coronavirus Task Force reports sent to governors, determine how often nursing homes must test their staffs for infection, inform the outbreak warnings White House adviser Dr. Deborah Birx has been issuing to cities in private phone calls — and more.
The system allows users to analyze, visualize and map information so they can, for example, see how weakening local health ordinances could affect restaurant spending and coronavirus deaths in mid-size cities across America. Among the eight pieces of commercial software in the platform is one purchased via sole-source contracts worth $24.9 million from Palantir Technologies, a controversial company known for its work with U.S. intelligence agencies and founded by Trump donor Peter Thiel. The CDC used the Palantir software for both the HHS Protect prototype and DCIPHER, and it works well, Arrieta said; contracting documents cited the coronavirus emergency to justify the quick purchase.
And now a new artificial intelligence component of the platform, called HHS Vision, will help predict how particular interventions, such as distributing extra masks in nursing homes, could stanch local outbreaks. Arrieta said HHS Vision, which is not run with Palantir software, uses pre-written algorithms to simulate behaviors and forecast possible outcomes using what experts call “supervised machine learning.”
Though many of the datasets in HHS Protect are public, a scientist who wanted to use them would have to hunt them down across many agencies, clean them and link them to one another. That work is already done in HHS Protect.
“It is a big leap forward,” said Dr. Wilbert van Panhuis, an epidemiologist at the University of Pittsburgh who is working to get access to the platform for a group of 600 researchers. “They are making major progress in this pandemic.”
But the new system became a source of controversy this summer when officials told hospitals to stop reporting information on beds and patients to a well-known and revered CDC system, the National Healthcare Safety Network, and instead send it to Teletracking, a private contractor connected to HHS Protect. Observers feared the move undermined science and was another example of political interference with the CDC’s work. In August, hospital bed data from Teletracking sometimes diverged wildly from what states were reporting, though now it aligns more closely, said Jessica Malaty Rivera, science communication lead for the Covid Tracking Project, a volunteer organization compiling pandemic data.
“If there’s one major lesson we have from emergencies in the last 20 years… it’s not to try to create a new system but take the most robust system you have and scale it,” Frieden said. “The way to make Americans safer is to build on, not bypass, our public health system.”
Some familiar with the switch from the CDC to Teletracking said it allowed the federal government to compile more data on more hospitals. It happened, they said, because the White House task force members asked for more hospital information to prepare for the winter. Teletracking was able to start collecting extra data from hospitals in a matter of days, while the CDC said it would take weeks to make those changes.
A CDC official familiar with the situation disputed those claims, saying that the National Healthcare Safety Network provided excellent data without overburdening already-stressed hospitals. Making the switch to HHS Protect, he said, is “like taking a veteran team off the field to replace that team with rookies. You get a lot of rookie mistakes.”
The hospital data dust-up aside, some CDC officials remain skeptical of HHS Protect.
“It is a platform. It isn’t a panacea,” said a CDC official familiar with the system who didn’t want his name published because he wasn’t authorized to speak to the media. Some of the outside data sources HHS Protect depends on — including the hospital data from Teletracking — aren’t reliable, the official said, sometimes showing, for example, that a hospital had a negative number of patients in beds. “We’re seeing enough of it to warrant overall big-time concerns about the hospital data quality.”
Some are also concerned about the system’s ability to guard patient privacy: More than a dozen lawmakers sent a letter to HHS Secretary Alex Azar in July questioning how HHS Protect would protect individuals’ privacy.
But officials say HHS Protect contains no personal information on patients or others. It tracks users’ every interaction with the data and blocks them from datasets they don’t have authority to see, allowing the federal government to guard privacy and prevent data manipulation, sources familiar with the system said.
The Trump administration adopted data principles in 2018 that include promoting “transparency… to engender public trust.” But much of the data in HHS Protect remains off limits to the public, glimpsed only in leaked reports and occasional mentions by White House task force members. The platform’s public web portal displays the hospital bed data that caused so much controversy this summer but little else. Observers of all stripes, from Frieden to the conservative Heritage Foundation, have called for the Trump administration to make more of its data public.
Van Panhuis said HHS Protect clearly was designed with federal government users in mind, not academic researchers or the public.
“It’s a bit disappointing,” he said. “Currently we have to invent that part of the system.”
Basic data about the pandemic contained in HHS Protect remains secret and is sometimes obscured even from local public health officials. The White House task force’s secret recommendations to governors use HHS Protect data on cities’ test positivity rates, but the White House does not release those reports. And that national dataset is still nowhere to be found on any federal website. When asked, an HHS spokesperson could not point to it.
Some secrecy surrounding HHS Protect data exists for good reason, officials said: Some private companies share their data with HHS on the condition that it will be used to respond to the public health crisis and not be revealed to competitors. And releasing some of the data, even though they contain no personal information, could trigger privacy concerns, forcing officials to redact some of it. For example, it might become obvious whose symptoms were being described in data from a small, rural county with one hospital and one coronavirus patient.
But the secrecy around HHS Protect frustrates transparency advocates who want government data to be shared more openly.
Ryan Panchadsaram, who helps run the coronavirus data website Covid Exit Strategy, would like HHS Protect to publish in one location information on cases, test results and other metrics, for every city and county in the U.S., in an easily accessible and downloadable format.
“Making it available to the public shouldn’t be that difficult,” he said. “It’s a political and policy decision.”
People looking for county-level information — to make decisions about whether to visit grandparents, for example — are often out of luck. And if they want a one-stop shop for state-level data, they must turn to private sources: Panchadsaram said that even employees of state and federal agencies visit Covid Exit Strategy for information on the coronavirus. The state of Massachusetts uses his site’s data to decide which travelers must quarantine when they arrive.
“It is shocking that they come to us when the data is sitting in its purest form” in HHS Protect, he said.
Federal officials, attempting to deliver on at least some transparency promises, say they are working to set up congressional staffers with logins to HHS Protect. Staffers monitoring the pandemic say they have yet to be granted access, though some states are using the system.
The secrecy surrounding HHS Protect also means that outsiders can’t evaluate whether the platform is living up to its promise. Despite repeated requests from Public Integrity, HHS and CDC spokespeople did not make any officials available for on-the-record interviews regarding HHS Protect.
“The federal government has an obligation to make as much data and information public as possible,” said Hart, of the Data Coalition. “HHS should consider ways to improve the information it’s providing to the American people.”
Zachary Fryer-Biggs contributed to this report.