How America’s Health Data Infrastructure is Being Used to Fight COVID-19

Introduction: Background on Real-World Data

While the COVID-19 pandemic has resulted in enormous suffering and cost, it also has been a catalyst for changes that healthcare industry veterans, innovators, and patients have spent decades advocating for, and which are now happening in a matter of months.

  • Which drugs are being prescribed off-label to COVID-19 patients, and which ones correlate with reduced hospitalization, ventilator use, and/or mortality?
  • How do other health conditions impact the likelihood of mortality from COVID-19?
  • How does COVID-19 status impact the progression of other diseases?
  • What pre-existing prescription drug usage is correlated with protection against hospitalization, ventilator use, and mortality?
  • How do socioeconomic factors (race, gender, urban vs. rural, income level, etc.) impact risk of hospitalization and mortality?
  • Does the presence of a caregiver in the home make a person more or less likely to be tested, be hospitalized, and/or have a bad outcome?
  • What subset of patients is most likely to end up in the ICU?
  • Do veterans who received community care versus VA care differ in risk of mortality?
  • Does living in public housing put patients at higher risk of being infected with COVID-19 and/or dying?
  • What pre-existing prescription drug usage is correlated with protection against contracting COVID-19?

Part 1: Major Real-World Data Types

As we’ve written about in the past, data in the US is distributed across a complex ecosystem of thousands of institutions, captured in this diagram:

  • CDC’s National Death Index (NDI) is the premier data set both in its coverage of deaths in the United States and in its depth, since it includes the cause of death. Cause of death is important for precisely measuring mortality related to a specific condition: without it, deaths from accidents and other unrelated causes could be wrongly attributed to a therapeutic, a medical procedure, or disease progression. Unfortunately, the NDI has only been available to academic researchers to date.
  • Social Security Administration’s Limited Access Death Master File (LADMF) is built from deaths reported by states for the purpose of making accurate Social Security payments, but many states have opted out of such reporting over the last decade. While the LADMF records mortality events going back more than fifty years, its coverage in recent years has fallen to just 15–20% of all U.S. deaths. Further, access to the weekly-updated LADMF data set is limited to select approved use cases.
  • Obituary data is publicly available through website and newspaper postings, and can be obtained by scraping internet sites or licensing it directly from obituary-posting service providers. Aggregated obituary data offer coverage of >70% of mortality events in a timely fashion, but do not contain a cause of death and likely under-represent the poor.
  • Life insurance providers record mortality events when a claim is filed, and often a cause of death is recorded in their data. However, the life insurance industry is highly fragmented, making it difficult to assemble a comprehensive set of data. This source of mortality highly under-represents the poor and the young.
  • Solo and small physician practices: To claim the incentive payments offered under the HITECH Act for adopting EHR systems, many small practices chose free EHR software from vendors like Practice Fusion. These data can often be accessed directly from these vendors, with good standardization across the data set since little to no software customization is provided.
  • Medium-to-large ambulatory physician practices: Larger practices are able to afford more advanced EHR software, but often opt to not customize the installation, leading to strong standardization across providers as well. These data can likewise be accessed directly from an EHR vendor (e.g., Allscripts), but may also be accessible through provider quality benchmarking and cost-optimization services (e.g., HealthJump or Health Catalyst) who process the EHR data of these providers to measure operational efficiency.
  • In-patient practices and hospitals: Academic medical centers and large health systems serve a diverse and complicated set of patients, and therefore install highly customized EHR systems. While there are some suppliers (e.g., Epic) who specialize in these installations, these vendors often do not have rights to the data. Therefore, each provider must make their data available to researchers. Some networks like PCORnet have been created to make data accessible across this sector for approved research efforts.
  • Long term care: Skilled nursing facilities and other long-term acute care facilities care for an especially vulnerable patient population, and are served by specialized EHR vendors like PointClickCare.
  • Government care: Active duty military personnel and veterans may receive care at VA clinics or in the community setting. The VA record set is therefore an important data set for studying this population, as well as for studying diseases that are prevalent among veterans like PTSD and traumatic brain injury.
  • Specialists: Some specialties have specific data recording requirements, and there are EHR systems customized to serve them. These systems frequently capture data necessary for understanding diseases tied to those specialties, such as tumor measurements for oncology patients. For example, Flatiron provides a system specifically designed for oncology practices, Nextech is optimized for ophthalmology, and TherapyNotes is focused on mental health practices. Some of these systems can be expensive, and therefore may not be used in solo and small specialty practices.
  • Revenue cycle management (RCM): RCM software is used at provider facilities to help practices submit claims to insurers, bill patients, and track payments from both. There are numerous vendors offering RCM solutions, including companies such as Ability Networks, Waystar, and Office Ally. Claims gathered from these sources are timely, but geographic coverage can vary depending on the practices using the service. For example, Office Ally has national coverage, but is especially strong on the West Coast. Firms providing operations analytics to providers may also have data use rights to the medical claims sent to them as part of these analyses, and can be an access point as well.
  • Medical claims clearinghouses: Once a claim is generated, it is routed by special software to the correct payer for payment. This routing service is called a claims clearinghouse or a “switch” (note that some RCM services are integrated with claims clearinghouse services). There are only a few major claims clearinghouses in the United States, meaning that accessing data from any one of them gives a researcher data about an enormous portion of the population. Data is typically available in just 1–2 days post submission, making it one of the most timely data sources for measuring rapidly developing events like the spread of the COVID-19 pandemic.
  • Private payers: Payers are the final destination for a medical claim, and a great source for seeing all of the medical activity for a patient regardless of the switch the claim was processed through. For a longitudinal analysis of a patient across their entire journey, payer-based claims are often the optimal source to use. However, because Americans change insurance every 2–3 years on average, the longitudinal analysis that can be done on a single payer data set is often limited. Unfortunately, payers can be slow to process their claims, and some claims are still filed outside of the digitized process described above. These issues mean that payer-based claims can have a lag time of up to 3 months and are sub-optimal for measuring rapidly developing public health situations. Private payer data can be difficult to access as most payers do not make it commercially available, but there are several large data sets available through Optum (based on United Healthcare’s subscriber base) and IBM (through their MarketScan data, gathered from risk-bearing employers). When evaluating payer data sets it is also important to note that some payers receive claims directly from providers (not through a clearinghouse), the most important of which are the “Blues” (Blue Shield and Blue Cross plans).
  • Federal government payers: More than a third of Americans receive their insurance coverage through the government (Medicare for the elderly and disabled, Medicaid for low-income households, and TRICARE for the military). Some Medicare data is available through the Medicare Advantage programs administered by private insurers (see above). Medicare data can be accessed through CMS’s qualified entity (QE) program with some important restrictions.
  • State government payers: Medicaid services are often administered by private insurers in each state, and therefore these data may sometimes be found in claims accessed from private payers. There are also some RCM vendors with a large installation base in Medicaid providers, such as Ability Networks. There are also special programs like 340B that provide payments for disadvantaged patients. 340B data gives insight into the diagnosis and treatment of underserved patients typically receiving care at free health clinics. This data is particularly useful for understanding access to care for disadvantaged populations.
  • Retail pharmacy chains: Retail pharmacies like CVS and Walgreens generate a multitude of pharmacy claims for the drugs they dispense (liquids, pills, etc.), but most have entered into contracts with pharmacy claims aggregators and this data is hard to come by elsewhere. That being said, some of the same information may be gathered by operations software installed at the pharmacy, such as McKesson PTS’s pharmacy services.
  • Specialty pharmacies: Many of the new drugs launched in the United States are specialty products, which are injected or infused. These products require additional patient support, and are therefore dispensed by specialty pharmacies who can provide those services. However, because pharmaceutical companies contract with a subset of specialty pharmacies to dispense each brand, those companies often own the data rights to this dispensing information. To access it often requires having the pharmaceutical company contract with the specialty pharmacy to deliver the data.
  • Hospital pharmacies: Hospital pharmacies dispense drugs for inpatient treatment, but these data are hard to access as hospitals do not tend to grant data use rights to their vendors (similar to their stance on their EHR data).
  • Pharmacy claims clearinghouses: Like medical claims, pharmacy claims are electronically routed to the proper payer for payment. Much of this data is also exclusively licensed by pharmacy aggregators and not available except through those sources.
  • Payers: Similar to medical claims (discussed above), payers are great sources for seeing all of the prescriptions a patient has filled, but there is a substantial time lag and patients are again lost once they change insurers (every 2–3 years on average).
  • Pharmacy benefit managers (PBMs): PBMs act on behalf of payers to determine which drugs are covered and to administer their pharmacy coverage, and may have access to pharmacy claims to measure the impact of their programs. However, it is rare that PBMs make this data available without the payer’s permission.
  • Government (local, state, and federal) agencies provide critical support programs for vulnerable populations, and these data can be used to understand critical social determinants of health such as food and housing security. Data from agencies like Housing and Urban Development, from programs like SNAP (food stamps), and similar data can be hard to access outside of research settings.
  • Loyalty card programs from grocery stores and other retailers collect purchasing behavior from large groups of Americans, and can give insight into the amount of alcohol or tobacco use, diet, and use of over-the-counter medications.
  • Financial demographics data is collected from financial institutions and credit ratings bureaus, and can be used to segment patients by their ability to pay and to understand the correlation of income to healthcare access and outcomes.
  • Household demographics data is collected from a variety of sources, and includes important information about the living environment of a patient such as whether they live alone or with a caregiver, in a high crime area, have high exposure to allergens or toxins, etc.
  • Patient surveys are regularly used to collect information about patients’ attitudes, behaviors, and needs. These include custom-designed data collection instruments, as well as standardized questionnaires like Kantar’s National Health and Wellness Survey.
  • (And many more…)
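
The timeliness trade-off between clearinghouse claims and payer claims described above can be made concrete with a little date arithmetic. The sketch below is only an illustration: the source labels, the function names, and the use of 90 days as a stand-in for "up to 3 months" are assumptions made here, not part of any vendor's specification.

```python
from datetime import date, timedelta
from typing import List

# Approximate worst-case reporting lags taken from the list above.
# The labels and the 90-day stand-in for "up to 3 months" are assumptions.
REPORTING_LAG_DAYS = {
    "claims_clearinghouse": 2,   # "1-2 days post submission"
    "private_payer": 90,         # "up to 3 months"
}

def latest_complete_date(source: str, as_of: date) -> date:
    """Most recent date for which `source` should be largely complete."""
    return as_of - timedelta(days=REPORTING_LAG_DAYS[source])

def sources_covering(event_date: date, as_of: date) -> List[str]:
    """Sources whose data should already include `event_date`."""
    return sorted(
        name for name in REPORTING_LAG_DAYS
        if latest_complete_date(name, as_of) >= event_date
    )

if __name__ == "__main__":
    today = date(2020, 6, 1)
    # An event two weeks old is visible only in clearinghouse data.
    print(sources_covering(date(2020, 5, 15), today))
    # An event four months old should be visible in both sources.
    print(sources_covering(date(2020, 2, 1), today))
```

Under these assumptions, a researcher tracking events from the last few weeks would have to rely on clearinghouse claims, while payer claims become usable only for events roughly three months old or older.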


