An egg may look ordinary on the outside, but inside it could be a chicken, a beautiful white swan, or perhaps a rare black swan. Diseases are not so different. A person may have a cough or a sore throat but, as we learned in 2020, not all colds are alike and it may not be obvious from the outside which virus is causing the symptoms. COVID-19 has been far from ordinary—killing millions of people in two short years—and its emergence seemed to catch many completely by surprise.
Mathematician and scholar Nassim Nicholas Taleb coined the term “black swan theory” to explain events that are unpredicted and have massive consequences, but make sense in hindsight. The name is a nod to the 17th-century discovery of black-feathered swans, which until then were thought not to exist simply because no one had yet encountered one. COVID-19 has definitely had massive consequences and seemed to catch society off guard; as a result, some people initially gravitated to the metaphor, suggesting the pandemic could be a black swan.
However, some scientists were, in fact, not surprised in 2020. They had been anticipating that a respiratory virus would soon cause another pandemic; it was just a matter of not knowing exactly which one, or when. Taleb himself agrees that COVID-19 is no black swan. But this raises the question: if pandemics are expected, how can we predict when the next one will emerge? Can science help us “see inside the egg”?
Identifying which pathogens have the potential to cause a pandemic is not easy. Scientists can study how a disease spreads or look at its genetic sequences, but it takes a lot of analysis to deeply understand the connections between these data—the clues are not obvious black feathers. Scientists at Los Alamos, alongside thousands of peers worldwide, are contemplating all approaches to predicting pandemic potential: Are there indicators in the way a disease spreads? Can we build faster methods for finding pathogens in the environment or better modeling tools to forecast their spread more accurately? And once pathogens are identified, can their genomes tell us which ones are the most dangerous?
Using long-standing capabilities and unique computational tools, scientists at Los Alamos are tackling each of these important questions, one by one.
Pathogens are everywhere
Not all infectious diseases cause pandemics. Occasionally a disease that is endemic, or established, in one geographic location shows up in a different location where the population has less immunity, causing a local outbreak of a “new” disease. Influenza, on the other hand, is a relatively manageable, seasonal epidemic in many countries, but the virus mutates constantly and occasionally causes larger problems. In 1918, a new strain of flu emerged for which humans did not have much immunity. This novel strain spread at a rapid pace, causing a pandemic that killed approximately 50 million people. Similarly, SARS-CoV-2, the virus that causes COVID-19, caused a pandemic in part because it was a coronavirus to which humans had no prior exposure.
Epidemiology is the study of how a disease spreads through a population and is based on disease behavior and pathogen identity. For hundreds of years scientists have gathered data on disease spread at varying levels of detail—from John Snow mapping the location of cholera cases in London in the 1800s, to the daily COVID-19 case collection that has happened worldwide for the last two years. The identity of a pathogen used to be determined by symptoms or visual identification of cells under a microscope, but now it can be done by determining the pathogen's genetic sequence and comparing it to those of other known pathogens.
Both epidemiology and genomics rely on information about where people are getting sick. Public health systems track disease outbreaks by monitoring when people go to the doctor and collecting the results of their diagnostic tests—for the flu or strep throat for example. But many mild infections, like colds, don't warrant a doctor's visit so they aren't documented and those diseases spread freely. Emergent pathogens, such as SARS-CoV-2, might not get detected until enough people are hospitalized to prompt scientists to look for the culprit. As evidenced during the COVID-19 pandemic, case counts are most accurate in places with widespread testing and thorough documentation.
Detecting disease quickly is key to avoiding a potential pandemic, but it is also necessary to understand the magnitude of a threat in order to appropriately gauge the response. Ebola, for instance, is frightening and deadly but not as transmissible as respiratory diseases. Furthermore, many new variants of SARS-CoV-2 have emerged over the last two years, but only some of them—Delta, Omicron—have become major players in the pandemic. So, while global disease surveillance is critical for discovering new diseases quickly, it is also important to recognize which of them are genuinely cause for alarm.
Scientists at Los Alamos are studying both disease behavior and pathogen identity—as well as their interconnectedness—to understand pandemic potential. Through multiple projects, Lab scientists are assessing epidemiological and genomic clues as well as developing models and detection schemes to help predict whether or not one pathogen poses a greater threat than others.
Nature or nurture?
One important step in predicting pandemics is understanding the factors that contribute to disease spread. If a newly mutated coronavirus infects a person who already has antibodies that can recognize and neutralize it, the pathogen might not spread any further. Or if a novel pathogen infects a person who is isolated on a desert island and kills them before they interact with another person, the disease stops there. Pathogen fitness, host susceptibility, and environmental conditions are three major contributors to disease spread—but are all factors equally important or does one matter more than the other two?
Los Alamos biologist Alina Deshpande and her team tackled this question using a visual analytics tool they developed in 2012 called AIDO (Analytics for Investigation of Disease Outbreaks). The tool includes a database of detailed epidemiological information about more than 600 outbreaks of 32 distinct infectious diseases—measles, cholera, Ebola, etc. The scientists designed AIDO to help researchers understand and respond to new disease outbreaks by comparing them to historical ones.
In December 2021, Deshpande and colleagues Nileena Velappan and Katie Davis-Anderson used AIDO to search for “potential black swan outbreaks.” The team hypothesized that there might be common features among exceptionally large outbreaks that, if identified, could serve as warning signs of future pandemics. They deliberately excluded SARS-CoV-2 data from their analysis because they wanted to know what could be gleaned from the past that might have foretold the present.
First, the team defined a potential black swan outbreak (PBSO) as an outlier with more than ten times as many cases as other outbreaks of the same disease. “We were interested to find that there was an outlier event in almost every disease in our database,” explains Velappan. Next, the team identified differentiating factors for each event and classified them under three categories: pathogen, host, or environment (including manmade infrastructure and behavioral factors).
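The ten-times rule can be illustrated with a few lines of Python. This is only a minimal sketch, not the AIDO implementation: it assumes the comparison is against the mean case count of the other outbreaks of the same disease, and the function name `flag_pbso` and the case counts are invented for illustration.

```python
from statistics import mean

def flag_pbso(case_counts, factor=10):
    """Return indices of outbreaks with more than `factor` times the
    mean case count of the other outbreaks in the list.

    Assumption: "other outbreaks" is summarized by their mean."""
    flagged = []
    for i, cases in enumerate(case_counts):
        # Compare each outbreak against all the remaining ones.
        others = case_counts[:i] + case_counts[i + 1:]
        if others and cases > factor * mean(others):
            flagged.append(i)
    return flagged

# Invented case counts for four outbreaks of one disease (not AIDO data).
outbreaks = [120, 300, 200, 70000]
pbso_indices = flag_pbso(outbreaks)
print(pbso_indices)  # [3]
```

Leaving the candidate outbreak out of its own baseline matters here: one enormous outbreak would otherwise inflate the mean and make itself look less exceptional.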
One example from their study was an outbreak of mumps, a vaccine-preventable disease. AIDO identified a PBSO in the United Kingdom in 2003, where mumps caused over 70,000 cases compared to a mean of 207 cases for other mumps outbreaks. The Los Alamos team attributed this PBSO to two factors: one was host susceptibility, because many afflicted children were too young to be vaccinated, and the second was vaccine hesitancy, a manmade behavioral factor.
“What surprised us the most was that the single common factor to all the potential pandemics was the behavioral aspect,” says Deshpande. “PBSOs were not happening just because it was the first instance of a new disease, they were consistently linked to human behavior.” She went on to describe another example: in 1994, plague was identified in the city of Surat, India. News of its presence caused a quarter of Surat’s residents to panic and flee the city, many by public transportation, ultimately spreading the disease to other parts of India. Had they instead locked down and isolated the sick for treatment, it might have ended differently.
“This aligns with what we see in the COVID-19 pandemic where human behavior, such as the timing and extent of lockdowns or mask wearing, has been a major factor in disease spread,” says Deshpande. “And unfortunately, predicting human behavior could be the most difficult of all.”
Slurry for surveillance
Although human behavior or other environmental conditions might ultimately halt or accelerate disease spread, a novel pathogen and a susceptible population are still a dangerous combination. Scientists identify novel pathogens by sequencing their genomes and comparing them to closely related pathogens. When the sequence of SARS-CoV-2 was first made public in early 2020, scientists quickly recognized it was related to the virus that caused SARS in 2003. Diagnostic tests were developed based on the new sequence, and subsequent sequencing from COVID-positive patients around the world has helped scientists keep track of how the viral genome is evolving as the virus is spreading.
This practice of sequencing samples from COVID-19 patients is immensely helpful in identifying new variants of SARS-CoV-2, but it relies on patients’ active participation: they have to feel sick and get tested so that the scientists have a supply of positive samples. And testing and sequencing must be available in the area. To better track COVID-19, a more passive approach to surveillance has gained momentum during the pandemic: looking for viral genome sequences in the sewers.
“We can generally find a spike of SARS-CoV-2 sequences in wastewater a week before it begins to spike in the clinical testing data,” says Julia Kelliher, a Los Alamos biologist who is part of a team that has been field testing the surveillance of wastewater. In 2021, the Lab participated in a nationwide effort with the Centers for Disease Control and Prevention to analyze wastewater by providing samples from the Los Alamos campus. Based on the study’s success, the Lab is now embarking on its own project to conduct in-house sequencing and analysis of wastewater to both improve the practice and to screen the Lab community at the same time.
Screening wastewater for SARS-CoV-2 sequences is a little tricky, as the virus’s genetic material degrades, meaning fewer pieces are available to find and they are mixed together with a lot of other biological material. To make this type of surveillance work, scientists develop short, targeted pieces of DNA, called primers, that match specific known gene sequences from SARS-CoV-2. If a primer’s matched sequence is present, the primer binds to it, allowing that particular sequence to be identified and analyzed. Careful analysis can help scientists develop multiple primers for a pathogen, which increases the chance that they’ll find a piece that is a match.
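The idea of a primer finding its match among a jumble of fragments can be mimicked with simple string matching. The sketch below is a loose analogy rather than a model of real primer binding: it assumes exact matches only, treats all sequences as DNA letters, and uses invented toy sequences and hypothetical function names.

```python
# Translation table for DNA base pairing: A pairs with T, C pairs with G.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the sequence of the opposite DNA strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def primer_hits(primer, fragments):
    """Return indices of fragments containing the primer sequence
    or its reverse complement (exact matches only)."""
    targets = (primer, reverse_complement(primer))
    return [i for i, frag in enumerate(fragments)
            if any(t in frag for t in targets)]

# Invented toy fragments standing in for degraded genetic material.
fragments = ["CCGATTACAGG", "TTTGTAATCAA", "CCCCCC"]
hits = primer_hits("GATTACA", fragments)
print(hits)  # [0, 1]
```

Running several primers over the same fragments, as the paragraph above describes, would simply mean calling `primer_hits` once per primer and pooling the results.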
“We are also using Los Alamos bioinformatics tools to create primers that can help us look for multiple pathogens at once,” says Kelliher. “If we find something of interest, we can study it carefully. We really want to pick up rare things and not just organisms we already know about, and to do that we need to create many different primers and unique methods of meta-analysis.”
Finding pathogens we don’t already know about is key to predicting the next pandemic. For years, virus hunters have searched for new threats by sampling genomic sequences from viruses found in animal populations, focusing especially on those wild or agricultural animals that are most likely to come in contact with humans and cause spillover events. Human proximity to animals also increases when animal habitats are destroyed by development, or climate changes force animals to seek new homes. Virus hunters continue to play a key role in surveillance, and passive wastewater testing could be a major boost to the efficiency of their search for new threats.
But the question remains: how do they know a new threat when they’ve found it?
A chameleon among us
To date, more than ten million SARS-CoV-2 genomes (and counting) have been sequenced and cataloged for the scientific community to analyze. In addition, a mountain of data is available on the behavior of COVID-19 in humans, as case numbers and death counts are collected daily from all corners of the globe. Putting aside the gruesome fact that people are dying, this unprecedented amount of data—more than is available for any other pathogen—presents a unique opportunity for scientists to study how diseases evolve and spread, which could ultimately help with prediction.
Los Alamos computational biologists Ethan Romero-Severson, Emma Goldberg, and their colleagues have been using both genomic and epidemiological SARS-CoV-2 data to develop computer models that quantify the advantage one virus variant may have over another based solely on case data. Without diving into the nitty-gritty details of the variants’ genomes, their team developed three different modeling approaches to quickly evaluate the timing of when variants arise and which variants might be a significant threat.
“Once a new variant was first observed anywhere, we wanted to measure how long it took to become dominant in each country,” says Romero-Severson. “If we can identify a pattern of spread consistent with epidemiological theory, then we can accurately determine the risk posed by new variants before they become globally dominant.”
The team created efficient models that did not require a lot of computing power but could rapidly assess the risk of a new variant based on how quickly its cases are rising in a specific area compared to those of other variants. Using data on the emergence of the first major variant of SARS-CoV-2, “D614G,” the team’s models confirmed that D614G had a selective advantage over the other variants that were circulating at the time. Goldberg explains that when there is an increase in the prevalence of a new variant, it is important to determine whether that variant actually has a selective advantage, as opposed to alternative explanations such as a chance arrival in a community experiencing an outbreak for other reasons. This distinction helps focus the scientific community’s attention on the most important “variants-of-interest.”
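One textbook way to turn “how quickly cases of one variant are rising compared to other variants” into a number is to track the ratio of the two case counts over time. If a variant has a constant growth advantage, the logarithm of that ratio rises in a straight line, and the slope of the line is the advantage. The sketch below illustrates that general idea only; it is not the Los Alamos team’s models, and the function name and case counts are invented.

```python
import math

def selection_rate(days, variant_cases, other_cases):
    """Least-squares slope of log(variant / other) against time.

    A positive slope means the variant's share of cases is growing.
    Assumes all case counts are positive."""
    y = [math.log(v / o) for v, o in zip(variant_cases, other_cases)]
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(y) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (yi - mean_y) for x, yi in zip(days, y))
    return sxy / sxx

# Synthetic weekly counts: the variant's ratio grows by exp(0.1) per day.
days = [0, 7, 14, 21]
other_cases = [1000, 1000, 1000, 1000]
variant_cases = [10 * math.exp(0.1 * d) for d in days]
rate = selection_rate(days, variant_cases, other_cases)
print(round(rate, 3))  # approximately 0.1
```

A single rising curve cannot rule out the chance-arrival explanation Goldberg describes, which is one reason to compare the slope across many regions before drawing conclusions.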
Once a variant-of-interest is identified, scientists use many types of epidemiological models to forecast its spread over the span of weeks or months to help predict if it will be a worldwide issue. Another team of scientists at Los Alamos is now developing models that take forecasting to the next level by incorporating specific details about a community—population data and vaccination rates—along with variant information to create even more sophisticated analyses.
Sara Del Valle, Los Alamos mathematical epidemiologist, helps lead a new project that focuses on merging evolutionary tracking of SARS-CoV-2 with epidemiological forecasting. She explains that when a new variant arrives in a community, it can take a while for traditional models (which are reliant on case data) to reflect differences in the new variant’s spread. And because time is of the essence, the sooner scientists can forecast potential impacts, the more time decision makers can have to implement policies aimed at reducing economic and societal impacts.
“We want to have more accurate and timely forecasts, so we are developing new models that include data from other countries with shared characteristics,” says Del Valle. “For instance, when I was forecasting the spread of Omicron BA.2 in New Mexico I might have looked at Slovenia data because those two places had similar sized BA.1 peaks, have similar vaccination rates and had the BA.2 variant introduced when the BA.1 cases had already dropped.”
For two years, modelers have been scrambling to assess and predict the spread of SARS-CoV-2 while it has been spreading and evolving. What they’ve learned from this unprecedented challenge, however, is invaluable and is laying the groundwork for the highly sophisticated models of the future.
Lost in translation
Assessing which variants are spreading fastest or causing the most severe disease is one way of predicting the pandemic potential of a virus. However, scientists also want to answer deeper questions about which exact changes give a selective advantage to some variants over others so that when a new virus is discovered, scientists might better evaluate risk. The abundance of available SARS-CoV-2 genomic data is making it possible for scientists to dive into the fundamental differences among variants.
For instance, the world quickly learned that spike proteins on the surface of the SARS-CoV-2 virion are responsible for the virus attaching to human cells. Later, we learned that the Delta and Omicron variants had amassed significant mutations in the part of the genome that determines the shape of the spike protein. These genetic changes resulted in a spike shape that increases the virus’s ability to infect cells, and, being more “fit” to travel from person to person, Delta and Omicron variants took over while other variants died out.
This level of detail about how changes in genes (genotype) equate to changes in function or behavior (phenotype) is, generally speaking, one of the most sought-after prizes in modern biology. The fact is: scientists can easily obtain the sequence of an organism’s genome, but it takes a lot of observational and experimental data to fully understand what it means. It is still incredibly difficult for scientists to interpret the genome sequence of a never-before-seen virus and determine if the virus will be transmitted in the air, what its host animal is, or if it will cause an international crisis.
Taking advantage of the current influx of data, scientists hope to learn from SARS-CoV-2 and apply this knowledge to other pathogens. Los Alamos computational biologist Bin Hu was inspired to explore predicting functions from genes using a set of experimental results from the Fred Hutchinson Cancer Research Center in Seattle, Washington. The “Fred Hutch” scientists conducted lab experiments to identify how well certain antibodies bind to the spike proteins from different SARS-CoV-2 variants and to link the results to the genetic sequence that made each antibody and spike. Hu saw this comprehensive dataset as an opportunity to test machine learning (ML) as a method for predicting phenotype from genotype.
“I split their data into two groups,” says Hu. “I used about eighty percent of their data to teach my ML model which genetic sequences create spike protein shapes that will bind to antibodies. Then I used the remaining twenty percent to validate.” Hu explains that to validate his algorithm, he gave it the genetic sequence of a spike protein as an input and asked it to predict whether or not an antibody would bind to that protein. He then checked the prediction against the Fred Hutch experimental data to see if the model gave the correct answer.
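The eighty-twenty split Hu describes is a standard way to check a model against data it has never seen. The sketch below shows only that workflow, under loud assumptions: the records are invented placeholders, and `fit_majority` is a deliberately trivial stand-in for a real ML model, which the article does not describe.

```python
import random

def train_validate_split(records, train_fraction=0.8, seed=0):
    """Shuffle the records and split them into training and held-out sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def fit_majority(train):
    """Placeholder 'model': always predicts the most common training label."""
    labels = [binds for _, binds in train]
    majority = labels.count(True) >= labels.count(False)
    return lambda sequence: majority

def accuracy(predict, held_out):
    """Fraction of held-out records whose label the model predicts correctly."""
    correct = sum(predict(seq) == binds for seq, binds in held_out)
    return correct / len(held_out)

# Invented (sequence, antibody_binds) records, not Fred Hutch data.
records = [(f"toy_sequence_{i}", i % 3 != 0) for i in range(10)]
train, held_out = train_validate_split(records)
model = fit_majority(train)
score = accuracy(model, held_out)
print(len(train), len(held_out))  # 8 2
```

Because the held-out records play no part in training, the score reflects how the model handles sequences it was never taught, which is the point of the validation step.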
Hu says his ML model was quite accurate with its predictions and he is continuing to collaborate with the Fred Hutch scientists on this approach. Hu’s team hopes to expand their predictive model to other viruses such as HIV and influenza for which there is additional experimental data. Increasing the number of known gene-to-function links ultimately helps scientists know what they’re looking at when they come across new pathogens. For SARS-CoV-2 it is clear that changes in the spike protein must be monitored. Scientists need to find those types of “signatures” in other pathogens—genes which they can confidently link to a function of concern—so that identifying the presence of those signatures in a newly discovered organism would be a harbinger of a dangerous pathogen.
Eager to use SARS-CoV-2 data to learn even more, scientists are also wondering if genetic mutations arise truly by chance or if there is some external pressure that makes certain changes more likely to occur than others. Naturally, mutations in the spike protein that increase transmissibility give an evolutionary advantage and lead to dominant variants, but could there be other pressures that favor one mutation over another? The abundance of genomic data available in near real time during this continuing pandemic is enabling scientists to infer mutation patterns and then make predictions about what novel mutations might happen next. Scientists can assess the accuracy of these predictions using newly sequenced SARS-CoV-2 genomes that are deposited daily in global databases.
Lab computational biologist Jason Gans is doing just that: using SARS-CoV-2 data to study viral evolution. Gans wrote a computational algorithm that extracts patterns of mutations in SARS-CoV-2 and uses them to estimate the likelihood of future mutations. “This is a unique opportunity not to have to wait for validation data because new data are being made available every day,” says Gans.
The computational model identifies “parent” genomes and catalogs the observed genomic differences in each subsequent generation. For instance, the model learns how many times an adenine “A” nucleotide is found to have changed to a uracil “U” (in RNA viruses like SARS-CoV-2, the alphabet of allowed nucleotides is A, U, G and C). Next, Gans uses the model to make predictions about the likelihood of mutations in future children (genomes that are circulating in the population but not yet sequenced and available to study). Since new SARS-CoV-2 genomes are being added daily to community databases, Gans will quickly be able to measure the accuracy of his model’s mutation predictions.
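The bookkeeping step described above, tallying which nucleotide changed into which between a parent genome and its descendant, can be sketched in a few lines. This is a simplified illustration, not Gans’s algorithm: it assumes the sequences are already aligned and of equal length, counts substitutions only, and uses invented toy sequences and hypothetical function names.

```python
from collections import Counter

def count_substitutions(parent, child):
    """Count (from, to) nucleotide changes between two aligned sequences."""
    if len(parent) != len(child):
        raise ValueError("sequences must be aligned to the same length")
    return Counter((p, c) for p, c in zip(parent, child) if p != c)

def substitution_frequencies(pairs):
    """Relative frequency of each substitution type across parent-child pairs."""
    totals = Counter()
    for parent, child in pairs:
        totals.update(count_substitutions(parent, child))
    n = sum(totals.values())
    return {sub: count / n for sub, count in totals.items()} if n else {}

# Invented toy RNA sequences (alphabet A, U, G, C), not real genomes.
pairs = [("ACGUCC", "AUGUCU"), ("AACG", "UACG")]
freqs = substitution_frequencies(pairs)
print(freqs[("C", "U")])  # two of the three observed changes are C to U
```

Observed frequencies like these are only a crude first estimate of what might mutate next; a fuller model would also account for where in the genome each change occurs.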
“We wanted to break this down into a probabilistic problem and predict the most likely mutations that will be observed in the future evolutionary descendants of an existing SARS-CoV-2,” says Gans. He explains that it is also important to understand where, in the genome, the mutations are occurring. Does a likely mutation depend on what sequence motifs are nearby? Are mutations more likely to occur in particular regions of the genome? Gans says his research is corroborating others’ observations that C to U mutations in SARS-CoV-2 are extremely common (likely due to humans’ immune systems causing changes in the virus).
Mutation data are useful for improving countermeasures. Some vaccines are made using a pathogen’s unique genetic sequence—as are many diagnostic tests—so understanding viral evolution at the level of specific nucleotide changes is especially valuable to ensure tests or vaccines won’t fail when the virus inevitably evolves. Furthermore, understanding which mutations are more likely, coupled with the growing body of data about which genes are most dangerous, leads scientists in the direction of predicting which pathogens are new threats.
Seeing inside the egg
The SARS-CoV-2 virus accumulated the right mix of genes to make it well suited for human transmission, and humans proved to be a widely susceptible population. It will happen again. It could be another coronavirus, or an influenza virus, or perhaps something completely new. And once again, its trajectory will be determined by both the identity of the unique pathogen and the behavior of the disease in the population it infects.
Preparing governments and public health systems is arguably the most important way for humans to be ready to face this next threat when it comes. These infrastructures are critical to pandemic response, but the science also shows the significance of human behavior on disease spread and the role of other environmental conditions—such as proximity to animals and climate changes—on disease emergence. Understanding these variables is essential to anticipating disease behavior. Beyond this, studying the genes and functions of pathogens will help scientists recognize threats when they come along, and implementing robust surveillance, testing, and communication in all countries will facilitate action.
Building on this pandemic, scientists hope to be better prepared to recognize the next threat when they see it, maybe even before it makes the jump to humans. Biology is complex and there could still be something out there that takes us by surprise (perhaps a dragon inside the egg), but if scientists can learn which clues to look out for, they will hopefully be ready with a dragon-sized net.