By Roger Highfield on

Coronavirus: The Global COVID-19 Observatory

For the first time, scientists can see a pandemic evolve in real time at the genetic level, revealing ‘variants of concern’ while watching for large-scale genetic changes in the virus behind COVID-19 that might occur by a process called recombination.

Every time a person becomes infected, there is a chance for the virus to mutate and reinvent itself, with a potential impact on how it spreads, its virulence and the effectiveness of vaccines.

The UK is leading the world in tracking the evolution of the SARS-CoV-2 virus, with the COVID-19 Genomics UK (COG-UK) consortium having already sequenced around 180,000 viral genomes and counting.

The rest of the world has contributed around 170,000 virus sequences, creating a vast COVID-19 family tree, including the B.1.1.7 variant first seen in Kent, which is becoming dominant in the UK.

Last month, the Government announced that the UK will offer genomics expertise to countries that do not have sufficient resources, an effort led by Public Health England, NHS Test and Trace and academic partners as well as the World Health Organization’s SARS-CoV-2 Global Laboratory Working Group.

I talked to Oliver Pybus, of the University of Oxford, a member of one of the many teams analysing the genetic code of the SARS-CoV-2 virus, to chart its spread worldwide and the emergence of ‘variants of concern’. His answers are in italics.


One of the reasons COG-UK got off the ground so well was that all of the necessary expertise was in place already. The UK has world-leading experts in virus genomics, from the logistics of sample handling, extraction and sequencing, through to very high-throughput bioinformatics and evolutionary analysis.

It would be fantastic to share this expertise with the world. We already have experience in capacity building and training to support scientists in other countries to do that locally.  For example, through two projects (ZIBRA and CADDE), we have helped to create a nexus of expertise in Brazil for portable (nanopore) sequencing. 

However, rather than send samples internationally, I think it is better to sequence locally, in real-time, then share the data internationally.

The emergence of the variants has shown the importance of minimising delay and latency in the system.  We need to go from sampling a patient to a published sequence as fast as possible. If we ship samples internationally, or even transport them across a large country, that really slows things down. So developing unified ways of presenting and sharing data internationally is the model we really need.  

The UK can draw on all this expertise because it played a key role in establishing the field of molecular biology, from pioneering work in Cambridge on the double helix structure of DNA and structural biology, notably at the MRC Laboratory of Molecular Biology in Cambridge, to developing methods to read, or sequence DNA, and innovative technologies to sequence more cheaply and easily than ever before, for instance with portable sequencers developed at the University of Oxford.

Crick and Watson’s DNA molecular model, 1953. This reconstruction of the double helix model of DNA (deoxyribose nucleic acid) contains some of the original metal plates used by Francis Crick (b 1916) and James Dewey Watson (b 1928) to determine the molecular structure of DNA.

The UK also has a consortium working out how these genetic changes affect the behaviour of the virus, and has witnessed a nimble funding response to the pandemic from organisations such as the Medical Research Council, National Institute for Health Research and the Wellcome Trust.


Detection of variants of concern has, so far, been ad hoc: there is no consistent or consensus method for assessing a new virus genome sequence.

There are hundreds of different genetic lineages of the virus, but we don’t consider each and every one of those to be a variant of concern.

When it comes to the recent variant, B.1.1.7, this was found when Public Health England investigated a cluster of cases in Kent. When their data was cross-linked with genomic data held by COG-UK, the outbreak was associated with an unusual and new lineage of the virus. And this lineage appeared to be increasing in relative frequency compared with others.

The B.1.1.7 lineage also contains an unusually large number of genetic changes and some of those genetic changes were at genomic sites, particularly in the virus’ spike protein, that other work suggests could have phenotypic relevance, that is, they could affect how the virus behaves in terms of virulence and transmissibility.

The spike protein is key to the ability of the virus to invade human cells. Evidence to date suggests that the B.1.1.7 variant transmits up to 40 per cent more easily and preliminary data suggest that it is more lethal too, such that, if 1,000 60-year-olds were infected with the old variant, 10 of them might be expected to die but this death toll rises to about 13 with the new variant.


If you plot the sampling date of each virus against the number of genetic changes it has accumulated from the ancestral virus at the beginning of the pandemic, you can get an estimate of how fast the virus is mutating and evolving (the rate is around two nucleotides, the ‘letters’ that spell out the code, per month).

This plot shows a big splay of points. All of these variants of concern are at the top of the plot: they are in the top 5% of viruses in terms of how many mutations they have accumulated since the beginning of the outbreak.  
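The plot described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sampling dates and mutation counts are made up (they are not COG-UK data), and the function names and the `excess` threshold are hypothetical. The slope of a least-squares line gives a rate in mutations per month, and a genome that sits far above the line stands out.

```python
from statistics import mean

def fit_rate(months, mutations):
    """Least-squares slope and intercept of mutation count against sampling time."""
    mx, my = mean(months), mean(mutations)
    sxx = sum((x - mx) ** 2 for x in months)
    sxy = sum((x - mx) * (y - my) for x, y in zip(months, mutations))
    slope = sxy / sxx
    return slope, my - slope * mx

def flag_outliers(months, mutations, excess=8):
    """Indices of genomes with at least `excess` more mutations than the trend predicts.

    The threshold of 8 is an arbitrary, illustrative choice.
    """
    slope, intercept = fit_rate(months, mutations)
    return [i for i, (x, y) in enumerate(zip(months, mutations))
            if y - (slope * x + intercept) >= excess]

# Hypothetical data: months since the start of the pandemic, and mutations
# accumulated relative to the ancestral virus.
months = [0, 2, 4, 6, 8, 10, 12]
mutations = [0, 4, 8, 12, 16, 20, 24]
rate, _ = fit_rate(months, mutations)

# Add one made-up genome that has far more changes than its sampling date suggests.
flagged = flag_outliers(months + [11], mutations + [40])
```

In this toy data set `rate` comes out at two mutations per month, and only the added genome is flagged.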

What the three virus lineages of concern share is that each of them is characterized by an unusually large number of genetic changes, not just one or two.  

Each of these three lineages has many new mutations, including sets of mutations that we think could be worrisome, as well as a lot of other genetic changes.  

What is interesting is that some of these variants show evidence of parallel or convergent evolution. If we see one specific mutation arise multiple times, that’s interesting but not unexpected, given the number of infections and the rate of virus mutation.

However, there are situations under which we should take much more notice of a new variant. One is when we see not one, but whole sets of multiple mutations, arising independently in different lineages. Mathematically speaking, that becomes fantastically unlikely to have occurred just by chance.
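To get a feel for why whole sets of shared mutations are so much less likely than a single shared mutation, here is a rough Python sketch. It assumes, purely for illustration, that each mutation arises independently and with the same probability in each lineage; the value of 0.01 is invented, and real mutations are neither independent nor equally likely.

```python
def chance_of_same_set(p_single, set_size, lineages):
    """Chance that the same set of mutations arises in every lineage by luck alone.

    Assumes each mutation arises independently with probability `p_single`
    in each lineage (an illustrative simplification, not a fitted model).
    """
    return (p_single ** set_size) ** lineages

# Hypothetical per-mutation probability of 1 in 100, two independent lineages.
one_shared = chance_of_same_set(0.01, set_size=1, lineages=2)
five_shared = chance_of_same_set(0.01, set_size=5, lineages=2)
```

Under these assumptions a single shared mutation has a chance of about 1 in 10,000, while a shared set of five has a chance of about 1 in 10^20, which is why chance alone becomes an unconvincing explanation.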

So, there must be a reason for those mutations to be seen together. Of particular interest here are the B.1.351 and P.1 lineages that were first detected in South Africa and Brazil. These two lineages share a high number of potential mutations of concern. We don’t yet know why.

The current variants of concern are: B.1.1.7 (also known as 501Y.V1, or the ‘Kent strain’); B.1.351 (also known as 501Y.V2, or ‘the South African strain’) and P.1 (also known as 501Y.V3, or ‘the Brazilian strain’).

These variants of the ancestral SARS-CoV-2 virus have 23, 21 and 17 mutations, respectively, each of which is a single ‘letter’ change in the 30,000 letters of the virus’s genetic code, which is written in the genetic material RNA.

Crucially, there are 8, 9 and 10 mutations in the spike, respectively, which is the protein that the virus uses to invade human cells.

The virus spike – there are between 25 and 40 on the surface of each virus – has evolved to stick to proteins on the surface of human cells that respond to angiotensin, a hormone that helps maintain blood pressure, hence their name: angiotensin-converting enzyme 2 (ACE2) receptors.

Overview of the coronavirus. Source: CDC Public Health Image Library. Dr Yorgo Modis, University of Cambridge.

This week, cases of the B.1.1.7 variant first seen in Kent were recorded that also included E484K, the concerning genetic change seen in South Africa.

One of the places where mutations occur is on a part of the spike called the RBD, or receptor binding domain, which is both a target of neutralizing antibodies and the host cell attachment binding site, so it has a dual role.

That means you can have four combinations of effects caused by mutations: mutations that increase the ability of the virus to escape the body’s immune system, and also increase binding; mutations that increase escape but decrease RBD binding, and so on.


I expect some parts of the viral genome will have a greater intrinsic propensity to mutate. For example, some combinations of nucleotide sequences can be more prone to some types of mutations than others. However, that variation is not well understood yet for SARS-CoV-2.

More generally, evolution is a two-step process. The first step is the generation of a mutation. But if that mutation does not allow the virus to mature or leave the cell, then it is harmful to the virus, and we are not going to pick it up in our genomic surveillance. Similarly, we won’t observe mutations that stop the virus from spreading between people.

Therefore the mutations that COG-UK actually observes are clustered in those places on the virus RNA genome that don’t kill or stop the virus. This is an example of ‘survivor bias’, as we only detect the viruses with non-lethal mutations. The bias was famously described during the Second World War, when the statistician Abraham Wald worked out where to place armour on airplanes.

The profound insight emerged from a plot that had been created for Wald which showed the position of all the bullet holes on fighter planes and bombers that had safely returned to airbases, such as the one owned by the Science Museum Group in Wroughton: the wings and most of the fuselage appeared riddled with bullets, suggesting (to some, though not to Wald) that they needed more armour.

Wald realised that was precisely the wrong way to interpret this data. Just as mutations are relatively random, so he assumed it was unlikely that the Germans could aim precisely at one part of a plane.

Wald reasoned that the data for the plot only came from planes that had successfully returned to base, where the damage could be assessed. All the planes that blew up or crashed must have been damaged just in the places where the data on these surviving planes would never show a direct hit: in the cockpit, killing the pilot; engines, causing the plane to lose power; fuel tanks causing an explosion, and so on.

This flawed interpretation is known as ‘survivor bias’ and is defined as the error of concentrating on the variants that have survived a selection process and overlooking those that did not – COG-UK can only see mutations in the sites where the SARS-CoV-2 virus can accept some change.
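The same effect can be reproduced in a toy simulation. The Python sketch below is only an illustration: the genome length, the choice of which sites are ‘lethal’ and the function name are all invented. Mutations land at random, but any that hit a lethal site never reach the surveillance data, so the observed mutations cluster in the tolerant sites.

```python
import random

def observed_mutation_sites(genome_length, lethal_sites, n_mutations, seed=0):
    """Return the sites of mutations that would show up in surveillance.

    Mutations strike sites uniformly at random, but a mutation at a site in
    `lethal_sites` stops the virus, so it is never sampled.
    """
    rng = random.Random(seed)
    observed = []
    for _ in range(n_mutations):
        site = rng.randrange(genome_length)
        if site not in lethal_sites:
            observed.append(site)
    return observed

# Hypothetical toy genome of 100 sites in which every even-numbered site is essential.
lethal = set(range(0, 100, 2))
seen = observed_mutation_sites(100, lethal, n_mutations=1000)
```

Every site in `seen` is odd-numbered, even though the mutations themselves were scattered evenly across the genome. Reading that pattern as ‘mutations only happen at odd sites’ would be the same mistake as armouring the wings.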


To describe an object moving in time and space, you need four dimensions: three dimensions that describe space and one defining the time.

When it comes to showing the evolution of the SARS-CoV-2 virus you need many more to show the 30,000 ‘letters’ of RNA genetic code at a given time, when each of these letters can be in one of four states (there are four genetic ‘letters’ in the alphabet of RNA, chemicals called nucleotides, which are adenine (A), guanine (G), cytosine (C), and uracil (U)). Genetic changes in variants can be plotted in this high dimensional ‘sequence space.’

We can show all these dimensions in what we call ‘sequence space.’ Every time the virus mutates, it’s moving or walking within that space. In fact, the mathematical models used to understand virus evolution represent viruses exploring that space, like a molecule of a gas randomly diffusing through the air.
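A short Python sketch gives a sense of the scale of sequence space and of what a ‘step’ within it means. It takes the round figure of 30,000 letters and the four-letter RNA alphabet from the text; the function names and the example sequences are illustrative assumptions.

```python
import math

GENOME_LENGTH = 30_000  # approximate number of 'letters' in the genome
ALPHABET = "AGCU"       # the four RNA nucleotides

def digits_in_sequence_space(length=GENOME_LENGTH, letters=len(ALPHABET)):
    """Number of decimal digits in letters**length, the count of possible sequences."""
    return math.floor(length * math.log10(letters)) + 1

def hamming_distance(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))

size_digits = digits_in_sequence_space()

# Two made-up fragments that differ by a single letter: one step in sequence space.
steps = hamming_distance("AGCUAGCU", "AGCUAGGU")
```

The number of possible sequences, 4 to the power 30,000, runs to more than 18,000 digits, while the made-up pair above sits just one step apart.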

The ancestral virus of the pandemic is one specific point in that sequence space. As more people become infected and mutations accumulate, you can see the pandemic diffuse and spread out across that sequence space. As Wald showed, we won’t see virus sequences that contain mutations that are fatal or strongly deleterious to the virus.

So, if you could plot out all the viable viruses in sequence space, it would look like a very, very bubbly sourdough with thin stretches of viable sequences, and huge holes, which are sequence states that the virus can’t reach. As a result, the path of virus evolution must follow the viable stretches. The pandemic is still young, only a year or so old, and so far it has explored only a small part of the sequence space available to it.

The total genetic diversity of SARS-CoV-2, that is, all the genetic differences accumulated over 12 months among a quarter of a million genomes, is truly dwarfed by the genetic diversity of hepatitis C, another RNA virus (you can get a sense of the difference from this illustration).

In this context, the SARS-CoV-2 variants of concern have taken a big leap in sequence space in a short period of time. We don’t yet know how that jump occurs.

One hypothesis that we’ve put forward is that a large number of mutations can accrue very quickly in certain individuals who are chronically infected and/or immune-suppressed, so their immune system does not work so well. The selective environment for the virus within those individuals may be very different to that in a normal infection, and could allow a large amount of virus genetic diversity to build up within those people.

By comparison, a normal infection is relatively quick and the virus is passed on before genetic diversity builds up.

In addition, if immunocompromised patients are given treatments like antibody therapy, that then could impose a strong selective pressure on a very large and diverse virus population, which could lead to the appearance of many mutations in a short period of time.

This is only a hypothesis and we don’t really know if this has happened. Our speculation is based on the data from a series of studies that have shown that in these chronically infected and/or immune-compromised patients there is a rapid accumulation of genetic changes.


We must keep an eye on recombination. In this process, a patient is infected with two variants of SARS-CoV-2, which can become mixed, thereby producing new combinations of mutations in a short period of time.

This transmission electron microscope image shows SARS-CoV-2, the virus that causes COVID-19, emerging from the surface of cells.
The image was captured and colorized at NIAID’s Rocky Mountain Laboratories (RML) in Hamilton, Montana.

This genetic process of recombination is like getting two card decks, one red and one blue, and splitting and swapping the decks, so half of each deck is red, and half blue.
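The card-deck picture can be written down as a small Python sketch. It assumes a single crossover point and two parents of equal length, which is a simplification; the function name and the toy ‘decks’ are illustrative.

```python
def recombine(parent_a, parent_b, cut):
    """Swap the tails of two equal-length sequences at position `cut`."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must have the same length")
    if not 0 <= cut <= len(parent_a):
        raise ValueError("cut must lie within the sequence")
    child_1 = parent_a[:cut] + parent_b[cut:]
    child_2 = parent_b[:cut] + parent_a[cut:]
    return child_1, child_2

# Toy 'decks': R for a red card, B for a blue card.
red = "RRRRRRRR"
blue = "BBBBBBBB"
child_1, child_2 = recombine(red, blue, cut=4)
```

Each child deck is now half red and half blue. If the two parents had been nearly identical to begin with, the swap would leave almost no trace.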

We don’t have any evidence that any of the three variants of concern arose from recombination.

But as the pandemic proceeds, we should be open-minded about the possibility of recombination because human coronaviruses, in general, do recombine. We can see recombination in the close relatives of SARS-CoV-2 that have been found in bats and pangolins.

For recombination to be detected, you’ve got to have high rates of infection so that there is a reasonable chance that one person can be infected with two variants at the same time that are quite distinct from each other.

Early on in the pandemic, when all of the strains were very similar to each other, even if recombination had occurred, we wouldn’t really notice. But now that the virus genetic lineages are more divergent from one another, it will be more apparent if recombination is taking place.

We don’t see any evidence of recombination yet in the very large UK dataset, or in the variants of concern. Given the intensity and frequency of sampling in the COG-UK dataset, we should have seen if two parent lineages in the UK combined to give a big leap in sequence space.


The picture remains the same as a year ago: bats and/or pangolins from east or southeast Asia are likely candidates. Just a few days ago a new bat coronavirus was reported from Cambodia that is closely related to SARS-CoV-2.

The focus of concern was originally the Hua Nan seafood and wet animal wholesale market in Wuhan, China, but how that leap between species took place is still a matter of speculation. Most agree, however, that the opportunities for animal viruses to infect humans are rising.


One of the deletions in lineage B.1.1.7 occurs in the probe binding site of one particular PCR test. This test has three gene targets, and if the test is run on B.1.1.7, then two of the targets are positive and one is negative. This is not the only virus in the UK (or internationally) that exhibits this behaviour, but the other UK lineages with this deletion evolved separately and were circulating earlier in October and November.

It’s safe to assume that almost all the “SGTF” (S gene target failure) tests in the UK, where the S gene target is negative, were caused by B.1.1.7 infections from the beginning of December onwards.
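The logic of reading a three-target test can be sketched in Python. This is an illustration only, not a laboratory protocol: the function name, the parameter names for the two non-S targets and the result labels are all assumptions.

```python
def classify_pcr(s_positive, other_a_positive, other_b_positive):
    """Label a three-target PCR result (illustrative labels, not a lab protocol).

    `s_positive` is the S gene target; the other two parameters stand for the
    test's two remaining gene targets, whose names are left unspecified here.
    """
    if s_positive and other_a_positive and other_b_positive:
        return "positive"
    if not s_positive and other_a_positive and other_b_positive:
        return "SGTF"  # two targets positive, S gene target negative
    if not (s_positive or other_a_positive or other_b_positive):
        return "negative"
    return "inconclusive"

b117_like = classify_pcr(s_positive=False, other_a_positive=True, other_b_positive=True)
```

A result labelled `"SGTF"` is a pointer towards a lineage carrying the deletion rather than proof of B.1.1.7, since other lineages carry the same deletion; sequencing settles the question.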

But a variety of different tests are used for the virus and, given all the crosstalk between the Lighthouse labs (which do the PCR tests) and COG-UK, we can quickly identify viruses that do not test positive for all targets. 

There are lots of other tests that don’t use those particular three targets. The testing system is robust – not everything is just reliant on one particular technology.


The World Health Organization held a consultation on new variants on 12 January and it was clear that everyone saw the need for a more consistent way of naming and identifying new variants of concern. A follow-up meeting took place this week.

Defining a variant of concern should take into account multiple streams of evidence: epidemiological data on whether there’s an uptick in cases; genetic data on whether there’s an unusually large number of genetic changes and whether those changes are at sites that are particularly important; and experimental evidence from in vitro (in the lab) or in vivo (animal testing) experiments on the functional relevance of those mutations.

Those streams of data need to come together and then go through a “traffic light” system, which would see a variant move through levels of importance and certainty, based on the evidence available.

The last thing that we want to do is to be flagging too many viruses, because we don’t want to be chasing too many leads. The proof of the variant pudding is in the epidemiology. When a variant takes off, that demonstrates that it’s relevant, not just at the cellular level, not just at the organism level, but at the level of the whole population.

With genomic epidemiology, we’ve created an incredible new tool, an observatory that reveals the genetic life of the virus in real time. However, we don’t yet have formal response mechanisms, in terms of public health and decision making, to process all this new information that’s being generated in real time. It would have been nice to consider this outside of a pandemic but, as they say, necessity is the mother of invention.


The latest picture of how far the pandemic has spread can be seen on the Johns Hopkins Coronavirus Resource Center or Robert Koch-Institute.

You can check the number of UK COVID-19 lab-confirmed cases and deaths along with figures from the Office for National Statistics.

There is more information in my earlier blog posts (including in German by focusTerra, ETH Zürich, with additional information on Switzerland), from UK Research and Innovation (UKRI), the EU, the US Centers for Disease Control, the WHO, on this COVID-19 portal and Our World in Data.

The Science Museum Group is collecting objects and ephemera to document this health emergency for future generations.