By Roger Highfield

Genetic data bias, why it matters and how to fix it

This week we explore bias in genetic databases in an online event. To discuss why this is so important for our future health, Science Director Roger Highfield talks to Professor Naomi Allen, Chief Scientist of UK Biobank.

Genetic research has long been seen as a key tool to help people to live longer and better. This week, as part of our Open Talk series, we ask whether genetic databases benefit the whole population equally.

UK Biobank stands out as a monumental endeavour in this regard, offering researchers around the world an unparalleled resource of genetic and health data from half a million British volunteers.

However, its chief scientist, Prof Naomi Allen, is the first to admit that, because it relied on volunteers, it is not representative of the UK population, ‘nor was it designed to be.’

‘I don’t know of any volunteer-based research study that is representative because the type of people who volunteer are not the general public, by and large: the UK Biobank cohort has slightly more women than men, is slightly older, having recruited 40–69-year-olds, slightly wealthier and slightly healthier than Joe Public.’

Indeed, one study by the University of Oxford showed that the willingness to participate in such studies is in your genes.


UK Biobank, which is led by Professor Sir Rory Collins, began as a pilot study in the early 2000s, soon after scientists unveiled the first drafts of the entire human genetic code, or genome. Health researchers had realised that big data on genetics, health and lifestyle could give profound insights into who is most at risk of ill health and how diseases could be prevented.

Recruiting volunteers in 2006-2010, and launching the database in 2012, UK Biobank now ranks as the world’s most important database for health research. At the outset, said Prof Allen, ‘about 5.5 per cent of our cohort were black or ethnic minorities.’ That was similar to the 2001 census, she said, but underrepresents today’s UK ethnic diversity.

However, though 94.5 per cent of UK Biobank’s cohort is white, the study is so big overall that it still has significant numbers of people from diverse groups, including around 10,000 people of South Asian ancestry and 10,000 people of African ancestry, representing (until more focused studies take over) the world’s biggest whole genome studies of East Asian, South Asian and African populations.

These would be enough to find, say, an inherited disease linked with a single gene in different groups.


However, Prof Allen said there is an issue concerning complex diseases that arise from the influence of many DNA variants, so-called polygenic diseases, paired with environmental influences (such as diet, sleep, stress and smoking).

Studies of these variants can help predict an individual’s risk of disease but, because most genomic studies to date, such as UK Biobank, have examined individuals of European/white ancestry, there is not always enough data about genomic variants from other populations to work out a polygenic risk score for non-white populations, she explained.

Even so, for many conditions, she added, ‘the polygenic risk scores work pretty well across all ethnic groups because most of our genetics are pretty similar.’
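
As a rough illustration of the idea, a polygenic risk score is commonly calculated as a weighted sum: for each variant, the number of copies of the risk allele a person carries (0, 1 or 2) is multiplied by an effect size estimated in a genetic study, and the results are added up. The sketch below uses made-up variant names and weights, not UK Biobank data or any published score, and it treats a missing variant as zero copies, which is a simplification.

```python
def polygenic_risk_score(dosages, weights):
    """Weighted sum of risk-allele counts over the variants that have a weight.

    dosages: copies of the risk allele per variant (0, 1 or 2)
    weights: effect size per variant, as estimated in a reference study
    Variants missing from `dosages` are counted as 0 copies (a simplification).
    """
    return sum(weight * dosages.get(variant, 0) for variant, weight in weights.items())


# Hypothetical effect sizes, for illustration only
weights = {"variant_a": 0.5, "variant_b": 0.25, "variant_c": -0.5}

# Hypothetical person: copies of the risk allele carried at each variant
person = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

score = polygenic_risk_score(person, weights)
print(score)
```

Because the weights are estimated in a reference population, a score built mostly from European-ancestry data may transfer less well to other groups, which is the concern raised above.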

That is just as well because recruiting more diverse and minority populations to UK Biobank would take decades to provide enough data to catch up with the current database: more than 10,000 variables are collected on many of the volunteers, notably through genetic analyses, lifestyle questionnaires, routine measurements of blood, urine and saliva, eye examinations and so on.

Moreover, 100,000 volunteers wore smartwatches for a week to record their activity, while another 100,000 are set to receive MRI scans of the brain, abdomen, heart, and so on. UK Biobank also recently added the data from whole genome sequencing of half a million volunteers, a mammoth project that took five years and over 350,000 hours of sequencing.

To obtain access to more diverse data, not just ethnic but also cultural, dietary and so on, UK Biobank is working with similar studies in other parts of the world, such as the China Kadoorie Biobank, Mexico City Prospective Study, and America’s All of Us programme, ‘which is deliberately oversampling people from underserved communities, particularly to address this issue.’

However, Africa, the cradle of humanity, poses particular challenges because it has the greatest diversity of all, with many hundreds of different ethnic groups.

A developing UK study, Our Future Health, ambitiously aims to recruit five million participants across the UK and is also trying to recruit more people from the UK’s underserved populations, she added.


This research is also important for the use of artificial intelligence, when it is trained on big data to spot patterns in disease risk. One issue is that, if those data are biased, then the AI will be biased too, perpetuating health inequalities.

A recent report, Equity in Medical Devices, pointed out that the advance of AI in medical devices brings with it not only great potential benefits to medicine but also possible harm through inherent bias against certain groups, notably women, ethnic minority and disadvantaged socio-economic groups.

Prof Allen said: ‘Those models have been developed on a largely white population and so there’s necessity for those models to be replicated in different populations… science can only progress if the finding from one study is replicated in a different population to prove that it’s generalizable to the wider public.’


More than 30,000 researchers in over 90 countries are registered to use UK Biobank data, and they have produced more than 10,000 academic papers, a number that is increasing exponentially. Many of the genetic studies have focused on data from the roughly 95 per cent of the cohort that is white, though there are moves under way to include minorities. These studies have revealed:

Type 1 diabetes is not solely a disease of childhood. In fact, more than 40% of cases begin after the age of 30, a finding which should save lives by preventing misdiagnosis.

By studying the 100,000 UK Biobank volunteers who wore smartwatches for a week, researchers found that activity data could be used to predict Parkinson’s disease up to seven years before diagnosis.

A pattern of four proteins, all part of UK Biobank data, could predict onset of dementia more than a decade before formal diagnosis, a huge step towards a blood test for predicting the condition.

Infection with SARS-CoV-2 is followed by changes in the brain. UK Biobank data showed brain tissue loss, damage in areas connected with smell, and reduced cognitive ability, even in mild cases with no hospitalisation.

Some genes may be protective against obesity. The biotech company Regeneron found that specific mutations in a gene called GPR75 are associated with lower body weight, which could help develop drugs to mimic the protective effects of these mutations.

People with diabetes had smaller heart chambers and thicker left ventricle walls, according to a study by Queen Mary University of London, funded by the British Heart Foundation. These findings could help to detect heart damage early in people with diabetes to ensure that they are given treatment.

Sleeping for fewer than six hours a night is associated with higher risk of death, according to researchers at the University of Oxford. Additionally, findings showed that increasing physical activity lowered the risk of some cancers and cardiovascular disease.

With core funding from Wellcome and the Medical Research Council, UK Biobank will move to a new state-of-the-art facility in Manchester in mid-2026.

Watch How to fix racial bias in genetic databases on the Science Museum YouTube channel. In a previous discussion about the toxic fiction that is ‘race science’, a panel at the Science Museum examined how questionable research has been used to misrepresent the abilities of different ethnic groups, notably when it comes to intelligence and sport.