Comprehensive Pancreatic Cancer Risk Prediction Model Integrating Genetic, Lifestyle, and Medical History Variables: Insights from the UK Biobank


Pancreatic cancer (PaCa) ranks as the 10th most common cancer and the 5th leading cause of cancer mortality in the United Kingdom. The prognosis is poor, with a five-year survival rate of only 7%. This is largely attributed to the late stage at which the disease is typically diagnosed, a consequence of its asymptomatic early course and the absence of effective screening programs, and to the limited treatment options available. Identifying individuals at high risk for PaCa is essential for developing early prevention strategies and improving outcomes. The challenge lies in the multiplicity of PaCa risk factors, which include genetic predisposition, lifestyle factors (such as smoking and alcohol consumption), and medical history-related conditions (such as diabetes mellitus and pancreatitis). Single-nucleotide polymorphisms (SNPs) have been increasingly recognized as significant contributors to cancer risk, and numerous SNPs associated with PaCa have been identified through genome-wide association studies (GWAS). These findings have facilitated the development of polygenic risk scores (PRS), which aggregate the effects of multiple SNPs and stratify individuals by their genetic susceptibility to PaCa. Despite these advances, comprehensive risk prediction models that integrate the full spectrum of PaCa risk factors, encompassing genetic, lifestyle, and medical history-related variables, are still lacking. Most existing models focus on a limited set of risk factors and fail to capture the complex interplay of causes that may contribute to PaCa risk. This gap highlights the need for an integrated approach that can effectively identify high-risk individuals and inform targeted prevention efforts. To this end, a new study published in Biomedicines and conducted by PhD candidate Te-Min Ke, Dr. Artitaya Lophatananon, and Professor Kenneth Muir from the University of Manchester developed a new integrated PaCa risk prediction model.
The team performed a nested case-control study using the UK Biobank cohort, which includes comprehensive health and genetic data from over 500,000 participants. The analysis comprised 1,402 incident pancreatic cancer cases identified after study enrollment and 257,348 cancer-free controls. The researchers classified the exposure variables into three categories: non-modifiable variables (gender, age, blood type, family history of bowel cancer, and PRS), lifestyle-related modifiable variables (tobacco smoking, alcohol intake, BMI, waist-to-hip ratio, and physical activity), and medical history-related variables (pancreatitis, diabetes mellitus, hepatitis B, gallbladder-related diseases, Helicobacter pylori infection, peritonitis, vitamin D deficiency, and systemic lupus erythematosus). They derived the PRS from 40 SNPs associated with pancreatic cancer that were identified through GWAS. The PRS provided a quantitative measure of genetic susceptibility to PaCa and was used to stratify participants into quintiles. Higher PRS quintiles were significantly associated with increased PaCa risk, which underscores the importance of the genetic component in risk prediction.
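The PRS step can be illustrated with a minimal sketch. The article does not state the exact formula the authors used, so this assumes the common construction in which a PRS is the sum of risk-allele counts weighted by each SNP's log odds ratio. The genotypes, allele frequencies, and weights below are synthetic placeholders, not values from the study; only the number of SNPs (40) and the use of quintiles come from the text.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_people, n_snps = 1000, 40  # 40 SNPs as in the study; the individuals are synthetic

# Hypothetical per-SNP weights (log odds ratios). Real weights would come
# from GWAS summary statistics, which are not given in this article.
log_or = rng.normal(loc=0.0, scale=0.1, size=n_snps)

# Synthetic genotypes: count of risk alleles (0, 1, or 2) per person and SNP.
allele_freq = rng.uniform(0.05, 0.5, size=n_snps)
genotypes = rng.binomial(2, allele_freq, size=(n_people, n_snps))

# Assumed PRS definition: weighted sum of risk-allele counts.
prs = genotypes @ log_or

# Stratify individuals into quintiles, Q1 (lowest score) to Q5 (highest).
labels = ["Q1", "Q2", "Q3", "Q4", "Q5"]
quintile = pd.qcut(prs, q=5, labels=labels)
counts = pd.Series(quintile).value_counts().sort_index()
print(counts)
```

In an analysis like the one described, the quintile label (rather than the raw score) would then enter the model as a categorical variable, with Q1 as the reference group.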

The authors employed a random forest model to identify the most influential risk factors for PaCa. The model was trained on 85% of the dataset and tested on the remaining 15%, with 10-fold cross-validation used for internal validation. The optimal parameters for the random forest model were determined using the RandomizedSearchCV and GridSearchCV functions in the Scikit-learn package. The model revealed that the top five influential features were age, PRS, pancreatitis, diabetes mellitus (DM), and smoking. Other significant variables included alcohol consumption, gallbladder-related diseases, BMI, physical activity, and gender. To complement the random forest model, the researchers also developed a multivariable logistic regression model using stepwise selection. This model quantified the odds ratio (OR) for each risk factor and provided a clear interpretation of its contribution to PaCa risk. It identified nine significant risk factors: male gender (OR = 1.17), age (OR = 1.10 per year), non-O blood type (OR = 1.29), higher PRS quintile (Q5 vs. Q1, OR = 2.03), current smoking (OR = 1.82), higher alcohol consumption (OR = 1.27), pancreatitis (OR = 3.99), DM (OR = 2.57), and gallbladder-related diseases (OR = 2.07). Moreover, the authors created visual nomograms based on the logistic regression model. These make the findings accessible and actionable by allowing users to calculate the probability of developing PaCa by summing weighted point values for each risk factor. Additionally, they developed a dynamic, web-based nomogram to provide an interactive tool for immediate risk assessment in clinical and community settings. The nomogram visualization highlighted the relative importance of each risk factor, with age, pancreatitis, DM, and PRS being the most influential.
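The random forest workflow described above can be sketched with Scikit-learn as follows. This is an illustration on synthetic data, not the authors' code: the 85/15 split, 10-fold cross-validation, and the use of RandomizedSearchCV and GridSearchCV come from the text, while the dataset, the parameter ranges, the AUC scoring metric, and the coarse-then-fine search order are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import (GridSearchCV, RandomizedSearchCV,
                                     StratifiedKFold, train_test_split)

# Synthetic stand-in for the risk-factor table (not UK Biobank data).
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=0)

# 85% training / 15% held-out test split, as described in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

# 10-fold cross-validation on the training portion.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Coarse randomized search over assumed parameter ranges.
coarse = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": [25, 50, 100],
                         "max_depth": [3, 5, 8, None],
                         "min_samples_leaf": [1, 5, 10]},
    n_iter=5, scoring="roc_auc", cv=cv, random_state=0)
coarse.fit(X_train, y_train)
best = coarse.best_params_

# Finer grid search around the best coarse setting (assumed strategy).
leaf_grid = sorted({1, best["min_samples_leaf"], 2 * best["min_samples_leaf"]})
fine = GridSearchCV(
    RandomForestClassifier(n_estimators=best["n_estimators"],
                           max_depth=best["max_depth"], random_state=0),
    param_grid={"min_samples_leaf": leaf_grid},
    scoring="roc_auc", cv=cv)
fine.fit(X_train, y_train)

# Evaluate on the held-out 15% and rank features by importance.
model = fine.best_estimator_
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
importances = model.feature_importances_
ranking = importances.argsort()[::-1]  # feature indices, most important first
print(f"held-out AUC: {test_auc:.3f}; top features: {ranking[:5]}")
```

The `ranking` array plays the role of the feature ranking reported in the study; with real data, each index would map to a named variable such as age or PRS quintile.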
The online availability of the dynamic nomogram further enhances the model’s usability and enables healthcare providers and individuals to easily assess PaCa risk and implement targeted prevention strategies. In conclusion, combining the results from both models provided a comprehensive understanding of PaCa risk factors: the random forest model identified the most critical risk variables, and the logistic regression model quantified their impacts. This dual-model approach yields a robust risk prediction framework capable of integrating genetic predisposition, lifestyle factors, and medical history. Moreover, the new dynamic nomograms allow for personalized risk assessment, making it easier for healthcare providers to tailor prevention and early detection strategies to individual patients, which can potentially lead to earlier diagnosis and better outcomes for patients at high risk of PaCa. Furthermore, the visual and dynamic nomograms provide an intuitive tool for clinicians to assess and communicate risk to patients, which can enhance patient understanding and engagement.
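The arithmetic behind a nomogram of this kind can be sketched briefly. Summing points per risk factor corresponds, on the logistic regression scale, to adding log odds ratios, which is the same as multiplying the odds ratios. The ORs below are the ones reported in this article; the baseline odds value is a hypothetical placeholder, since the model's intercept is not given here, and the sketch assumes no interaction terms. Age is left out because its OR is per year and would need a reference age.

```python
import math

# Odds ratios as reported in the article for the logistic regression model.
REPORTED_OR = {
    "male": 1.17,
    "non_o_blood_type": 1.29,
    "prs_q5_vs_q1": 2.03,
    "current_smoking": 1.82,
    "higher_alcohol": 1.27,
    "pancreatitis": 3.99,
    "diabetes": 2.57,
    "gallbladder_disease": 2.07,
}

# Hypothetical baseline odds for a reference individual.
# This is an assumption for illustration, NOT a value from the study.
HYPOTHETICAL_BASELINE_ODDS = 0.002


def relative_odds(present):
    """Combine the ORs of the listed risk factors by summing log-ORs.

    Assumes the factors act independently on the log-odds scale.
    """
    log_odds = sum(math.log(REPORTED_OR[name]) for name in present)
    return math.exp(log_odds)


def probability(baseline_odds, rel_odds):
    """Convert baseline odds scaled by relative odds into a probability."""
    odds = baseline_odds * rel_odds
    return odds / (1.0 + odds)


profile = ["male", "current_smoking", "diabetes"]
ro = relative_odds(profile)  # 1.17 * 1.82 * 2.57
p = probability(HYPOTHETICAL_BASELINE_ODDS, ro)
print(f"relative odds: {ro:.2f}, illustrative probability: {p:.4f}")
```

Because the baseline is invented, only the relative odds carry meaning here; an absolute risk estimate would require the fitted intercept from the published model.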


The research leading to the results presented in this paper has received funding from the European Union-funded project iHELP under grant agreement no. 10101744. The iHELP project focuses on developing and utilizing AI-driven learning and decision-support technology to identify and mitigate risks associated with pancreatic cancer at an early stage. For more information about the iHELP project, please visit their website:

About the author

Te-Min Ke, Radiation Oncologist; PhD candidate in epidemiology at the University of Manchester.

Artitaya Lophatananon, Senior Research Fellow in Epidemiology at the University of Manchester.

Kenneth Muir, Professor of Epidemiology at the University of Manchester.


Ke TM, Lophatananon A, Muir KR. An Integrative Pancreatic Cancer Risk Prediction Model in the UK Biobank. Biomedicines. 2023 Dec 1;11(12):3206. doi: 10.3390/biomedicines11123206.
