“Understanding both the biological and clinical aspects of the patient is essential to uncover the mechanism underlying the prognosis of the disease.”
Colorectal cancer (CRC) ranks among the most common and lethal cancers worldwide, accounting for approximately 10% of all cancer diagnoses. While advances in prevention and treatment have improved outcomes, predicting which patients will survive remains a complex challenge—one that depends on an intricate interplay between molecular biology and clinical factors.
A research paper titled “Machine learning-based survival prediction in colorectal cancer combining clinical and biological features” was published in Volume 16 of Oncotarget by an international team of researchers. It demonstrates how machine learning can integrate these two domains to achieve highly accurate survival predictions.
The team’s investigation demonstrates that combining clinical features—such as pathological stage, age, and lymph node status—with biological markers—including the E2F8 gene and hsa-miR-495-3p—can significantly improve the ability to predict patient survival.
The Method: Integrating Clinical and Biological Data
The researchers constructed a three-phase pipeline using data from 545 colorectal cancer patients from The Cancer Genome Atlas (TCGA) database. The data spanned colon, rectum, and rectosigmoid junction cancers, with patient ages ranging from 31 to 90 years.
In the first phase, data pre-processing, the team extracted and normalized both clinical and biological features. For biological features, they performed differential expression analysis, constructed competing endogenous RNA (ceRNA) networks, and conducted survival analysis to identify 19 candidate molecules—including mRNAs, lncRNAs, and miRNAs—with potential roles in CRC prognosis. For clinical features, they selected 13 characteristics, including age, pathological stage, lymph node counts, chemotherapy status, and new tumor events.
To handle missing data, they created three distinct cases: Case 1 excluded patients with missing biological or core clinical features; Case 2 additionally excluded patients with missing demographic features such as race and weight; and Case 3 replaced missing values with the most frequent category.
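The three cases can be pictured with a small sketch. This is only an illustration under assumptions: the records, field names, and the choice to build Case 3 on top of Case 1 using race as the single demographic field are hypothetical, not taken from the paper.

```python
from collections import Counter

# Hypothetical toy records; None marks a missing value.
patients = [
    {"id": "P1", "E2F8": 2.1, "stage": "II", "race": "white"},
    {"id": "P2", "E2F8": 1.4, "stage": "III", "race": None},
    {"id": "P3", "E2F8": None, "stage": "I", "race": "asian"},
    {"id": "P4", "E2F8": 3.0, "stage": "IV", "race": "white"},
    {"id": "P5", "E2F8": 0.9, "stage": None, "race": "black"},
]

CORE = ["E2F8", "stage"]   # one biological and one core clinical field (assumed)
DEMOGRAPHIC = ["race"]     # stand-in for demographic fields such as race and weight


def drop_missing(rows, fields):
    """Keep only rows where every listed field is present."""
    return [r for r in rows if all(r[f] is not None for f in fields)]


def impute_most_frequent(rows, fields):
    """Return copies of rows with missing values replaced by each field's mode."""
    filled = [dict(r) for r in rows]
    for f in fields:
        observed = [r[f] for r in rows if r[f] is not None]
        mode = Counter(observed).most_common(1)[0][0]
        for r in filled:
            if r[f] is None:
                r[f] = mode
    return filled


case1 = drop_missing(patients, CORE)             # complete core data
case2 = drop_missing(case1, DEMOGRAPHIC)         # also complete demographics
case3 = impute_most_frequent(case1, DEMOGRAPHIC) # fill demographics with the mode

print([r["id"] for r in case1], [r["id"] for r in case2], case3[1]["race"])
```

The trade-off is visible even at this scale: stricter filtering gives cleaner but smaller datasets, while imputation keeps more patients at the cost of guessed values.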
In the second phase, feature selection, the team applied LASSO (Least Absolute Shrinkage and Selection Operator) to rank features by importance, followed by SHAP (Shapley Additive Explanations) to understand each feature’s impact on survival prediction.
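To make the feature-selection idea concrete, here is a minimal sketch of LASSO fitted by coordinate descent, written in plain Python on a tiny regression problem. The data and feature names are hypothetical, and the study's own setup is not described at this level of detail. The final helper uses the known result that, for a linear model with independent features, a feature's SHAP value reduces to its weight times its deviation from the feature mean.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the dead zone."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, sweeps=100):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            # Correlation of feature j with the residual that leaves feature j out.
            rho = 0.0
            for i in range(n):
                partial = sum(w[k] * X[i][k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
            rho /= n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / scale
    return w


def linear_attributions(w, x, means):
    """Per-feature contribution w_j * (x_j - mean_j) for one sample."""
    return [wj * (xj - mj) for wj, xj, mj in zip(w, x, means)]


# Hypothetical centred features; y = 2.0*a + 0.5*b + 0.05*c.
names = ["feature_a", "feature_b", "feature_c"]
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [2.55, 1.45, -1.55, -2.45]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
ranking = sorted(zip(names, weights), key=lambda t: abs(t[1]), reverse=True)
selected = [name for name, wt in ranking if wt != 0.0]

means = [sum(col) / len(col) for col in zip(*X)]
contributions = linear_attributions(weights, X[0], means)
print(ranking, selected, contributions)
```

The weakest feature is shrunk to exactly zero and drops out, which is what lets LASSO act as a selector, while the attributions show how much each surviving feature pushes one patient's prediction up or down.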
In the third phase, model construction, they trained and compared six machine learning classifiers: Logistic Regression (LR), Support Vector Machine (SVM), Random Forest (RF), AdaBoost (AB), Stacking, and Voting.
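As a rough illustration of the Voting idea, the sketch below combines three simple rules by majority vote. The rules and their thresholds are hypothetical stand-ins for trained classifiers and are not taken from the study; label 1 denotes a predicted death and 0 a predicted survival.

```python
from collections import Counter


# Hypothetical base "classifiers": each maps a patient record to 0 or 1.
def rule_stage(patient):
    return 1 if patient["stage"] >= 3 else 0


def rule_age(patient):
    return 1 if patient["age"] >= 70 else 0


def rule_nodes(patient):
    return 1 if patient["positive_nodes"] >= 4 else 0


def hard_vote(classifiers, patient):
    """Return the label predicted by the majority of the classifiers."""
    votes = [clf(patient) for clf in classifiers]
    return Counter(votes).most_common(1)[0][0]


classifiers = [rule_stage, rule_age, rule_nodes]
patient_a = {"stage": 3, "age": 55, "positive_nodes": 5}  # votes: 1, 0, 1
patient_b = {"stage": 2, "age": 75, "positive_nodes": 0}  # votes: 0, 1, 0

print(hard_vote(classifiers, patient_a), hard_vote(classifiers, patient_b))
```

Stacking differs in that a second-level model learns how to weight the base predictions instead of counting them equally.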
Key Findings: Features That Matter Most
Across the three data cases, certain features consistently emerged as critical for predicting survival.
Among biological features, E2F8 stood out as the most significant, appearing in the models for all three data cases. This gene, known to be associated with cell proliferation and CRC staging, has been identified by other studies as a potential CRC biomarker. WDR77 and hsa-miR-495-3p also proved important in most groups, consistent with previous research linking them to cancer development.
Among clinical features, pathological stage consistently ranked as the most influential predictor. Higher stage correlated strongly with lower survival probability. Age, new tumor event (likely representing recurrence), lymph node count, and chemotherapy status also emerged as critical factors.
Notably, the study found that models combining these features outperformed those relying on clinical or biological data alone.
Predictive Performance: Accuracy Reaches 89.58%
The machine learning models achieved impressive results. For Case 1 (filtered for core clinical features), an SVM model achieved 86.87% accuracy with an AUC of 83.49%. For Case 2 (more strictly filtered), an AdaBoost model achieved the best overall performance: 89.58% accuracy, though with a lower AUC of 76.50% due to dataset size limitations. For Case 3 (with imputed missing values), a Voting ensemble achieved 82.57% accuracy.
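Accuracy and AUC answer different questions, so they can diverge. The toy example below, with hypothetical labels and scores, shows one general way this happens: when most patients belong to one class, a model can score high accuracy while ranking patients no better than chance. It is an illustration of the two metrics only, not an explanation of the study's own numbers.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(predictions, labels)) / len(labels)


def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, t in zip(scores, labels) if t == 1]
    negatives = [s for s, t in zip(scores, labels) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical cohort: 8 survivors (label 0) and 2 deaths (label 1).
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.45, 0.48, 0.25, 0.35]

print(accuracy(labels, scores), auc(labels, scores))
```

Here every patient is predicted to survive, which is right 80% of the time, yet the scores separate the two groups no better than a coin flip (AUC of 0.5).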
Bootstrap analysis confirmed that these advanced models provided meaningful improvements over baseline logistic regression, with accuracy increases ranging from 4.6% to 11.1%.
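One common way to run such a comparison is a paired percentile bootstrap: resample patients with replacement, recompute both accuracies on each resample, and look at the spread of the difference. The sketch below follows that recipe on hypothetical correctness flags; the article does not describe the authors' exact procedure, so treat this as an assumption.

```python
import random


def bootstrap_accuracy_gain(baseline_correct, model_correct,
                            n_resamples=2000, seed=0):
    """Return a 95% percentile interval for model accuracy minus baseline accuracy.

    Both inputs hold 1 (correct) or 0 (wrong) for the same patients in the same order.
    """
    rng = random.Random(seed)
    n = len(baseline_correct)
    gains = []
    for _ in range(n_resamples):
        # Draw the same resampled patients for both models (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        base = sum(baseline_correct[i] for i in idx) / n
        model = sum(model_correct[i] for i in idx) / n
        gains.append(model - base)
    gains.sort()
    lower = gains[n_resamples * 25 // 1000]
    upper = gains[n_resamples * 975 // 1000 - 1]
    return lower, upper


# Hypothetical test set of 40 patients: baseline right on 30, model right on 35.
baseline_correct = [1] * 30 + [0] * 10
model_correct = [1] * 35 + [0] * 5

observed_gain = sum(model_correct) / 40 - sum(baseline_correct) / 40
lower, upper = bootstrap_accuracy_gain(baseline_correct, model_correct)
print(observed_gain, lower, upper)
```

If the interval stays above zero, the improvement is unlikely to be an artifact of which patients happened to land in the test set.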
Biological Insights: The ceRNA Network Perspective
The 19 candidate molecules used as biological features were not chosen arbitrarily. They originated from a prior analysis by the same research group that constructed competing endogenous RNA (ceRNA) networks—complex regulatory systems where mRNAs, lncRNAs, and miRNAs cross-regulate each other through shared microRNA response elements.
This ceRNA framework is particularly relevant in cancer, where disruptions to these networks can drive tumor progression. By incorporating molecules from these networks, the study captured not just individual biomarkers but the broader regulatory context in which they operate.
Clinical Implications and Future Directions
The study’s findings carry several implications for clinical practice and future research.
First, they validate the prognostic value of well-established clinical factors—age, stage, lymph node status—while also highlighting novel molecular markers like E2F8 that warrant further investigation. Second, they demonstrate that machine learning can effectively integrate diverse data types to generate clinically useful predictions. Third, they underscore the importance of complete data collection; missing clinical information, such as race and weight, limited the analysis and may introduce bias.
The authors acknowledge limitations, including the relatively small dataset (545 patients), the exclusive use of US-based TCGA data, and the lack of experimental validation for the identified biomarkers. They call for future studies with larger, more diverse cohorts and for further investigation into the molecular mechanisms linking E2F8, miR-495-3p, and WDR77 to CRC prognosis.
Future Perspectives and Conclusion
This study does not claim to have developed a clinically deployable tool. Rather, it offers a proof-of-concept that machine learning can meaningfully integrate clinical and biological data to predict colorectal cancer survival. By combining LASSO feature selection with SHAP interpretability and ensemble modeling, the team demonstrates a pipeline that balances predictive power with biological insight.
The perspective that emerges is one where the future of cancer prognosis lies not in choosing between clinical or molecular data, but in systematically combining them. As the authors note, even basic patient information—age, weight, lymph node status—when accurately recorded and integrated with molecular profiles, can contribute powerfully to our understanding of disease trajectory.
Continued research will be needed to validate these findings in independent cohorts, to expand the set of biological features, and ultimately to translate these models into tools that can guide treatment decisions and improve outcomes for patients with colorectal cancer.
Click here to read the full research paper published in Oncotarget.
_______
Oncotarget is an open-access, peer-reviewed journal that has published primarily oncology-focused research papers since 2010. These papers are available to readers (at no cost and free of subscription barriers) in a continuous publishing format at Oncotarget.com.
Oncotarget is indexed and archived by PubMed/Medline, PubMed Central, Scopus, EMBASE, META (Chan Zuckerberg Initiative) (2018-2022), and Dimensions (Digital Science).
For media inquiries, please contact media@impactjournals.com.


