Applying Machine Learning Methods to Solve the Problem of Predicting the Progression of Abnormalities in Living Systems
DOI:
https://doi.org/10.52575/2687-0932-2025-52-3-665-680
Keywords:
machine learning, interpretation of machine learning model predictions, machine learning models, machine learning methods, early diagnosis of cancer
Abstract
The purpose of this research is to partially automate early cancer diagnosis by using machine learning to flag suspected cancer in a patient and to provide interpretive information alongside the prediction. Cancers of various organs are among the deadliest human diseases worldwide, and a large share of the population is at risk. According to data from the Ministry of Health of Russia and Rosstat calculations for 2005–2023, between 427 thousand and 552 thousand new patients were identified in Russia each year. Computer technologies, including machine learning methods, are increasingly used in medicine in the search for effective and accurate methods of early cancer diagnosis. The main disadvantage of these methods is that their results are hard for humans to interpret, so an interpretable method for detecting suspected cancer is needed. This paper addresses the interpretability of predictions made by machine learning models for predicting the development of abnormalities in living systems, using human pancreatic cancer as an example. An interpretable machine learning model was developed from extended blood test results labeled with whether or not the patient has pancreatic cancer. Local interpretability was provided by the SHAP (SHapley Additive exPlanations) method, visualized as a waterfall plot. Several classical machine learning models were trained and compared to identify the most accurate one. Interpretive information was generated from the selected model, and a user application was developed for interacting with it. The Random Forest model performed best, with an F1-score of 0.859.
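The local interpretation described in the abstract rests on the additive property of Shapley values: a baseline output plus one contribution per feature equals the model's output for the patient, which is what a waterfall plot displays step by step. The sketch below illustrates that property with an exact Shapley computation in plain Python. It is not the authors' model or the SHAP library's API: the scoring function `toy_risk_score`, the feature names `marker_a`, `marker_b`, `marker_c`, and all numeric values are hypothetical, and replacing absent features with a single baseline record is a simplifying assumption.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values for one instance.

    Features outside a coalition are replaced by their baseline values
    (a simplifying assumption made for this illustration).
    """
    names = list(instance)
    n = len(names)

    def value(coalition):
        # Coalition members take the instance's values, the rest the baseline's.
        mixed = {
            name: instance[name] if name in coalition else baseline[name]
            for name in names
        }
        return predict(mixed)

    result = {}
    for name in names:
        others = [other for other in names if other != name]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = set(subset)
                total += weight * (value(coalition | {name}) - value(coalition))
        result[name] = total
    return result


def toy_risk_score(x):
    """Hypothetical score with two linear terms and one interaction term."""
    score = 0.10 + 0.02 * x["marker_a"] + 0.01 * x["marker_b"]
    if x["marker_a"] > 5 and x["marker_c"] > 2:
        score += 0.15
    return score


# Hypothetical reference record and patient record.
baseline = {"marker_a": 2, "marker_b": 10, "marker_c": 1}
patient = {"marker_a": 8, "marker_b": 12, "marker_c": 3}

base_value = toy_risk_score(baseline)
prediction = toy_risk_score(patient)
contributions = shapley_values(toy_risk_score, patient, baseline)

# Text version of a waterfall: start at the baseline and add the
# contributions one by one, largest magnitude first.
running = base_value
print(f"baseline   {running:.3f}")
for name, phi in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    running += phi
    print(f"{name:<10} {phi:+.3f} -> {running:.3f}")
print(f"prediction {prediction:.3f}")
```

The final running total matches the model's output for the patient, so each line of the printout can be read as one bar of a waterfall plot. Exact enumeration visits every feature subset and is only practical for a handful of features; practical tools rely on approximations or model-specific algorithms.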
Copyright (c) 2025 Economics. Information Technologies

This work is licensed under a Creative Commons Attribution 4.0 International License.