Using Machine Learning Methods to Solve the Problem of Predicting the Development of Abnormalities in Living Systems

Yaroslav A. Marenkov; Arseniy S. Lomakin; Anastasia R. Donsckaia; Anna A. Panina

doi:10.52575/2687-0932-2025-52-3-665-680

Authors

Yaroslav A. Marenkov Volgograd State Technical University https://orcid.org/0009-0005-7314-8924
Arseniy S. Lomakin Volgograd State Technical University https://orcid.org/0009-0001-9340-1748
Anastasia R. Donsckaia Volgograd State Technical University; Volgograd State Medical University of Public Health Ministry of the Russian Federation https://orcid.org/0000-0003-3086-4929
Anna A. Panina Volgograd State Medical University of Public Health Ministry of the Russian Federation https://orcid.org/0000-0003-2750-8579

Keywords:

machine learning, interpretation of machine learning model predictions, machine learning models, machine learning methods, early diagnosis of cancer

Abstract

The aim of this research is to partially automate early cancer diagnosis by using machine learning to flag suspected cancer in a patient and to provide interpretive information alongside the prediction. Cancers of various organs are among the deadliest human diseases worldwide, and a large share of the population is at risk. According to data from the Ministry of Health of Russia and Rosstat calculations for 2005–2023, between 427 thousand and 552 thousand new patients were identified in Russia each year. In the search for effective and accurate methods of early cancer diagnosis, medicine has increasingly turned to computer technologies, including machine learning methods. The main drawback of these methods is that their results are difficult for humans to interpret, so an interpretable method for identifying suspected cancer is needed. This paper addresses the interpretability of predictions made by machine learning models for the task of predicting the development of abnormalities in living systems, using human pancreatic cancer as an example. An interpretable machine learning model was developed from extended blood test results labeled by whether or not the patient has pancreatic cancer. Local interpretability was provided by the SHAP (SHapley Additive exPlanations) method, visualized as a waterfall plot. Several classical machine learning models were trained and compared to identify the most accurate one. Interpretive information was then generated from the selected model, and a user application was developed for interacting with it. The Random Forest model performed best, with an F1 score of 0.859.
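
To illustrate the idea of local interpretability behind SHAP, the minimal Python sketch below computes exact Shapley values by brute-force enumeration for a hypothetical three-feature risk score. The feature names, the toy scoring function, and the all-zero baseline are assumptions made purely for illustration; they are not the authors' model or data, and the shap library is not used. A single reference point stands in for the background distribution that SHAP normally averages over.

```python
from itertools import combinations
from math import factorial

# Hypothetical blood-test features (illustrative names, not from the paper).
FEATURES = ["ca19_9", "bilirubin", "glucose"]


def toy_model(x):
    """Toy risk score with one interaction term; NOT the authors' model."""
    return 0.5 * x["ca19_9"] + 0.2 * x["bilirubin"] + 0.1 * x["ca19_9"] * x["glucose"]


def value(subset, x, baseline):
    """Model output when features in `subset` take the patient's values
    and every other feature is fixed at its baseline value."""
    mixed = {f: (x[f] if f in subset else baseline[f]) for f in FEATURES}
    return toy_model(mixed)


def shapley_values(x, baseline):
    """Exact Shapley values obtained by enumerating all feature coalitions."""
    n = len(FEATURES)
    phi = {}
    for f in FEATURES:
        others = [g for g in FEATURES if g != f]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for coalition in combinations(others, k):
                s = set(coalition)
                # Marginal contribution of f when added to coalition s.
                total += weight * (value(s | {f}, x, baseline) - value(s, x, baseline))
        phi[f] = total
    return phi


baseline = {"ca19_9": 0.0, "bilirubin": 0.0, "glucose": 0.0}  # assumed reference
patient = {"ca19_9": 2.0, "bilirubin": 1.0, "glucose": 3.0}   # made-up values

base_value = value(set(), patient, baseline)
phi = shapley_values(patient, baseline)
prediction = toy_model(patient)

# Text version of a waterfall: start at the base value, add one contribution at a time.
running = base_value
for name, contribution in sorted(phi.items(), key=lambda kv: -abs(kv[1])):
    running += contribution
    print(f"{name:>10}: {contribution:+.3f} -> {running:.3f}")
```

The contributions add up to the difference between the prediction and the base value, which is the property a waterfall plot visualizes. Brute-force enumeration grows exponentially with the number of features, so practical tools rely on faster model-specific or approximate algorithms.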

Author Biographies

Yaroslav A. Marenkov, Volgograd State Technical University

First-year master's student at the Department of Software Engineering, Volgograd State Technical University, Volgograd, Russia.
E-mail: yaroslavmarenkov5@gmail.com
ORCID: 0009-0005-7314-8924

Arseniy S. Lomakin, Volgograd State Technical University

First-year master's student at the Department of Software Engineering, Volgograd State Technical University, Volgograd, Russia.
E-mail: arseny.lomakin@gmail.com
ORCID: 0009-0001-9340-1748

Anastasia R. Donsckaia, Volgograd State Technical University; Volgograd State Medical University of Public Health Ministry of the Russian Federation

Senior Lecturer at the Department of Software Engineering, Volgograd State Technical University, Volgograd, Russia; Senior Lecturer at the Department of Clinical Engineering and Artificial Intelligence Technologies, Volgograd State Medical University of the Ministry of Health of Russia, Volgograd, Russia.
E-mail: donsckaia.anastasiya@yandex.ru
ORCID: 0000-0003-3086-4929

Anna A. Panina, Volgograd State Medical University of Public Health Ministry of the Russian Federation

MD, Associate Professor of the Department of Radiation, Functional, Laboratory Diagnostics, Volgograd State Medical University of the Ministry of Health of Russia, Volgograd, Russia.
E-mail: panina0675@yandex.ru
ORCID: 0000-0003-2750-8579

References

List of Sources

Incidence of socially significant diseases in the population: Healthcare // Rosstat: official website. – URL: https://rosstat.gov.ru/folder/13721 (accessed: 20.03.2025). (in Russian)

DecisionTreeClassifier: [API Reference] // Scikit-learn: official website. – URL: https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html (accessed: 08.05.2025).

RandomForestClassifier: [API Reference] // Scikit-learn: official website. – URL: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html (accessed: 08.05.2025).

HistGradientBoostingClassifier: [API Reference] // Scikit-learn: official website. – URL: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingClassifier.html (accessed: 08.05.2025).

KNeighborsClassifier: [API Reference] // Scikit-learn: official website. – URL: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html (accessed: 08.05.2025).

SVC: [API Reference] // Scikit-learn: official website. – URL: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html (accessed: 08.05.2025).

Ivanko, A.F. Information Systems in Publishing: a textbook / A.F. Ivanko, M.A. Ivanko. – Saint Petersburg: Lan, 2022. – 148 p. – ISBN 978-5-8114-3843-3. (in Russian)

Bibliography

Barchuk A., Atroshchenko A., Gaidukov V., Vinogradov P., Tarakanov S., Kanaev S., Arsen'ev A., Komarov Yu., Kharitonov M., Barchuk A., Merabishvili V., Kuznetsov V., Trofimov V., Gusarova N., Kotsyuba I., Belyaev A., Podol'skii M., Nefedova A., 2017. Automated diagnosis in a population-based screening for lung cancer. Problems in Oncology, 63(2): 215–220. https://doi.org/10.37469/0507-3758-2017-63-2-215-220. (in Russian)

Gundyrev I.A., Bel'skaya L.V., Kosenok V.K., Sarf E.A., 2018. The use of synthetic images for solving the classification problem by the example of lung cancer diagnosis. Annals of the Russian Academy of Medical Sciences, 73(2): 96–104. https://doi.org/10.15690/vramn946. (in Russian)

Bates S., Hastie T., Tibshirani R., 2023. Cross-Validation: What Does It Estimate and How Well Does It Do It. Journal of the American Statistical Association, 119(546): 1434–1445. https://doi.org/10.1080/01621459.2023.2197686.

Berger V., Zhou Y., eds. Balakrishnan N., Colton T., Everitt B., Piegorsch W., Ruggeri F., Teugels J.L., 2014. Kolmogorov–Smirnov Test: Overview. Wiley StatsRef: Statistics Reference Online. https://doi.org/10.1002/9781118445112.stat06558.

Cunningham P., Delany S.J., 2021. K-Nearest Neighbour Classifiers - A Tutorial. ACM Computing Surveys, 54(6): article number 128. https://doi.org/10.1145/3459665.

Demidova L.A., 2023. A Novel Approach to Decision-Making on Diagnosing Oncological Diseases Using Machine Learning Classifiers Based on Datasets Combining Known and/or New Generated Features of a Different Nature. Mathematics, 11(4): article number 792. https://doi.org/10.3390/math11040792.

Guryanov A., 2019. Histogram-Based Algorithm for Building Gradient Boosting Ensembles of Piecewise Linear Decision Trees. Analysis of Images, Social Networks and Texts: 8th International Conference, AIST 2019. Cham: Springer. 39–50. https://doi.org/10.1007/978-3-030-37334-4_4.

Hakkoum H., Idri A., Abnane I., 2024. Global and local interpretability techniques of supervised machine learning black box models for numerical medical data. Engineering Applications of Artificial Intelligence, 131. https://doi.org/10.1016/j.engappai.2023.107829.

Li Z., 2022. Extracting spatial effects from machine learning model using local interpretation method: An example of SHAP and XGBoost. Computers, Environment and Urban Systems, 96: article number 101845. https://doi.org/10.1016/j.compenvurbsys.2022.101845.

Lundberg S.M., Lee S., 2017. A unified approach to interpreting model predictions. Advances in neural information processing systems. Red Hook: Curran Associates Inc. 4768–4777. https://doi.org/10.48550/arXiv.1705.07874.

McKnight P.E., Najab J., eds. Weiner I.B., Craighead W.E., 2010. Mann-Whitney U Test. The Corsini Encyclopedia of Psychology. https://doi.org/10.1002/9780470479216.corpsy0524.

Mustafa A.D., Abdulazeez A.M., 2021. Machine Learning Applications based on SVM Classification A Review. Qubahan Academic Journal, 1(2): 81–90. https://doi.org/10.48161/qaj.v1n2a50.

Pelegrina G.D., Duarte L.T., Grabisch M., 2023. A k-additive Choquet integral-based approach to approximate the SHAP values for local interpretability in machine learning. Artificial Intelligence, 325: article number 104014. https://doi.org/10.48550/arXiv.2211.02166.

Priyanka, Kumar D., 2020. Decision tree classifier: a detailed survey. International Journal of Information and Decision Sciences, 12(3): 246–269. https://doi.org/10.1504/ijids.2020.10029122.

Qui J., Wu Y., Chen J., Hui B., Huang Z., Ji L., 2018. A texture analysis method based on statistical contourlet coefficient applied to the classification of pancreatic cancer and normal pancreas. Proceedings of the International Symposium on Big Data and Artificial Intelligence (ISBDAI '18). New York: ACM. 9–13. https://doi.org/10.1145/3305275.3305278.

Raju V.N.G., Lakshmi K.P., Jain V.M., Kalidindi A., Padma V., 2020. Study the Influence of Normalization/Transformation process on the Accuracy of Supervised Classification. Third International Conference on Smart Systems and Inventive Technology (ICSSIT). Tirunelveli: IEEE. 729–735. https://doi.org/10.1109/ICSSIT48917.2020.9214160.

Reddy N.S., Khanaa V., 2023. Intelligent deep learning algorithm for lung cancer detection and classification. Bulletin of Electrical Engineering and Informatics, 12(3): 1747–1754. https://doi.org/10.11591/eei.v12i3.4579.

Schonlau M., Zou R.Y., 2020. The random forest algorithm for statistical learning. The Stata Journal, 20(1): 3–29. https://doi.org/10.1177/1536867X20909688.

Soofi A.A., Awan A., 2017. Classification Techniques in Machine Learning: Applications and Issues. Journal of Basic & Applied Sciences, 13: 459–465. https://doi.org/10.6000/1927-5129.2017.13.76.

Stiglic G., Kocbek P., Fijacko N., Zitnik M., Verbert K., Cilar L., 2020. Interpretability of machine learning-based prediction models in healthcare. WIREs Data Mining and Knowledge Discovery, 10(5): e1379. https://doi.org/10.1002/widm.1379.

Yang J., Rahardja S., Fränti P., 2019. Outlier detection: how to threshold outlier scores. Proceedings of the International Conference on Artificial Intelligence, Information Processing and Cloud Computing (AIIPCC '19). New York: ACM. 1–6. https://doi.org/10.1145/3371425.3371427.

References

Barchuk A., Atroshchenko A., Gaidukov V., Vinogradov P., Tarakanov S., Kanaev S., Arsen'ev A., Komarov Yu., Kharitonov M., Barchuk A., Merabishvili V., Kuznetsov V., Trofimov V., Gusarova N., Kotsyuba I., Belyaev A., Podol'skii M., Nefedova A., 2017. Automated diagnosis in a population-based screening for lung cancer. Problems in Oncology, 63(2): 215–220. https://doi.org/10.37469/0507-3758-2017-63-2-215-220. (in Russian)

Gundyrev I.A., Bel'skaya L.V., Kosenok V.K., Sarf E.A., 2018. The use of synthetic images for solving the classification problem by the example of lung cancer diagnosis. Annals of the Russian academy of medical sciences, 73(2): 96–104. https://doi.org/10.15690/vramn946. (in Russian)