Preparation articles metadata for batch import into DSpace repository

Authors

  • Oleg S. Reznichenko, Belgorod National Research University

DOI:

https://doi.org/10.52575/2687-0932-2021-48-3-564-577

Keywords:

Institutional Repository, Scopus, Web of Science, DSpace, Microsoft Excel, Python, pandas.DataFrame

Abstract

Manually importing metadata records for research articles into a DSpace institutional repository takes a lot of time, even when the input data are exported from the Scopus and Web of Science databases and already have a format close to the Dublin Core Metadata Element Set. To solve the problem of transforming and combining the data, and of integrating the article PDFs into the final metadata archive, several algorithms were developed. The algorithms use Microsoft Office Excel and free software. In addition, Python scripts built on the "pandas" library were created to automate most of the routine operations: combining the Scopus and Web of Science exports into a single file, excluding duplicate records, converting the author record format, and excluding records that already exist in the DSpace repository. These algorithms and software tools help to create a Simple Archive Format file for batch import into the DSpace repository, and they demonstrated a 29-fold reduction in time compared with entering the metadata manually.
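The routine operations listed in the abstract can be illustrated with a minimal pandas sketch. It is not the author's script: the column mappings, the DOI-based duplicate check, the author-string formats, the `||` multi-value separator and all sample records are assumptions made up for illustration, and the helper names `convert_authors`, `normalize_doi` and `prepare_records` are hypothetical.

```python
import pandas as pd

# Assumed column mappings; real exports may use different headers.
SCOPUS_COLUMNS = {"Title": "dc.title", "Authors": "dc.contributor.author",
                  "DOI": "dc.identifier.doi"}
WOS_COLUMNS = {"TI": "dc.title", "AU": "dc.contributor.author",
               "DI": "dc.identifier.doi"}
TARGET = ["dc.title", "dc.contributor.author", "dc.identifier.doi"]


def normalize_doi(doi):
    """Strip and lower-case a DOI so the same article matches across sources."""
    return str(doi).strip().lower()


def convert_authors(raw, sep):
    """Convert a separated author string to 'Surname, Initials||Surname, Initials'."""
    converted = []
    for name in (part.strip() for part in raw.split(sep)):
        if not name:
            continue
        surname, _, initials = name.partition(" ")
        surname = surname.rstrip(",")  # some inputs already carry the comma
        converted.append(f"{surname}, {initials}" if initials else surname)
    return "||".join(converted)


def prepare_records(scopus, wos, existing_dois):
    """Combine both exports, drop duplicates and records already in the repository."""
    scopus = scopus.rename(columns=SCOPUS_COLUMNS)[TARGET].copy()
    wos = wos.rename(columns=WOS_COLUMNS)[TARGET].copy()
    author = "dc.contributor.author"
    scopus[author] = scopus[author].map(lambda s: convert_authors(s, ", "))
    wos[author] = wos[author].map(lambda s: convert_authors(s, "; "))

    combined = pd.concat([scopus, wos], ignore_index=True)
    combined["doi_key"] = combined["dc.identifier.doi"].map(normalize_doi)
    # Assumption: two records with the same DOI describe the same article.
    combined = combined.drop_duplicates(subset="doi_key", keep="first")
    known = {normalize_doi(d) for d in existing_dois}
    combined = combined[~combined["doi_key"].isin(known)]
    return combined.drop(columns="doi_key").reset_index(drop=True)


# Made-up sample records standing in for the two database exports.
scopus_export = pd.DataFrame({
    "Title": ["Paper A", "Paper B"],
    "Authors": ["Ivanov I.I., Petrov P.P.", "Sidorov S.S."],
    "DOI": ["10.1000/a", "10.1000/b"],
})
wos_export = pd.DataFrame({
    "TI": ["Paper A", "Paper C"],
    "AU": ["Ivanov, II; Petrov, PP", "Smirnov, AA"],
    "DI": ["10.1000/A", "10.1000/c"],
})
already_in_dspace = ["10.1000/b"]

result = prepare_records(scopus_export, wos_export, already_in_dspace)
print(result)
```

In this sketch the Scopus copy of a duplicated record wins because it comes first in the concatenation. Matching on DOI alone would miss records that lack a DOI, so a real workflow would likely need an additional title-based comparison.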

Author Biography

Oleg S. Reznichenko, Belgorod National Research University

Senior Lecturer, Department of Applied Information Science and Information Technologies, Institute of Engineering and Digital Technologies, Belgorod State University, Belgorod, Russia

Published

2021-09-30

How to Cite

Reznichenko, O. S. (2021). Preparation articles metadata for batch import into DSpace repository. Economics. Information Technologies, 48(3), 564-577. https://doi.org/10.52575/2687-0932-2021-48-3-564-577

Issue

Section

SYSTEM ANALYSIS AND PROCESSING OF KNOWLEDGE