Preparation articles metadata for batch import into DSpace repository
DOI:
https://doi.org/10.52575/2687-0932-2021-48-3-564-577
Keywords:
Institutional Repository, Scopus, Web of Science, DSpace, Microsoft Excel, Python, pandas.DataFrame
Abstract
Manually importing metadata records for research articles into a DSpace institutional repository takes a lot of time, even when the input data is exported from the Scopus and Web of Science databases and already has a format close to the Dublin Core Metadata Element Set. To solve the problem of transforming and combining the data, and of integrating the article PDFs into the final metadata archive, several algorithms were developed. The algorithms use Microsoft Office Excel and free software. In addition, software tools were written as Python scripts using the "pandas" library. They automate most of the routine operations: combining the Scopus and Web of Science exports into a single file, excluding duplicate records, converting the author record format, and excluding records that already exist in the DSpace repository. Using these algorithms and software tools helps to create a Simple Archive Format file for batch import into a DSpace repository and demonstrated a 29-fold reduction in time compared with manual metadata entry.
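The routine operations listed in the abstract can be illustrated with a minimal pandas sketch. It is not the authors' script (their code is in the appendix repository listed in the references). The column names `Title` and `Authors`, the helper names, and the sample records are assumptions made for illustration. Duplicates are matched here by exact comparison of normalized titles, which is a simplification; a real workflow might need fuzzy matching.

```python
import re

import pandas as pd


def normalize_title(title: str) -> str:
    """Lowercase a title and collapse punctuation and whitespace for matching."""
    return re.sub(r"[^a-z0-9]+", " ", str(title).lower()).strip()


def split_authors(authors: str) -> list[str]:
    """Turn 'Ivanov I.I., Petrov P.P.' into ['Ivanov, I.I.', 'Petrov, P.P.']."""
    result = []
    for name in str(authors).split(", "):
        surname, _, initials = name.strip().partition(" ")
        result.append(f"{surname}, {initials}" if initials else surname)
    return result


def prepare_records(scopus: pd.DataFrame, wos: pd.DataFrame,
                    existing: pd.DataFrame) -> pd.DataFrame:
    """Combine two exports, drop duplicates and records already in the repository."""
    # 1. Combine both exports into a single table.
    combined = pd.concat([scopus, wos], ignore_index=True)
    # 2. Exclude duplicate records, matched on the normalized title.
    combined["key"] = combined["Title"].map(normalize_title)
    combined = combined.drop_duplicates(subset="key", keep="first")
    # 3. Exclude records that already exist in the repository.
    existing_keys = set(existing["Title"].map(normalize_title))
    combined = combined[~combined["key"].isin(existing_keys)].copy()
    # 4. Convert the author string into a list of 'Surname, Initials' values.
    combined["Authors"] = combined["Authors"].map(split_authors)
    return combined.drop(columns="key").reset_index(drop=True)


# Hypothetical sample data standing in for the real database exports.
scopus = pd.DataFrame({
    "Title": ["Batch Import into DSpace", "Metadata Normalization"],
    "Authors": ["Ivanov I.I., Petrov P.P.", "Sidorov S.S."],
})
wos = pd.DataFrame({
    "Title": ["Batch import into DSpace.", "Open Repositories"],
    "Authors": ["Ivanov I.I., Petrov P.P.", "Smith J."],
})
existing = pd.DataFrame({"Title": ["Metadata normalization"]})

result = prepare_records(scopus, wos, existing)
print(result)
```

In this sample, the two spellings of the first title collapse into one record, and the title already present in the repository is dropped. Keeping the normalized key in a separate column leaves the original titles untouched for the final metadata file.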
References
Clarivate Analytics Web of Science. Available at: https://apps.webofknowledge.com/WOS_GeneralSearch_input.do?product=WOS&search_mode=GeneralSearch&SID=C3Qtws6Zp9bRCWtj7S7&preferencesSaved= (accessed 2 June 2021)
Deng Sai. 2010. Optimizing Workflow through Metadata Repurposing and Batch Processing. Journal of Library Metadata, 10(4): 219-237. Available at: https://www.tandfonline.com/doi/abs/10.1080/19386389.2010.524862 (accessed 2 June 2021). DOI: 10.1080/19386389.2010.524862
Dietz Peter. 2015. Simple Archive Format Packager. Available at: https://wiki.lyrasis.org/display/DSPACE/Simple+Archive+Format+Packager (accessed 2 June 2021)
DuraSpace DSpace – A Turnkey Institutional Repository Application. Available at: https://duraspace.org/dspace/ (accessed 2 June 2021)
Dublin Core™ Metadata Initiative. Available at: http://dublincore.org (accessed 2 June 2021)
Elsevier Scopus. Available at: https://www.scopus.com/search/form.uri?display=basic= (accessed 2 June 2021)
Bruns Dave. 2021. EXCELJET. Quick, clean, and to the point. Excel VLOOKUP Function. Available at: https://exceljet.net/excel-functions/excel-vlookup-function (accessed 2 June 2021)
Fedotova O.A., Fedotov A.N., Zhizhimov O.L., Sambetbayeva M.A. 2020. Digital Repository for Research and Education Information Systems. Proceedings of SPSTL SB RAS, 3: 23-28. Available at: https://proceedings.gpntbsib.ru/jour/article/view/7 (accessed 2 June 2021). DOI: 10.20913/2618-7515-2019-3-23-28 (in Russian)
Bicking Ian, Leidel Jannis. 2021. fuzzywuzzy PyPI. Available at: https://pypi.org/project/fuzzywuzzy/ (accessed 2 June 2021)
Gafurova P.O., Elizarov A.M., Lipachev E.K., Khammatova D.M. 2020. Metadata Normalization Methods in the Digital Mathematical Library. CEUR Workshop Proceedings, 2543: 136–148. Available at: http://ceur-ws.org/Vol-2543/rpaper13.pdf (accessed 2 June 2021)
Kim Jensen. 2021. Advanced Renamer. Batch file renaming utility for Windows. Available at: https://www.advancedrenamer.com (accessed 2 June 2021)
JetBrains PyCharm: The Python IDE for Professional Developers. Available at: https://www.jetbrains.com/pycharm/ (accessed 2 June 2021)
Nash Jacob L., Wheeler Jonathan. 2016. Desktop Batch Import Workflow for Ingesting Heterogeneous Collections: A Case Study with DSpace 5. D-Lib Magazine, 22 (1–2). Available at: http://www.dlib.org/dlib/january16/nash/01nash.html (accessed 2 June 2021). DOI: 10.1045/january2016-nash
OpenDOAR. Browse by Country and Region. Available at: https://v2.sherpa.ac.uk/view/repository_by_country/Russian_Federation.software_name.html (accessed 2 June 2021)
Oracle Java SE Runtime Environment 8. Available at: https://www.oracle.com/java/technologies/java-se-glance.html (accessed 2 June 2021)
Wood Andrew. 2021. pandas.DataFrame. Available at: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html (accessed 2 June 2021)
Rachum Ram. 2021. re – Regular expression operations. Available at: https://docs.python.org/3/library/re.html (accessed 2 June 2021)
Registry of Open Access Repositories. Available at: http://roar.eprints.org/cgi/roar_search/advanced?location_country=ru&software=&type=&order=-recordcount%2F-date (accessed 2 June 2021)
Weterings Niels. 2021. Text to Columns – Easy Excel Tutorial. Available at: https://www.excel-easy.com/examples/text-to-columns.html (accessed 2 June 2021)
Walsh Maureen P. 2010. Batch Loading Collections into DSpace: Using Perl Scripts for Automation and Quality Control. Information Technology and Libraries, 29(3): 117–127. Available at: https://ejournals.bc.edu/index.php/ital/article/view/3137 (accessed 2 June 2021). DOI: 10.6017/ital.v29i3.3137
What is Power Query? Available at: https://powerquery.microsoft.com/en-us/ (accessed 2 June 2021)
Reznichenko Oleg. 2021. Appendix to article "Preparation articles metadata for batch import into DSpace repository" Available at: https://github.com/leo-phoenix/dspace_batch_import (accessed 2 June 2021)
Copyright (c) 2021 ECONOMICS. INFORMATION TECHNOLOGIES
![Creative Commons License](http://i.creativecommons.org/l/by/4.0/88x31.png)
This work is licensed under a Creative Commons Attribution 4.0 International License.