Ribonucleic acid (RNA) virus and coronavirus in Google Dataset Search: their scope and epidemiological correlation

Authors

Blázquez-Ochando, M.; Prieto-Gutiérrez, J.-J.

DOI:

https://doi.org/10.3145/epi.2020.nov.28

Keywords:

Data, Datasets, Viruses, RNA viruses, Coronavirus, SARS-CoV-2, Covid-19, Pandemics, Data reuse, Google, Google Dataset Search, Data providers, Search engines, Information retrieval, Open science

Abstract

This paper analyzes the datasets on families of RNA viruses that can be retrieved via Google Dataset Search, using terminology taken from the National Cancer Institute (NCI) thesaurus developed by the US Department of Health and Human Services. The objective is to determine the scope and reuse capacity of the available data: the number of datasets and whether they are freely accessible, the proportion offered in reusable download formats, the main providers, the publication chronology, and the scientific provenance of the data. We also examine possible relationships between the publication of datasets and the main pandemics of the last 10 years. The results show that only 52% of the datasets are related to scientific research, and an even smaller fraction (15%) are reusable. There is also an upward trend in the publication of datasets, particularly in connection with the main epidemics, as clearly confirmed for the Ebola virus, Zika, SARS-CoV, H1N1, H1N5, and especially the SARS-CoV-2 coronavirus. Finally, the search engine has not yet implemented adequate methods for filtering and monitoring datasets. These results reveal some of the difficulties facing open science in the dataset field.

Published

2020-12-21

How to Cite

Blázquez-Ochando, M., & Prieto-Gutiérrez, J.-J. (2020). Ribonucleic acid (RNA) virus and coronavirus in Google Dataset Search: their scope and epidemiological correlation. Profesional de la información, 29(6). https://doi.org/10.3145/epi.2020.nov.28

Section

Artículos de investigación Covid-19 / Covid-19 research articles