Comparison of metadata with relevance for bibliometrics between Microsoft Academic Graph and OpenAlex until 2020

Microsoft Academic Graph (MAG) has been studied extensively with regard to its suitability for bibliometric evaluations. In May 2021, it was announced that MAG would be retired on December 31, 2021. Soon after that, the non-profit organization OurResearch, which aims to provide 'a fully open catalog of the global research system', announced that it would preserve and incorporate the last full MAG corpus, excluding only patent data, and would continue and hopefully improve it. Following the launch of OpenAlex in January 2022, it is of interest to know whether the usefulness of the MAG data is preserved or even improved in OpenAlex. To this end, we compared metadata of MAG and OpenAlex that are relevant for bibliometric analyses (in particular field and time normalization of citations): the coverage of documents over the years, the agreement of bibliographic data, the number of references of each document, the kind and distribution of document types, and the distribution and relation of subject classifications.


Introduction
Since its launch in 2015, Microsoft Academic Graph (MAG; Sinha et al., 2015) had been a promising new data source for bibliometric analyses due to its large coverage and its set of available metadata (Harzing & Alakangas, 2017). MAG has therefore been the object of many studies, in particular comparisons with other important bibliographic databases. In one of the last and largest of these, Visser, van Eck, and Waltman (2021) compared MAG with Web of Science, Scopus, Dimensions, and Crossref. In May 2021, the Microsoft Blog (2021) announced that the Microsoft Academic website, application programming interfaces, and snapshots would be retired on December 31, 2021. Soon after that, the non-profit organization OurResearch, which aims to provide "a fully open catalog of the global research system" (OurResearch, 2021), announced that it would preserve and incorporate the last full MAG corpus, excluding only patent data, and would continue and hopefully improve it. Crossref was to be another main source of data. In January 2022, OpenAlex (http://docs.openalex.org) was launched, providing API access to its services as well as data dumps for any purpose. Curtin University's Open Knowledge Initiative (COKI) has already started to monitor the development of OpenAlex, in particular assessing and comparing the value added by OpenAlex to MAG and to Crossref, both in the coverage of publications and of other research output (Kramer, 2022). Scheidsteger, Haunschild, Hug, and Bornmann (2018) studied the possibility of using MAG data for the calculation of field- and time-normalized scores. They compared the scores derived from fields of study and coverage in MAG to those derived from subject categories and coverage in Web of Science (WoS). In the present study, we compare metadata of MAG and OpenAlex that are relevant for bibliometric analyses (in particular field and time normalization of citations): the coverage of documents over the years, the agreement of bibliographic data, the number of references of each document, the kind and distribution of document types, and the distribution and relation of subject classifications.

Microsoft Academic Graph (MAG)
We downloaded the Microsoft Academic Graph (MAG) data set via the Microsoft Azure portal at the end of December 2021 and received data timestamped 6 December 2021 (Sinha et al., 2015; Tang et al., 2008). We were not able to obtain newer data at the beginning of 2022, after the official expiration date of the MAG service. According to the OpenAlex Migration Guide (OpenAlex, 2021), no patents have been transferred from MAG to OpenAlex. Therefore, we excluded all items with document type Patent from the comparison. To make the two databases easier to distinguish, we keep the capitalization of the document type names as used in each database. In particular, MAG types are written with initial capitals. Because the MAG data do not cover the full year 2021, we restricted our analyses to publication years before 2021. Thus, we considered 197,445,041 papers in MAG.

OpenAlex
For more details on the approach and the structure of OpenAlex, see Priem, Piwowar, and Orr (2022).

Coverage of publication years in both databases
Only 777 IDs from MAG are not incorporated in OpenAlex, starting with one item in 1952 and reaching a maximum of 201 in 2020. Of these missing items, about 40% each have the MAG document types Journal and None, and about 15% have the type BookChapter. Over the whole period since 1952, 654 of the 777 MAG IDs have DOIs, most of which could be found in Crossref. Of these DOIs, 347 contain the ISBN Bookland prefix "978" or "979" and therefore point to books or book chapters, but only one third of them are assigned to the types Book or BookChapter in MAG. The number of 777 missing MAG IDs exactly matches the difference between the overall number of MAG papers and the 197,444,264 OpenAlex works that have a MAG ID associated with them. Of the DOIs, 23 had been associated with more than one MAG ID and, apart from one, all could be found in OpenAlex.
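The ID comparison described above amounts to an anti-join on MAG IDs, followed by a pattern check on the DOIs of the unmatched records. The following is a minimal sketch of both steps, not the procedure actually used in this study. The function names, the regular expression for spotting a Bookland prefix, and the sample DOIs are assumptions made for illustration only.

```python
import re

# Counts reported in the text.
MAG_PAPERS = 197_445_041
OPENALEX_WORKS_WITH_MAG_ID = 197_444_264


def missing_ids(mag_ids, openalex_mag_ids):
    """Return the MAG IDs without a counterpart in OpenAlex (an anti-join)."""
    return set(mag_ids) - set(openalex_mag_ids)


# Assumed heuristic: "978" or "979" counts as a Bookland prefix only if it
# starts the DOI suffix or follows a "/" or "." separator, and is itself
# followed by a hyphen or a digit.
BOOKLAND = re.compile(r"(?:^|[/.])97[89](?=[-\d])")


def has_bookland_prefix(doi):
    """True if the DOI suffix appears to embed an ISBN-13 (hypothetical rule)."""
    _, _, suffix = doi.partition("/")
    return bool(BOOKLAND.search(suffix))


# The difference of the two reported totals reproduces the 777 missing IDs.
n_missing = MAG_PAPERS - OPENALEX_WORKS_WITH_MAG_ID

# Toy data with made-up IDs and DOIs.
toy_missing = missing_ids([1, 2, 3, 4], [2, 4])
book_like = has_bookland_prefix("10.1234/978-0-00-000000-0_1")
article_like = has_bookland_prefix("10.1234/j.example.2020.01.001")
```

A stricter check would validate the ISBN-13 check digit. The sketch only looks for the prefix, which is all the text above relies on.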
There are 1,161,901 works indexed in OpenAlex that have no corresponding record in MAG; 1,108,176 of them have a DOI in OpenAlex. In particular, 1,877 of these documents were published before 1800, the first publication year in MAG. In the following, only the documents that both databases have in common are investigated.
Figure 1 shows the annual numbers of common documents with and without a DOI for the years 1980 to 2020. The unexpected decrease in the total number starting in 2017 is due to the shrinking number of documents without a DOI, which in turn is dominated by documents with no document type assigned.

Comparison of bibliographic data in MAG and OpenAlex
For the 197,444,264 documents in OpenAlex with an ID in MAG, we first checked whether the bibliographic data from MAG, such as volume, issue, first page, last page, and DOI, are preserved after the transfer to OpenAlex. Where volume or issue were available in MAG, these data have been completely transferred to OpenAlex. This also seems to be the case for first and last pages and for DOIs. During our investigation, however, we found some issues with the quality of the original MAG data: (i) In more than 28,800 cases, the fields "first page" and "last page" each contained not a single number but the same range of numbers, e.g., "35-46". (ii) More than 810,028 DOIs occur more than once in the dataset, 7,626 of them at least ten times and 235 at least 100 times. Of the 100 most frequently occurring DOIs, only 29 can be resolved. (iii) More than 6,000 DOIs contain non-Latin characters; fewer than 200 of them could be resolved. Second, concerning the number of (linked) references for a document, we compared the respective values in both databases and found no difference.
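Checks (i) and (ii) can be expressed as simple record-level tests. The sketch below is an illustration under stated assumptions, not the code used for this study: the record layout (dictionaries with the keys `doi`, `first_page`, and `last_page`), the function names, and the lower-casing of DOIs before counting are all hypothetical choices, and the sample records are made up.

```python
import re
from collections import Counter

# A page field holding a range such as "35-46" instead of a single number.
PAGE_RANGE = re.compile(r"^\s*\d+\s*[-\u2013]\s*\d+\s*$")


def is_page_range(value):
    """True if a page field contains a numeric range rather than one number."""
    return bool(value) and bool(PAGE_RANGE.match(value))


def range_in_both_page_fields(record):
    """Issue (i): first and last page both hold the same range string."""
    first, last = record.get("first_page"), record.get("last_page")
    return is_page_range(first) and first == last


def duplicated_dois(records, min_count=2):
    """Issue (ii): DOIs attached to at least `min_count` records.

    DOIs are lower-cased before counting (an assumption of this sketch).
    """
    counts = Counter(r["doi"].lower() for r in records if r.get("doi"))
    return {doi: n for doi, n in counts.items() if n >= min_count}


# Made-up records for illustration.
records = [
    {"doi": "10.1234/a", "first_page": "35-46", "last_page": "35-46"},
    {"doi": "10.1234/A", "first_page": "1", "last_page": "9"},
    {"doi": "10.1234/b", "first_page": "12", "last_page": "12"},
    {"doi": None, "first_page": None, "last_page": None},
]
flagged = [r for r in records if range_in_both_page_fields(r)]
duplicates = duplicated_dois(records)
```

Raising `min_count` to 10 or 100 would give the counterparts of the higher thresholds mentioned above.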

Document types in both databases
In MAG, we are dealing with seven document types: Book, BookChapter, Conference, Dataset, Journal, Repository, and Thesis. Nearly 45% of the documents are classified as Journal, but nearly the same number of documents have no document type assigned (None). The more interesting cases are the reclassifications. Therefore, Figure 2 shows an alluvial diagram of the corresponding document types in both databases, excluding the transfers from Table 3. The alluvial diagram was produced using the software package "alluvial" (Bojanowski & Edwards, 2016) based on R (R Core Team, 2020). The reclassifications that occur in relevant numbers, summing to nearly 9.3% of all documents, are listed in Table 4. To get an impression of the quality of these reclassifications, we add some characteristics of random samples of ten documents each. All of them had a DOI, as could be expected because Crossref is the main source. The reclassification to type book-chapter in OpenAlex seems to work fairly well. This is also the case for journal-article. In particular, many documents using non-Latin character sets are now classified, and a substantial number of items with DOIs that MAG had labelled as arXiv preprints are correctly recognized as journal-article. On the other hand, the assignment of ChemInform abstracts to this document type is debatable, but they are definitely not preprints. Conference papers seem to be a special case: documents incorrectly assigned to Journal are corrected to proceedings-article, but for documents without a document type in MAG, the assignment of proceedings-article is less accurate or at least difficult to verify.
In the case of the MAG type Conference, the reclassification to journal-article seems to be valid overall, whereas the reclassification of LNCS contributions to book-chapter seems to result from their appearance as part of a book series and from the format of their DOIs, which contain the Bookland prefix "978" (doi.org, 2019). This should be kept in mind for bibliometric studies in computer science, which should probably include book chapters as well.
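The flows drawn in an alluvial diagram of this kind are counts of (MAG type, OpenAlex type) pairs, with the plain transfers removed. The following minimal sketch shows that counting step in Python. It is not the R workflow used for Figure 2. The `EQUIVALENT` mapping is an assumption of this sketch, since Table 3 is not reproduced here, and the sample pairs are made up.

```python
from collections import Counter

# Assumed equivalences between MAG and OpenAlex document types; the actual
# list of plain transfers is given in the paper's Table 3, not here.
EQUIVALENT = {
    "Journal": "journal-article",
    "BookChapter": "book-chapter",
    "Conference": "proceedings-article",
    "None": "none",
}


def reclassifications(pairs):
    """Count (MAG type, OpenAlex type) pairs that are not plain transfers."""
    return Counter(
        (mag, oa) for mag, oa in pairs if EQUIVALENT.get(mag) != oa
    )


# Made-up document type pairs for illustration.
pairs = [
    ("Journal", "journal-article"),
    ("Journal", "proceedings-article"),
    ("Conference", "book-chapter"),
    ("Conference", "book-chapter"),
    ("None", "journal-article"),
]
flows = reclassifications(pairs)
share_reclassified = 100.0 * sum(flows.values()) / len(pairs)
```

Applied to the full set of common documents, `share_reclassified` would correspond to the share of reclassified documents discussed above.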

Subject Classifications
OpenAlex states in its migration guide (OpenAlex, 2021) that it uses the same taxonomy as MAG but has reduced the number of "Fields of Study" (FoS) by removing those with fewer than 500 papers associated. Moreover, it applies a different algorithm, i.e., model V1 in its open-source software (Priem & Piwowar, 2022).
A quick look reveals the persistence of all 19 top-level FoSs (level=0) from MAG as well as of 284 of the 292 FoSs of the next level (level=1). Table 5 lists the distribution of all FoS levels from 0 to 5 in both databases. The strongest reduction in FoS numbers occurs in levels 3 to 5, where less than 10% persist. The total number of FoSs on all levels is 714,971 in MAG and only 65,073 in OpenAlex, which means a reduction to 9.1%. Interestingly, of the 24,768 level-3 FoSs in OpenAlex, more than 4,000 have fewer than 500 works assigned to them.
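The reduction figure and the 500-paper rule can be checked with a few lines of arithmetic. The sketch below reproduces the 9.1% from the two reported totals and shows how FoSs below the threshold could be identified from a table of paper counts. The function names and the per-FoS counts in the toy dictionary are made up for illustration.

```python
# Totals reported in the text.
MAG_FOS_TOTAL = 714_971
OPENALEX_FOS_TOTAL = 65_073
MIN_PAPERS = 500  # threshold stated in the OpenAlex migration guide


def persistence_share(kept, original):
    """Percentage of FoSs that remain after the reduction."""
    return 100.0 * kept / original


def below_threshold(paper_counts, minimum=MIN_PAPERS):
    """Return the FoSs with fewer than `minimum` papers assigned."""
    return {fos for fos, n in paper_counts.items() if n < minimum}


share = round(persistence_share(OPENALEX_FOS_TOTAL, MAG_FOS_TOTAL), 1)

# Made-up paper counts per FoS.
toy_counts = {"fos_a": 12_000, "fos_b": 499, "fos_c": 500}
small = below_threshold(toy_counts)
```

Running `below_threshold` on the level-3 FoSs of OpenAlex would list the more than 4,000 cases mentioned above that fall under the threshold.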

Discussion and Conclusions
OpenAlex has transferred practically all works from MAG, preserving their bibliographic data (publication year, volume, first and last page, and DOI) as well as the number of references, all of which are important ingredients of citation analysis.
More than 90% of the MAG documents have equivalent document types in OpenAlex. Of the remaining ones, the reclassifications to the OpenAlex document types journal-article and book-chapter in particular seem to be valid and amount to more than 7%, so that the document type specifications have improved significantly from MAG to OpenAlex. So far, OpenAlex seems to be better suited for bibliometric analyses than MAG.
As the last item of bibliometrically relevant metadata, we looked at the paper-based subject classification via FoS in MAG and in OpenAlex. We found significantly more documents with a FoS assignment in OpenAlex than in MAG. On the first and second levels, the FoS structure is identical and nearly identical, respectively, but on the deeper levels the number of available FoSs is drastically reduced to about 10%. This would not pose a problem if only the upper two levels were used for bibliometric analyses, as was done by Scheidsteger et al. (2018). However, the reclassifications might change the conclusions of previous studies. The consequences of the proliferation and abundant reclassification of top-level FoSs need to be studied in more detail. Reclassifications at the deeper levels should be studied, too.
Overall, OpenAlex seems to be at least as well suited for bibliometric analyses as MAG for publication years before 2021. However, this first impression needs to be checked by further detailed studies.

Figure 1: Numbers of common OpenAlex-MAG documents across the years 1980 to 2020

Figure 2: Alluvial diagram of document type reclassifications from MAG to OpenAlex

Figure 3: Alluvial diagram for the top-level FoS reclassifications from MAG to OpenAlex, showing only reclassifications that occur at least 200,000 times

Table 1: Number and percentages of document types in MAG.

In OpenAlex, there are 26 document types that inherit their definition from another major data source, Crossref, as documented in Crossref's Content Type Markup Guide (Crossref, 2021). Obviously, all works in OpenAlex with a Crossref DOI receive their document type from there. The document types with a share of more than 1.0% of all documents are listed in Table 2. An additional nine million items in OpenAlex are assigned to the document type journal-article as compared to the MAG document type Journal. The OpenAlex items of document type journal-article cover nearly one half of all documents, but the items without a document type (none) still make up more than a third. In MAG, by contrast, the document types Journal and None are about equally large. The increased numbers of journal articles, conference proceedings, and book chapters are especially interesting from a bibliometric point of view.

Table 2: Numbers and percentages of document types in OpenAlex.

Table 5: Distribution of FoSs in MAG and OpenAlex.

Even if the top-level FoSs persist, they are associated with the papers very differently. For example, one paper (https://api.openalex.org/works/W2178938397, accessed on 26 April 2022) had one top-level FoS and one level-1 FoS in MAG, but it has six additional top-level FoSs and one additional level-1 FoS in OpenAlex. The total number of papers with any FoS is considerably higher in OpenAlex: 30.5 of the 48.9 million documents without any FoS in MAG have at least one FoS in OpenAlex, so that the coverage increases from 74.6% to 86.6%. In MAG, there are 147,360,860 papers with at least one top-level FoS and a total of 147,426,219 assignments to top-level FoSs, i.e., 65,359 of the papers have more than one top-level FoS (up to seven). In OpenAlex, there are 170,900,225 works with any top-level FoS and 229,560,450 assignments to top-level FoSs in total; 52,966,153 works have at least two top-level FoSs (up to seven). About 77.2% of all top-level assignments in MAG persist in OpenAlex, but this proportion varies significantly across the 19 top-level FoSs, as Table 6 clearly shows: from less than a quarter for Engineering to more than 90% for Materials Science and Medicine.
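The coverage percentages and the count of surplus assignments can be reproduced from the reported totals. The sketch below does so under one assumption that is not stated explicitly in the text: that coverage is the share of the 197,444,264 common documents carrying at least one top-level FoS. The variable and function names are hypothetical.

```python
# Totals reported in the text.
COMMON_DOCS = 197_444_264
MAG_WITH_TOP = 147_360_860
MAG_TOP_ASSIGNMENTS = 147_426_219
OA_WITH_TOP = 170_900_225
OA_TOP_ASSIGNMENTS = 229_560_450


def coverage(n_with_fos, total=COMMON_DOCS):
    """Percentage of common documents with at least one top-level FoS
    (assumed definition of coverage)."""
    return 100.0 * n_with_fos / total


mag_coverage = round(coverage(MAG_WITH_TOP), 1)
oa_coverage = round(coverage(OA_WITH_TOP), 1)

# Assignments beyond one per paper; in MAG this equals the 65,359 quoted above.
mag_surplus = MAG_TOP_ASSIGNMENTS - MAG_WITH_TOP

# Average number of top-level FoSs per classified document in each database.
mag_mean = MAG_TOP_ASSIGNMENTS / MAG_WITH_TOP
oa_mean = OA_TOP_ASSIGNMENTS / OA_WITH_TOP
```

Under this assumption, the two rounded values match the reported 74.6% and 86.6%. The means show the shift from almost exactly one top-level FoS per paper in MAG to roughly 1.3 in OpenAlex.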

Table 6: Distribution of top-level FoSs in MAG and percentage of top-level FoSs persistent in OpenAlex.

Figure 3 shows an alluvial plot of the transfer of paper-based subject classifications without the persistent FoS assignments of Table 6, so that the remaining reclassifications become more visible. Given that all 342 possible reclassifications do indeed occur in our publication set, only the 94 connections with at least 200,000 occurrences are shown. Several reclassifications occur to a comparable extent in both directions, e.g., in the pairs Sociology & Psychology, Sociology & Political Science, or Psychology & Medicine. Others show a significant transfer mainly in one direction, such as Engineering to Computer Science, Mathematics to Computer Science, Biology to Chemistry, or Chemistry to Materials Science.
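The figure of 342 possible reclassifications follows from counting ordered pairs of distinct top-level FoSs: with 19 fields there are 19 × 18 = 342 directed pairs. The sketch below verifies this and shows how flows could be filtered by the 200,000 threshold. The function names and the flow counts in the toy dictionary are made up and are not taken from the data.

```python
from itertools import permutations

N_TOP_LEVEL = 19        # number of top-level FoSs
MIN_OCCURRENCES = 200_000  # display threshold used for Figure 3


def directed_pairs(n):
    """Number of ordered pairs of distinct categories: n * (n - 1)."""
    return n * (n - 1)


def frequent_flows(flow_counts, minimum=MIN_OCCURRENCES):
    """Keep only the reclassification flows with at least `minimum` cases."""
    return {pair: n for pair, n in flow_counts.items() if n >= minimum}


n_possible = directed_pairs(N_TOP_LEVEL)
n_enumerated = len(list(permutations(range(N_TOP_LEVEL), 2)))

# Made-up flow counts for illustration only.
toy_flows = {
    ("Engineering", "Computer Science"): 950_000,
    ("Computer Science", "Engineering"): 150_000,
    ("Biology", "Chemistry"): 200_000,
}
shown = frequent_flows(toy_flows)
```

Because the pairs are ordered, a flow and its reverse are counted separately, which is why the two directions of a pair such as Sociology and Psychology can both appear in the diagram.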