A search strategy for publications in interdisciplinary research

To retrieve the right collection of publications in interdisciplinary research, we develop a search strategy with four progressive steps and take the area of public affairs (PA) as a case study. A set of seed publications in PA is first identified, followed by the construction of a pool set of publications with wider coverage. The third step, which is critical, refines the pool set into an expanded set of publications on the basis of reference information and text semantic information, generating two respective subsets. The first subset is obtained on the basis of the number of references shared by each pair of publications drawn from the seed set and the pool set. To optimize the results, we construct two models, viz. a support vector machine (SVM) and a fully connected neural network (FCNN), and find that the FCNN model outperforms the SVM model. The second subset is collected by selecting the publications with high topic similarity to the seed publications collected in the first step. The final step is to integrate the seed publications with the expanded publications collected in steps 1 and 3. The results show that PA research involves an extremely wide range of disciplines (n = 45), among which public administration, environmental sciences, economics, management, and health policy and services play the most significant roles.


Introduction
Scientific research involving multiple disciplines/fields plays a significant role in the development of science and the social sciences, thus leading to numerous studies on interdisciplinary research from various perspectives (Lovelock; Margulis, 1974; Wagner et al., 2011; Wang et al., 2015; Glänzel; Debackere, 2021; Ledford, 2015). How to define the disciplinary attribution of interdisciplinary research remains a challenge because the knowledge involved comes from two or more disciplines (Klein et al., 1997; Repko et al., 2007; OECD, 1972; Rotolo et al., 2015) that may vary depending on the specific research topic (Glänzel et al., 2016). It thus becomes difficult to retrieve interdisciplinary publications on the basis of existing classification schemes such as Web of Science (WoS) subject categories. Working at the level of scientific journals, the WoS scheme consists of approximately 250 research areas equivalent to subject categories. A multidisciplinary journal may be assigned to one or more research areas (Leydesdorff; Ràfols, 2009). To improve the WoS journal-based system, some science and technology organizations have developed their own journal-level classification systems (Archambault et al., 2011; Glänzel; Schubert, 2003; Boyack; Klavans, 2010). Such classification schemes, however, cannot be used directly to define interdisciplinary publications.
Classification schemes at the publication level may better reflect research subjects and were explored at a relatively small scale in the early period (e.g., Griffith et al., 1974; Small; Griffith, 1974; Small; Sweeney, 1985), until 2010, when a large-scale publication-level classification scheme was proposed (Boyack; Klavans, 2010). On the basis of direct citation relationships between publications, the Leiden classification system covers scientific fields (Waltman; Van-Eck, 2012) and is composed of three hierarchies, with 20 research areas at the first level, 672 research areas at the second level, and 22,412 research areas at the third level. Working at publication level instead of journal level, the Leiden classification system can classify specific scientific areas in more detail and thus match the current structure of scientific research more closely. Using the Leiden methodology, interdisciplinary publications can be harvested by following simple steps. The requirements in terms of computing time and memory usage, however, are too demanding for individual researchers using a standard desktop computer. In addition, the Leiden methodology does not consider indirect citation relations, which may lead to an incomplete collection of interdisciplinary publications.

Funding: This study received financial support from the National Natural Science Foundation of China (NSFC, no. 71843012).

To collect interdisciplinary publications on specific topics such as innovation systems, big data, or public affairs, five methods can be considered. The first is to identify historical core contributions and community members in a specific area (Fagerberg et al., 2012). With an implicit assumption that the historical core literature remains stable over time, such an approach is not able to reflect the dynamic changes in a specific field. The second relies on a set of keywords and has wider applications (e.g., Huang et al., 2011; 2015; 2019; Ruiz-Navas; Miyazaki, 2018; Liu et al., 2021). Variations exist in keyword approaches. For example, some researchers regard keywords as a core set that defines a specific field (Huang et al., 2011), which requires strong knowledge of the field and the consideration of emerging terms. By making full use of the semi-automatic iteration of keyword variants, some researchers first apply a keyword retrieval strategy to identify a core set of publications and then supplement or modify this core set by using high-frequency words (e.g., Liu et al., 2021; Suominen et al., 2016; Shapira et al., 2017), consulting experts in related fields (Huang et al., 2015), or using synonyms and subordinate words through a retrieval formula (Ruiz-Navas; Miyazaki, 2018). The third is to expand related publications through direct citation relations among publications or journals (e.g., Waltman; Van-Eck, 2012; Muñoz-Écija et al., 2019; Bassecoulard et al., 2007), which requires high computational and memory capacity. To avoid such capacity demands, researchers start with a core set of publications and then collect publications linked by citing and cited relationships (Wang et al., 2019). However, this may result in a large number of redundant documents with low relevance to the research topic being included in the core document set. In addition, starting from a core set of publications, the fourth takes advantage of co-citation relations and bibliographic coupling (Kostoff et al., 2006; Zitt; Bassecoulard, 2006; Alencar et al., 2007; Soós; Kiss, 2020; Zhou et al., 2019). This approach, however, is sensitive to the coverage of the starting corpus, in addition to the time lag of citations (Mogoutov; Kahane, 2007).
As a comprehensive retrieval strategy with multiple rounds of iterations (Glänzel, 2015), the fifth includes two critical tasks and is more accurate than the previous four.
Step 1 is to determine a core set of documents (Zitt; Bassecoulard, 2006) that can represent the subject in question well, and step 2 is to extend this core set with relevant documents on the basis of thematic similarity through different approaches. To ensure that the documents collected in the second step are relevant to the subject area, a bibliometric-added retrieval method was proposed, combining complex structures instead of individual search terms. All types of search fields, including keywords, terms, subject headings, journal titles, citations and references, and even organization addresses and author names/identifiers, can be incorporated into this retrieval strategy. For example, Rakas and Hain (2019) first collected a seed set of publications successively through keyword retrieval and highly cited publications, and then collected a set of "relevant documents" on the basis of overlapping bibliography (i.e., articles that share a large number of identical references with a seed article). The two sets of publications are integrated as a final corpus for subsequent analysis. Duplicates and publications that were not cited at least once are excluded. However, defects can exist in the process of publication collection because documents that are important but that have not yet attracted community attention may be missed. Moreover, extracting the 500 publications with the highest bibliographical overlap with each seed publication may result in the inclusion of publications that are irrelevant or less relevant to the topic under consideration.
The diversity and variety of interdisciplinary research (Leydesdorff et al., 2019) result in difficulty when it comes to the classification and retrieval of interdisciplinary publications. Most existing search strategies have pros and cons. The current paper proposes a general search strategy that can be applied to any interdisciplinary subject or field, and then applies this strategy to the discipline of public affairs (PA), a typical interdisciplinary area.

A general search strategy
The disciplines or fields involved in interdisciplinary research may vary significantly in terms of their number and type, resulting in various search strategies, such as those introduced above. In the case where no single approach is widely accepted, it becomes necessary to propose a framework that is applicable to any interdisciplinary research field. Below, we first introduce such a general search strategy with four steps, and then apply the strategy to the publication collection of public affairs. Compared with other interdisciplinary subjects or fields, the interdisciplinary situation of public affairs is more complex in terms of both the variety and diversity of the disciplines/fields involved. A search strategy applicable to PA can thus be easily adjusted to other interdisciplinary subjects/fields with less diversity or variety.
A strategy should consider both recall and precision. Drawing on Bradford's law and on a summary of the pros and cons of existing methods, we propose a framework with four steps. Emphasizing precision, the first step is to construct a seed set of publications. The second and third steps take recall into consideration but with different focuses: step 2 aims to construct a pool set of publications so as to enlarge the coverage, whereas step 3 aims to construct an expanded set of publications by extracting publications from the pool set that are similar to the target discipline or field. In the fourth step, a final corpus of publications is obtained by integrating the seed publications and expanded publications.
The seed publications collected in the first step should be the most representative of the target discipline or field, to ensure the relevance of the expanded publications collected in the next steps. The methods to be chosen depend on the degree of interdisciplinarity, for instance, on whether there is a clear core field/discipline. For those with a clear core discipline or field (e.g., public affairs or digital finance), a list of representative journals (core journals according to Bradford's law, subject category, or other combined methods) can be applied; for those without a definite core field or discipline (e.g., digital governance or innovation systems), a retrieval method using a set of topic words can be applied. To ensure the representativeness of the obtained collection, other bibliometric methods (e.g., selecting highly cited publications) can be applied.
To establish a pool set of publications with wider coverage, a well-accepted subject classification system with the targeted interdisciplinary category can be used. In addition to that of the WoS, national and organizational subject classification systems may also work. For example, the classification system of the National Natural Science Foundation of China (NSFC) is used in the current study. All the mentioned classification systems are based on journals.
Nevertheless, the problem with such types of classification is obvious, because not all publications in journals of the same interdisciplinary subject category are necessarily similar. Thus, publications in the pool set that are similar to the seeds should be identified and extracted, which is the task of step 3: constructing an expanded set by assessing the subject similarity of publications in the pool with each of the seeds. An assessment of either reference similarity or topic similarity of the publications in the two sets (i.e., the pool set and the seed set) can serve this purpose. To assess reference similarity, publications in the two sets are paired to calculate dynamic thresholds for the number of common references by adopting models such as a support vector machine (SVM) or a fully connected neural network (FCNN). The operational results will indicate which model should be used. On the basis of the dynamic threshold, similar publication pairs can be identified. To evaluate topic similarity, a text semantic similarity algorithm (e.g., Jaccard, Word2vec, or TF-IDF) may be an option.
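As an illustration of the simplest of the text similarity measures named above, the following is a minimal sketch of Jaccard similarity on token sets. The tokenizer and the two example titles are hypothetical and are not taken from the study's data.

```python
import re


def tokenize(text):
    """Lowercase a text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two texts: |A & B| / |A | B| on token sets."""
    ta, tb = tokenize(a), tokenize(b)
    if not ta and not tb:
        return 0.0  # convention chosen here for two empty texts
    return len(ta & tb) / len(ta | tb)


# Hypothetical titles, for illustration only.
seed_title = "Collaborative governance in public administration"
pool_title = "Public administration and collaborative water governance"
similarity = jaccard(seed_title, pool_title)
print(round(similarity, 3))
```

In practice the comparison would run over titles, abstracts, and keywords rather than titles alone, and stop words would usually be removed first.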

The case of public affairs
After more than 100 years of development, the discipline of public affairs has continuously absorbed knowledge from numerous disciplines/fields, such as political science, management, law, psychology, sociology, economics, and information science (Benz, 2005; Rodgers; Rodgers, 2000; Fleisher; Blair, 1999; Harris; Moss, 2001a; Harris; Fleisher, 2005), with the result that PA is regarded as a "borrowing discipline" (Stallings, 1986). With economic development and social progress, PA has been facing increasingly new problems and challenges (Raadschelders, 2011). Many sophisticated policy issues simply cannot be addressed within the narrow boundaries of traditional disciplines. Absorbing knowledge from a broader scope of subjects, including the natural sciences and engineering, has accompanied the development of PA as a discipline, and thus further increases its interdisciplinarity (Kettl; Milward, 1996). Some "distant" subjects, such as computer science, mathematics, and statistics, have become an inexhaustible resource for PA (Savage, 1974). Undoubtedly, the extensive assimilation of knowledge contributes significantly to the development of PA; in the meantime, however, it becomes difficult to delineate the field composition of the discipline (Yang, 2019; Steckmest, 1982; Harris; Moss, 2001b). Failing to clarify this field/disciplinary attribution may limit researchers' reference scope and thus affect the development of the discipline, weakening scholars' sense of field belonging and even pushing PA into a crisis of discipline extinction over time (Ostrom, 1974; White et al., 1996; Seibel, 1996; Denhardt, 2004). In addition, the field variety within PA is significant owing to the broad inclusion of fields or disciplines such as public administration, environmental protection, social security, and digital governance.
Different public affairs universities may vary in field coverage, which results in enormous challenges to benchmarking practices (e.g., university ranking and performance evaluation) because research evaluation depends on the field (Hicks et al., 2015; Gómez-Núñez et al., 2014). To collect publications in public affairs, we follow the general search strategy proposed above. The detailed framework is illustrated in Figure 1.

Step 1. Defining a seed set of publications
Public affairs involves various relationships surrounding governments (Fleisher, 2001; Harris; Fleisher, 2005; Lerbinger, 2006), with public administration as the core. Thus, public administration can be regarded as the core field of public affairs. Articles indexed in the WoS and published in the years 2000-2019 in the public administration subject category are first collected. Among the obtained publications, only those in the top 10% by citations, either within 5 years or per year, are retained as the seed set of publications of public affairs. After removing duplicates, 4,664 seed publications are confirmed. The historical development and discipline distribution of the seed publications are shown in Figures 2 and 3. Over 10 years, the seed publications of PA have increased about sevenfold. PA itself is multidisciplinary, involving more than 10 disciplines, and more than half of the seeds involve research in the public administration field. Political science is also one of the main fields of PA.
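The per-year part of the citation filter described above can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout `(pub_id, year, citations)`, the rounding rule (ceiling, with at least one publication kept per year), and the tie-breaking are all hypothetical choices, and the 5-year citation window mentioned in the text is not modeled.

```python
from collections import defaultdict


def top_decile_by_year(records):
    """Return the ids of records in the top 10% by citations within each year.

    `records` is an iterable of (pub_id, year, citations) tuples.
    """
    by_year = defaultdict(list)
    for pub_id, year, cites in records:
        by_year[year].append((cites, pub_id))

    seeds = set()
    for items in by_year.values():
        items.sort(reverse=True)            # most cited first
        k = max(1, (len(items) + 9) // 10)  # ceil(10%), at least one per year
        seeds.update(pub_id for _, pub_id in items[:k])
    return seeds


# Hypothetical records: 10 papers from 2000 and 20 papers from 2001.
records = [("p%d" % i, 2000, i) for i in range(10)]
records += [("q%d" % i, 2001, i) for i in range(20)]
print(sorted(top_decile_by_year(records)))
```

Ranking within each publication year avoids favoring older papers, which have had more time to accumulate citations.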

Step 2. Constructing a pool set of publications
Defining the seed set of publications by using highly cited articles may result in a lack of knowledge regarding other important contributions. It is necessary to construct a pool set of publications with wider coverage by searching the journals related to public affairs. The list of 91 journals in the current study is provided by the Development Strategy for the Discipline of Public Affairs (Xue et al., 2017), an output of a key project of the National Natural Science Foundation of China. On the basis of this journal list, which is widely accepted by universities, as well as the practice of Yu (2019), four journals (Energy, Applied Energy, Agricultural Water Management, and Energy Economics) were eliminated because keyword clustering analysis showed that the topics of their publications are mainly related to energy, resource, and environmental technology, with little relevance to public affairs. Furthermore, the publication sizes of these four journals are unstable and have increased significantly from several hundred to more than 4,000 since 2010, which may affect the statistical results of PA research. The other 87 journals are indexed in the Web of Science (WoS), and include a total of 140,806 articles published in the period 2000-2019. In other words, the pool set comprises 140,806 articles. The wide disciplinary coverage of publications in the pool set is presented in Table 1. The public administration field no longer takes first position; instead, environment-related fields and economics take the lead. The purpose of enlarging the disciplinary coverage has thus been achieved in a reasonable way.

Step 3. Establishing an expanded set of publications
The wider coverage of the pool set means that it may include less relevant or irrelevant documents. To ensure that the publications are indeed relevant to public affairs, we adopt two methods to establish an expanded set of publications, namely measuring the number of shared references and measuring the topic similarity between the seed publications and those in the pool set. Measuring shared references is more complicated and takes longer to illustrate.
Measurement of shared references. The references of the seed publications and the pool publications are compared in pairs. The number of shared references is used to determine whether two publications are similar. With regard to the appropriate threshold for the shared number of references, variation exists among different studies. The detailed operation is as follows: we first randomly select a sample of 500 publications from the seed set, and then select 500 publications from the pool set and pair them with each initial "seed." A dataset of 500 publication pairs (one-to-one matching) is obtained. Second, on the basis of the similarity of the title, abstract, keywords, and other information, we manually label whether each pair of publications is similar (1 for similar and 0 for dissimilar pairs). In the meantime, the number of references of each publication in a pair and the number of shared references are recorded. Rakas and Hain (2019) define the top 500 publications with a high overlap rate with the references in the pool set as the expanded publications of each seed publication. The disadvantages of this method are obvious: it ignores citation disparities among document types, publication ages, and research topics. Such disadvantages become severe when the number of shared references is small and when the number of references varies greatly between publications. The similarity between the target publications (a seed publication and a pool publication) in cases A and B in Table 2 would be considered equal on the basis of an absolute threshold, although it is clear that significant variation exists.
Could a relative threshold be a better solution? We use three cases, presented in Table 3, for further understanding.
When the difference in the number of references between the two compared papers is small, a relative threshold can be set to roughly determine the similarity on the basis of the smaller number of references [i.e., threshold = 20% × min(Num-Seed R, Num-Pool R)]. Thus, in case D, the number of references shared between a seed publication and a pool publication is 28. When the difference is large, however, a relative threshold does not work, for instance, case E in Table 3.
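The contrast between the two threshold types can be sketched as follows. All reference counts below are hypothetical and are not the values of Tables 2 and 3; the absolute cut-off of 10 is likewise an assumption chosen only for illustration, while the 20% ratio follows the rule quoted above.

```python
def similar_absolute(shared, threshold=10):
    """Absolute rule: a fixed number of shared references (cut-off assumed)."""
    return shared >= threshold


def similar_relative(n_seed_refs, n_pool_refs, shared, ratio=0.20):
    """Relative rule: shared references >= 20% of the shorter reference list."""
    return shared >= ratio * min(n_seed_refs, n_pool_refs)


# Hypothetical pairs: (seed references, pool references, shared references).
short_lists = (30, 35, 10)   # 10 shared out of about 30: substantial overlap
long_lists = (200, 250, 10)  # 10 shared out of 200 or more: marginal overlap
unbalanced = (20, 300, 5)    # very different reference list lengths

# The absolute rule cannot tell the first two pairs apart.
print(similar_absolute(short_lists[2]), similar_absolute(long_lists[2]))

# The relative rule separates them ...
print(similar_relative(*short_lists), similar_relative(*long_lists))

# ... but passes the unbalanced pair on only 5 shared references, because
# the threshold is driven entirely by the shorter list.
print(similar_relative(*unbalanced))
```

This is one plausible reading of why neither rule is satisfactory on its own: each collapses three quantities (the two list lengths and the overlap) into a single comparison.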
Given that neither absolute nor relative thresholds can be used to reasonably determine publication similarity, this study tries to construct high-dimensional models for the determination of publication similarity. Two models, a support vector machine and a fully connected neural network, are constructed and compared to select the one that is optimal. The result of each model is compared with that of the method proposed by Rakas and Hain (2019).
On the basis of a sample of 500 publication pairs selected randomly and labeled manually, a model is developed to capture the relationship among the number of references of the seed publication (X), the number of references of the pool publication (Y), and the number of shared references (Z). The number of shared references (Z) of each pool publication and seed publication is identified manually. The labeling variable is a dichotomous variable: whether each pair of publications is similar (1 for similar pairs and 0 for dissimilar pairs). It is well known that the number of references of individual publications varies significantly, from a few to tens or hundreds. To ensure that the selected publications represent publications with different numbers of references, and thus to ensure the estimation accuracy and application scope of the model, we carried out five steps.
Step 1 was to arrange the number of references of each seed and each pool publication in descending order and exclude those with fewer than two references.
Step 2 was to classify publications on the basis of their number of references into subset 1 (2-100 references), subset 2 (101-200 references), etc. The determination of the interval of the number of references is empirical.
Step 3 was to randomly select publications from each subset, and in the end, a total of 500 sample publications from the seed and pool publication set, respectively, were obtained.
Step 4 was to pair-match each of the sample publications in the seed set with each of those in the pool set, whereby a set of 250,000 pairs was obtained (500 × 500 = 250,000).
Step 5 was to arrange the number of shared references of each pair in descending order after excluding those with fewer than two shared references, to classify publication pairs on the basis of their number of shared references into subset 1 (2-20 shared references), subset 2 (21-40 shared references), etc., and to select randomly a total of 500 publication pairs from the subsets.
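The binning and sampling logic of the five steps can be sketched as below. The source does not state how many items are drawn from each subset, so the round-robin allocation used here (one draw per bin in turn until the target size is reached) is an assumption, as are the function and parameter names.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, bin_width, total, min_value=2, seed=0):
    """Draw up to `total` items spread across equal-width bins of key(item).

    Items with key(item) < min_value are excluded, mirroring the exclusion
    of publications (or pairs) with fewer than two (shared) references.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for item in items:
        value = key(item)
        if value < min_value:
            continue
        # With bin_width=100: values 2-100 go to bin 0, 101-200 to bin 1, ...
        bins[(value - 1) // bin_width].append(item)

    for members in bins.values():
        rng.shuffle(members)

    ordered = [bins[b] for b in sorted(bins)]
    sample = []
    # Assumed allocation: take one item per bin in turn until done.
    while len(sample) < total and any(ordered):
        for members in ordered:
            if members and len(sample) < total:
                sample.append(members.pop())
    return sample


# Hypothetical reference counts 0..300 standing in for publications.
counts = list(range(0, 301))
sample = stratified_sample(counts, key=lambda v: v, bin_width=100, total=30)
print(len(sample))
```

The same function could be reused for step 5 by passing pairs with `key` returning the number of shared references and `bin_width=20`.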
For the 500 publication pairs, the mean values of the number of references of the seed and the pool publications are 89.6 and 97.0, respectively, while the corresponding modes are 62 and 61, respectively (Table 4). In combination with the results of the analysis of extreme values and standard deviation, one can see that the mean values are significantly affected by extreme values, and the data sampling is relatively dispersed, which is conducive to improving the accuracy of the subsequent model estimation. More specifically, the samples were divided into two groups: similar pairs (22.0%) and dissimilar pairs (78.0%). The mean values of X and Y of the similar publication pairs are greater than those of the dissimilar pairs, which implies that a publication pair with a larger number of references may have more shared references (similar pairs: Z mean = 20.5) and thus be more likely to be judged as a similar publication pair. Significant correlations between different variables do exist but with low values (less than 0.6, see Table 5). The correlation between X and Y (0.55) is higher than those between Z and X and between Z and Y (0.514 and 0.453, respectively). The correlations of the dependent variable with X and Y (0.093 and 0.041, respectively) are much lower than that with Z (0.488).

Model 1: support vector machine
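The support vector machine (SVM) is a supervised machine learning model for data classification that works by identifying the optimal hyperplane on the basis of the principle of maximum margin in the feature space (Cortes; Vapnik, 1995):

$$w^{T}x + b = 0 \tag{1}$$

The normal vector of the hyperplane is denoted as $w = (w_1, w_2, \ldots, w_s)$, and the displacement as $b$. Finding the optimal hyperplane can be transformed into an optimization problem:

$$\min_{w,b}\ \frac{1}{2}\lVert w\rVert^{2} \quad \text{s.t.}\quad y_i\,(w^{T}x_i + b) \ge 1,\quad i = 1, \ldots, N \tag{2}$$

As completely linearly separable cases are comparatively rare in reality, we can lower the standard for the support vector machine from completely linearly separable to tolerating slight classification errors by adding a slack variable $\xi_i$ ($\xi_i \ge 0$) and a penalty term $C$ ($C > 0$) to the original linearly separable support vector machine. The optimization problem then becomes:

$$\min_{w,b,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{N}\xi_i \quad \text{s.t.}\quad y_i\,(w^{T}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0 \tag{3}$$

The model has a higher level of fault tolerance when $C = 1$. To solve Eq. (3), the Lagrange function can be constructed by adopting the Lagrange multiplier method, and the optimal solution to Eq. (3) can then be obtained by solving its dual problem:

$$\max_{\alpha}\ \sum_{i=1}^{N}\alpha_i - \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j\, x_i^{T}x_j \quad \text{s.t.}\quad \sum_{i=1}^{N}\alpha_i y_i = 0,\quad 0 \le \alpha_i \le C \tag{4}$$

The sequential minimal optimization (SMO) algorithm can then be used to solve for $\alpha$. Subsequently, $w$ is solved via the equation $w = \sum_{i=1}^{N}\alpha_i y_i x_i$. The final function is:

$$f(x) = \operatorname{sign}\!\left(\sum_{i=1}^{N}\alpha_i y_i\, x_i^{T}x + b\right) \tag{5}$$

The hyperplane fitting result is shown in Figure 4. (The data presented in Figure 4 are the training samples.)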
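To make the soft-margin idea concrete, the following is a minimal sketch of a linear SVM with penalty term C = 1 trained on three features per pair (seed references, pool references, shared references). It is not the procedure used in the study: it minimizes the primal hinge-loss objective by plain subgradient descent rather than solving the dual with SMO, it uses no external library, and the six rescaled training pairs, the learning rate, and the epoch count are hypothetical.

```python
def train_linear_svm(samples, labels, C=1.0, lr=0.005, epochs=4000):
    """Minimize 0.5*||w||^2 + C * sum(max(0, 1 - y*(w.x + b))) by
    batch subgradient descent. Labels must be +1 or -1."""
    n_features = len(samples[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the regularizer 0.5*||w||^2
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # hinge loss is active for this sample
                for j in range(n_features):
                    grad_w[j] -= C * y * x[j]
                grad_b -= C * y
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 (similar) or -1 (dissimilar) from the sign of w.x + b."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Hypothetical pairs: (seed refs / 100, pool refs / 100, shared refs / 10).
samples = [
    (0.6, 0.7, 2.5), (0.9, 1.0, 3.0), (0.4, 0.5, 2.0),  # labeled similar
    (0.6, 0.6, 0.2), (1.2, 0.9, 0.4), (0.3, 0.5, 0.3),  # labeled dissimilar
]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(samples, labels)
print([predict(w, b, x) for x in samples])
```

Rescaling the counts to comparable ranges before training is a design choice made here so that a single learning rate suits all three features.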

Model 2: fully connected neural network
A fully connected neural network (FCNN) is often used in data classification, information recognition, and prediction because of its strong self-learning and fitting abilities. It can approximate complex nonlinear relationships (Liu et al., 2018; Raiyani et al., 2018). The FCNN constructed in the current study contains an input layer, an output layer, and two hidden layers. The training and testing datasets are the same as for the SVM. The input of each layer can be regarded as the output of the previous layer. The process of network generation consists of two steps in opposite directions, that is, forward propagation to calculate the model loss and backward propagation to update the model parameters. Finally, the model parameter value under the condition of minimum loss is obtained (Figure 5).
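The forward pass of such a network can be sketched in a few lines. The layer widths (3 inputs, 32 and 16 hidden neurons, 2 outputs) follow the sizes reported for this network; everything else is an assumption made for illustration: the weights are random and untrained, dropout is omitted, and the input values are hypothetical rescaled reference counts.

```python
import math
import random


def relu(v):
    """Element-wise rectified linear unit."""
    return [max(0.0, a) for a in v]


def dense(x, W, B):
    """Compute W^T x + B, with W stored as len(x) rows by len(B) columns."""
    return [
        sum(x[i] * W[i][j] for i in range(len(x))) + B[j]
        for j in range(len(B))
    ]


def softmax(v):
    """Numerically stable softmax: outputs are positive and sum to 1."""
    m = max(v)
    e = [math.exp(a - m) for a in v]
    s = sum(e)
    return [a / s for a in e]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (placeholder initialization)."""
    W = [[rng.uniform(-0.1, 0.1) for _ in range(n_out)] for _ in range(n_in)]
    return W, [0.0] * n_out


def forward(x, params):
    """Input -> hidden (ReLU) -> hidden (ReLU) -> output (softmax)."""
    (W1, B1), (W2, B2), (W3, B3) = params
    z1 = relu(dense(x, W1, B1))
    z2 = relu(dense(z1, W2, B2))
    return softmax(dense(z2, W3, B3))


rng = random.Random(0)
params = [init_layer(3, 32, rng), init_layer(32, 16, rng), init_layer(16, 2, rng)]

# Hypothetical input: rescaled (seed refs, pool refs, shared refs).
probs = forward([0.62, 0.61, 0.20], params)
label = probs.index(max(probs))  # 0 = dissimilar, 1 = similar
print(len(probs), label)
```

With trained weights, `label` would be the predicted similarity class for the publication pair; here it only demonstrates the data flow.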
The forward propagation process of the network takes the vector $x = [x_1, x_2, x_3]^{T}$ as the input data and combines it linearly with the weights $w_1, w_2, w_3$ and the bias term $b$. The nonlinear transformation is achieved by a rectified linear unit (ReLU) function, and the output vector $z_1$ is obtained and then used as the input data for the first hidden layer:

$$z_1 = \mathrm{ReLU}(W_1^{T}x + B_1) \tag{6}$$

where $W_1 \in \mathbb{R}^{3 \times n}$, $B_1 \in \mathbb{R}^{n}$, and $n$ is the number of neurons (32 neurons in this study) in this layer. The output vector of the first hidden layer can be regarded as the input vector of the second hidden layer. The formula for the second hidden layer is as follows:

$$z_2 = \mathrm{ReLU}(W_2^{T}z_1 + B_2) \tag{7}$$

where $W_2 \in \mathbb{R}^{n \times m}$ is the output weight matrix, $B_2 \in \mathbb{R}^{m}$ is the bias vector, and $m$ is the number of neurons (16 neurons in this study) in this layer. To prevent the model from overfitting and to improve the generalization ability of the network, a dropout layer is added to randomly stop updating the weights of some neurons (20% of neurons). The output vector of the second hidden layer is then used as the input vector of the output layer:

$$z_3 = W_3^{T}z_2 + B_3 \tag{8}$$

where $W_3 \in \mathbb{R}^{m \times 2}$ and $B_3 \in \mathbb{R}^{2}$. The network can be regarded as a nonlinear composite function:

$$f(x) = W_3^{T}\,\mathrm{ReLU}\!\left(W_2^{T}\,\mathrm{ReLU}(W_1^{T}x + B_1) + B_2\right) + B_3 \tag{9}$$

Through a softmax function, the output vector is transformed into two probability values (between 0 and 1), and the larger one is converted into the prediction value 0 or 1 for judging the similarity between the seed publications and the pool publications.
The process of backpropagation is applied to optimize the model through the gradient descent algorithm. Common gradient descent algorithms include full gradient descent (FGD) and batch gradient descent (BGD), which have a relatively stable parameter updating direction but slow convergence speed (Huo; Huang, 2017). Stochastic gradient descent (SGD) improves the convergence speed of the first two methods by selecting training samples randomly in each iteration, but has the disadvantage of unstable parameter updating (Bottou, 2012). The adaptive gradient algorithm (AdaGrad) can obtain efficient and relatively correct results by constantly adjusting the learning rate ($\eta$) in the iterative process, although calculation errors caused by the sharp decrease of the gradient are hard to avoid (Liu et al., 2020). In the current study, root-mean-square propagation (RMSprop), a variant of AdaGrad, is used as the optimization algorithm to construct the prediction model with a learning rate of 0.001 (Duchi et al., 2011; Dauphin et al., 2015). The formula is as follows:

$$r \leftarrow \rho\, r + (1 - \rho)\, g \odot g, \qquad \theta \leftarrow \theta - \frac{\eta}{\sqrt{\delta + r}} \odot g \tag{10}$$

where $g$ is the gradient of the loss with respect to the model parameter, and $\delta$, $\theta$, $\rho$, and $r$ are a small constant, the model parameter, the exponential decay rate, and the gradient accumulation, respectively. The final training model is obtained with 97.5% prediction accuracy on the basis of the test data.
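The RMSprop update can be illustrated on a toy problem. The sketch below uses the learning rate of 0.001 reported in the text; the decay rate of 0.9 and the constant 1e-6 are assumed values, since the source does not report them, and the quadratic objective merely stands in for the network loss.

```python
import math


def rmsprop_step(theta, grad, r, lr=0.001, rho=0.9, delta=1e-6):
    """One RMSprop update.

    r     <- rho * r + (1 - rho) * g * g      (running mean of squared gradient)
    theta <- theta - lr * g / sqrt(delta + r)
    rho and delta are assumed defaults, not values reported in the source.
    """
    r = [rho * ri + (1 - rho) * g * g for ri, g in zip(r, grad)]
    theta = [
        t - lr * g / math.sqrt(delta + ri)
        for t, g, ri in zip(theta, grad, r)
    ]
    return theta, r


def toy_gradient(theta):
    """Gradient of f(theta) = (theta0 - 3)^2 + (theta1 + 1)^2."""
    return [2 * (theta[0] - 3), 2 * (theta[1] + 1)]


theta = [0.0, 0.0]
r = [0.0, 0.0]
for _ in range(6000):
    theta, r = rmsprop_step(theta, toy_gradient(theta), r)

print([round(t, 2) for t in theta])
```

Because each step is normalized by the recent gradient magnitude, both coordinates move at a similar pace even though their initial gradients differ, which is the property that motivates the method.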

Constructing two sets of expanded publications
Before running the above models, the following steps are required to identify the shared references. The first step is to extract the shared references by comparing the DOIs of each reference pair between the seed publications and the pool publications. For references that cannot be matched by DOIs, other information in the CR field (i.e., author, year, journal, volume, and issue) can be used for fuzzy matching by adopting methods such as ignoring punctuation differences, N-gram, and cosine similarity (Abdulhayoglu et al., 2016; Glänzel; Czerwon, 1996; Sen; Gan, 1983). The current study adopts the method of ignoring punctuation differences. By running the above models (SVM and FCNN), two sets of expanded publications are obtained (Figure 6). The training of the SVM and the FCNN was performed on an NVIDIA GeForce GTX 1070 with an AMD Ryzen 1700. Sklearn 1.0.2 and Tensorflow 2.5.0 were used for the SVM and FCNN, respectively. With less than 500 MB of memory, training each of the models takes less than 60 seconds. The three sets of expanded publications obtained by using the methods of Rakas and Hain (2019), SVM, and FCNN are shown in Figure 7. Overlaps exist between any two datasets, but no set fully contains another. With 88,221 documents, the coverage of the method of Rakas and Hain is the widest, followed successively by that of SVM (22,138) and FCNN (15,925).
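The two-stage reference matching described above (DOI first, then the CR string with punctuation ignored) can be sketched as follows. The record layout with `doi` and `cr` keys and the example references are hypothetical.

```python
import re


def normalize_reference(ref):
    """Lowercase and replace punctuation with spaces so variants compare equal."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", ref.lower())
    return " ".join(cleaned.split())


def reference_key(ref):
    """Prefer the DOI as key; fall back to the normalized CR string."""
    doi = (ref.get("doi") or "").strip().lower()
    if doi:
        return "doi:" + doi
    text = normalize_reference(ref.get("cr", ""))
    return "cr:" + text if text else None


def shared_references(seed_refs, pool_refs):
    """Count references that appear in both reference lists."""
    seed_keys = {reference_key(r) for r in seed_refs} - {None}
    pool_keys = {reference_key(r) for r in pool_refs} - {None}
    return len(seed_keys & pool_keys)


# Hypothetical reference lists, for illustration only.
seed_refs = [
    {"doi": "10.1000/ABC"},
    {"cr": "Smith J, 2001, J. Hypothetical Stud., V12, P34"},
]
pool_refs = [
    {"doi": "10.1000/abc"},
    {"cr": "SMITH J 2001 J HYPOTHETICAL STUD V12 P34"},
    {"cr": "Doe A, 1999, Imaginary Rev, V3, P1"},
]
print(shared_references(seed_refs, pool_refs))
```

One limitation of this sketch is that a reference recorded with a DOI in one list and without a DOI in the other will not be matched; a fuller version would also compare CR strings for DOI-bearing records.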
The accuracy and comprehensiveness of the three sets are presented in Table 6. The evaluation report of the models (e.g., accuracy, precision, recall, f1-score) represents the fitting effect of each model, and the topic similarity (i.e., Jaccard, Word2vec, and TF-IDF) represents the semantic similarity between each seed document and its expanded documents, calculated according to the information in the title, abstract, keywords, and keywords plus.
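For reference, the four evaluation measures named above can be computed from manual labels and model predictions as in the following sketch. The label vectors are hypothetical and do not reproduce the values of Table 6.

```python
def classification_report(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for the positive (similar) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical manual labels and model predictions for eight pairs.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
report = classification_report(y_true, y_pred)
print(report)
```

Because similar pairs are the minority class in the labeled sample, precision and recall on that class are more informative than accuracy alone.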

Publication expansion based on topic similarity
Selecting expanded publications simply on the basis of shared references may neglect publication pairs that do not have enough shared references but do study similar topics. Therefore, we apply both shared references and topic similarity to select publications from the pool set. In other words, publications in the expanded set are selected first by shared references and then by topic similarity. The topic similarity method is a three-step process.
Step 1 is to calculate the topic similarity between the expanded publications obtained through shared references and their corresponding seed publications.The value of 25% (in line with Table 6) will be used to screen out publication pairs in step 3.
Step 2 is to pair the rest of the publications (excluding those already screened out through shared references) in the pool set with those in the seed set, and calculate the TF-IDF value of each pair.
Step 3 is to screen out those with topic similarity greater than 25% and add them to the expanded set.
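The topic-similarity screening can be sketched as TF-IDF vectors compared by cosine similarity against the 25% threshold. The source states only that a TF-IDF value is calculated for each pair, so the cosine comparison, the smoothed inverse document frequency, the tokenizer, and the example texts below are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed inverse document frequency (an assumed weighting scheme).
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: tf[t] * idf[t] for t in tf})
    return vectors


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def expand_by_topic(seed_texts, pool_texts, threshold=0.25):
    """Return indices of pool texts whose similarity to any seed exceeds 25%."""
    vectors = tfidf_vectors(seed_texts + pool_texts)
    seed_vecs = vectors[: len(seed_texts)]
    pool_vecs = vectors[len(seed_texts):]
    return [
        i
        for i, pv in enumerate(pool_vecs)
        if any(cosine(sv, pv) > threshold for sv in seed_vecs)
    ]


# Hypothetical texts standing in for title + abstract + keywords.
seed_texts = ["public administration reform and local government accountability"]
pool_texts = [
    "local government accountability and administration reform in cities",
    "thermal efficiency of photovoltaic panels",
]
selected = expand_by_topic(seed_texts, pool_texts)
print(selected)
```

Comparing every pool publication with every seed is quadratic in the set sizes; at the scale of the study an inverted index or a sparse matrix product would be the usual way to make this tractable.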
By integrating the expanded publications on the basis of shared references and topic similarity, an expanded set ( public, environmental and occupational health, information science and library science, etc. (Table 7). Compared with the pool set, the importance of some disciplines in this set has increased, for example, public administration (8.8%), health policy and services (4.8%), and management (4.6%).

Step 4. Construct a final corpus of publications
Finally, the seed publications and expanded publications are merged, and duplicates, reviews, and uncited publications are excluded (Figure 8). The final corpus contains a total of 55,345 publications, covering 45 WoS subject categories and 116 journals. The research relates to scientific research management and innovation research, public health and health, safety and risk management, organizational management and human resource management, land resource management, water resource management, climate change, energy and environmental governance, governance research, education, social issues research, and other content connected to public affairs.
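The merging rule in this step can be illustrated with a short sketch. The record fields (`id`, `doc_type`, `times_cited`) and the function name are hypothetical and chosen only for the example; they are not the field names of any particular database export.

```python
def build_final_corpus(seed, expanded):
    """Merge seed and expanded records, dropping duplicates, reviews,
    and publications that have never been cited."""
    seen_ids = set()
    corpus = []
    for record in list(seed) + list(expanded):
        if record["id"] in seen_ids:
            continue  # duplicate of a record already processed
        seen_ids.add(record["id"])
        if record.get("doc_type") == "Review":
            continue  # reviews are excluded
        if record.get("times_cited", 0) == 0:
            continue  # uncited publications are excluded
        corpus.append(record)
    return corpus


# Toy records, invented for illustration.
seed_set = [
    {"id": "A", "doc_type": "Article", "times_cited": 5},
    {"id": "B", "doc_type": "Review", "times_cited": 9},
]
expanded_set = [
    {"id": "A", "doc_type": "Article", "times_cited": 5},
    {"id": "C", "doc_type": "Article", "times_cited": 0},
    {"id": "D", "doc_type": "Article", "times_cited": 2},
]
final_corpus = build_final_corpus(seed_set, expanded_set)
print([record["id"] for record in final_corpus])
```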
In short, the multistep method has clear advantages. The seed set contains the core content of PA, while the expanded set captures its dynamic expansion. Because of this inclusiveness, the method does not provide a sharp boundary for PA, but rather a blurry one that tends to include publications from adjacent fields that are also substantially interrelated with PA.

Discussion and conclusions
To solve the problem of how to retrieve publications in interdisciplinary research, this study proposes a data-driven and multi-round iterative retrieval strategy that may reduce the deviation in retrieval results caused by single-route methods such as subject retrieval or journal retrieval. Selecting a reasonable method for expanding publication coverage is critical for the retrieval strategy because it directly affects the accuracy of the final retrieved results. Compared with the methods of setting an absolute number or relative threshold (e.g., Rakas; Hain, 2019) for expanding coverage, the FCNN model performs better because it addresses the difficulty of comparing the similarity of publications caused by large differences in the number of references. Furthermore, it improves the accuracy of determining similar publications. The proposed search strategy can be applied to the retrieval of any interdisciplinary publications. It must, however, be adjusted according to the characteristics of the target disciplines.
The third step of the search strategy, establishing an expanded set of publications, is somewhat complicated and requires knowledge of computer science and bibliometrics, which may, to some extent, limit the application of this methodology. To solve this problem, we plan to develop user-friendly guidelines and to make them freely accessible, so that researchers with little knowledge of computer science and bibliometrics may apply the guidelines to collect an expanded set of publications in the interdisciplinary area of their choosing.

Figure 1. Search framework for publications in public affairs

Figure 2. Annual development of the seed publications

Figure 3. Discipline distribution of the seed publications

Compared with methods such as setting percentage or quantity thresholds, this model can capture high-dimensional data features effectively. Since its inception, the SVM has been widely used in areas such as facial recognition, text classification, and data classification (Qin; He, 2005; Sun et al., 2002; Chen et al., 2001; Campbell et al., 2006; Srivastava; Bhambhu, 2010). This study approaches the binary classification problem by using the SVM model. We divided the sample data into two parts: the training data (400 publication pairs) for estimating the model parameters, and the testing data (100 publication pairs) for evaluating the model accuracy. The training and test data were randomly selected in a ratio of 4:1, and model fitting was repeated 10 times. The given training samples, D = {(x_1, y_1), (x_2, y_2), …, (x_s, y_s)}, where s is the number of samples, are put into either the negative or the positive category as labeled (y ∈ {−1, 1}). The model identifies the optimal hyperplane, which requires maximizing the sum of sample-plane distances by dynamically adjusting the hyperplane parameters after placing all the training samples onto both sides of the hyperplane. The expression for the hyperplane is
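As a hedged illustration, the separating hyperplane of a linear SVM is conventionally written as in the first expression below, and the hard-margin training problem as in the second. The notation (weight vector w, bias b) is the standard textbook form and is assumed here; it is not taken from the study itself. The symbols x_i, y_i, and s follow the training-sample definition given above.

```latex
% Separating hyperplane (standard form, assumed notation)
\[
  \mathbf{w}^{\top}\mathbf{x} + b = 0
\]

% Hard-margin SVM: maximize the margin 2 / ||w|| by minimizing ||w||^2 / 2
\[
  \min_{\mathbf{w},\,b}\ \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2}
  \quad \text{subject to} \quad
  y_i \left( \mathbf{w}^{\top}\mathbf{x}_i + b \right) \ge 1,
  \qquad i = 1, \dots, s
\]
```

Under this formulation, a sample x is assigned to the positive category when w^T x + b > 0 and to the negative category otherwise.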

Figure 4. Support vector machine model

Table 1. Distribution of top 20 disciplines in the pool

Table 2. Determination of similar publications based on an absolute threshold
Note: Calculated according to the method of Rakas and Hain (2019).

Table 3. Determination of similar publications based on a relative threshold

Table 4. Descriptive statistics on the number of references of 500 publication pairs

Table 5. Spearman's correlations of variables
Note: * and ** denote the 5% and 1% significance levels, respectively. The numbers shown in parentheses are p-values.

Table 6. Results of the three publication expansion models

Table 7. Distribution of the top 20 disciplines in the expanded set