Bayesian Analysis on the Spatial Difference of Input Risk of Overseas Cases of COVID-19 in China

: To analyze the spatial difference of COVID-19 import risk is helpful for scientific prevention and control. On the basis of clustering 25 provinces and cities with epidemic input in study time, a multinomial distribution model was established under the Bayesian framework. All parameters Bayesian estimation was obtained by MCMC method. 25 provinces and cities with overseas input were divided into 9 categories from March 3 to April 23


Introduction
As of 2020-3-28, there were 11 days in Wuhan, the worst affected area of the epidemic in China, with 0 new cases.On the same day, the Ministry of Foreign Affairs and the State Administration of Immigration issued a notice suspending the entry of foreigners holding valid Chinese visas and residence permits from 0:00 on March 28.2020-4-8, Wuhan lifted the blockade, but the number of imported cases abroad was 1103.By 2020-4-23, 1618 cases had been recorded for overseas imports.China's epidemic prevention work has shifted to the importation of overseas cases (including overseas importation of associated cases) and asymptomatic infection prevention and control.Therefore, the spatial difference analysis of the risk of overseas imported cases can serve the decision-making of epidemic prevention and control in various provinces (cities).
The existing quantitative study of COVID-19 mainly focuses on the description statistics of epidemic trends [1,2] and trend estimation.For examples, estimates of the size of the COVID-19 outbreak were made using SEIR or modified SEIR models [3][4][5], self-regression moving average models (ARIMAs) [6], random transmission models [7], etc; COVID-19 regeneration estimation was carried out using the Marcof Monte Carlo Method (MCMC) [8][9] and so on.Bayesian method has some advantages in data analysis in the field of medicine, the uncertainty parameter value can be quantified under the premise of known data, and the quantitative accuracy of parameters can be improved by a priori information and data information [10].For example, Han Ke et al. [11] used the Poisson distribution model under bayesian framework to estimate the number of COVID-19 regenerations in first-tier cities.On the study of spatial differences in the COVID-19 outbreak, the Johns Hopkins Center for Systems Science and Engineering produced a map of the global outbreak [12], with a maximum of more than 2 billion daily visits.Guan Weijie etc [13] and Qi Cuifang etc [14] carried on the study of nationwide clinic characteristic of epidemic situation and COVID-19 inter-provincial communication and influencing factor analysis respectively, and both of them used GIS mapping to reflect the distribution of confirmed cases in each province and city.Yi Dali [15] and others carried out cluster analysis with the epidemic data in 34 provinces and cities from 2020-1-19 to 2-16, a total of 6 categories, among them, it was the high risk areas of Hubei Province and Henan Province that needed to be strictly controlled.According to Tencent location data and Baidu migration data, Liu Zhang [16] et al. have completed the spatial distribution estimate of people who moved out of Wuhan during the COVID-19 outbreak.In the scale of Wuhan City 1140 traffic analysis area, Feng Mingxiang [17] and others carried out COVID-19 space-time diffusion estimate combined with mobile phone user space interaction data.GIS, multi-source data and big data platform are effective methods for the study of COVID-19 outbreak simulation and spatial distribution differences.However, the analysis and processing of zero expansion data, missing data and short-term data by these methods often result in a great deviation from the actual situation.At present, foreign imported case data in China have the characteristics of zero expansion, geographical absence, data size differences, the result may deviate from the actual situation of the current domestic and foreign input if using the traditional method for analysis and processing.Based on the number of overseas imported cases from 2020-3-3 to 4-23 and the clustering results, this paper constructs a multinomial distribution model of the probability of input risk in each province and city under the framework of Bayesian, solves the model by MCMC method, and analyzes the spatial differences in the input risk of cases of COVID-19 outside China, The research results are expected to serve for epidemic prevention and control abroad.

Data and Methods
Collect daily new case data from 25 provinces (cities) in China involved in overseas imported cases from 2020-3-3 to 4-23 as a sample (data from the Bulletin of the National Health And Health Commission and the Provincial and Municipal Health and Construction Commission).The zero expansion characteristics of this data are obvious, the daily data of provinces and cities are different, and the regional differences are obvious.For example, among 25 provinces and cities, Heilongjiang imported 86 cases in 2020-4-7, and it was the largest number of imported cases day by day, however there were 0 imported cases in 18 provinces in the same day.From the frequency (proportion) point of view, the frequency of foreign import cases was 0, but this could not be explained that there were no risk of overseas import in 4-7 in these 18 provinces and cities.If we choose the methods used in the literatures to deal with these data, it inevitably results in most time points not in line with the actual situation.
By using two clustering methods to cluster data from 25 provinces and cities with overseas outbreak input in China, the results can be obtained relatively close.Based on the cluster results, the model of the probability of input risk abroad under Bayesian is established, the appropriate Dirichlet distribution is used as its priori distribution, the model parameters (input risk) are solved by MCMC method, and the GIS mapping is used to reveal the spatial difference of the input risk of COVID-19 cases abroad.The software used are R language and OpenBUGS, and the map mapping software are Adobe Illustrator and Photoshop.

Clustering of Imported Cases from Abroad
Programs are written in R language to cluster sample data (2020-3-3 to 4-23) from 25 provinces and municipalities.The name of the province and city is considered an indicator, the similarity coefficient between the two indicators , , ( ≠ the square root of two norm of the vector [18], and the distance between the two variables is = 1 − .
Usually the results of different clustering methods are different, and the determination of the number of classifications is not yet fully resolved.In the actual study, according to the purpose of the study, we generally choose a variety of methods of clustering, through comparative analysis, to determine the final clustering method and classification number to get better results [18].
Through the comparative analysis of various clustering results, it is found that the clustering results of the class average method (Average) and the similar method (Mcquitty) are close, and the average method of the class is not concentrated or expanded, which is the clustering method recommended by many literatures [19].By combining the advantages of the two clusters, the following clustering results are obtained: Category I (G ): Zhejiang, Beijing, Shanghai, Guangdong, Sichuan, Shandong, Yunnan, Guangxi, Fujian, Tianjin, Liaoning, Jiangsu, a total of 12 provinces and cities.The corresponding sample data from 2020-3-3 began to form a data matrix of the lower trapezoidal structure, the largest number of cases, up to 917 cases.
Category III (G ): Shaanxi, Jilin, 2 provinces.Input cases were concentrated in 3-21 to 4-23, with continuous importation of overseas cases, a total of 49 cases.
Category VI G ! : Inner Mongolia.Input cases were concentrated in 3-24 to 4-15 cases, a total of 118 cases.
Category VII G " : Gansu.Input cases were concentrated in 3-5 to 4-5 cases, a total of 47 cases.
Category VIII G # : Henan.Input cases were concentrated in 3-11 to 3-25 cases, a total of 3 cases.
Category IX G $ : Anhui.Only one input case in 4-8.Note: Use Chinese Pinyin as the variable name of 25 provinces or cities, in order to avoid the same name in Pinyin, use shangxi to represent Shanxi

Construction of Model
Any confirmed case of new import outside the province i of t-day is indicated by % & , the '( & indicates the number of new cases imported from outside on the t-day, and the number of cases entered from the province of category j on t-day is indicated by . The risk of foreign input for category j in t-day is defined as: . This probability expresses the risk of foreign input for category j in t-day.The higher the value, the greater the risk of the overseas COVID-19 input of category j on t-day, the greater the pressure of prevention and control from overseas input.The number '( & of new overseas imported cases in the nine categories on day t-day is considered as '( & independent tests, and each confirmed case imported from abroad is considered to be a test.The test result can only belong to one of the nine categories, the risk probability of the case belonging to category is 1 & and it is an evaluatable parameter.Vectors consisting of the parameters to be evaluated and data of new overseas imported cases in the nine categories on t-day are recorded as 7 & and 898 & , namely: By the definition of multinomail distribution, we get ~( , ) data M oe θ and the likelihood function is: where . According to proposals such as NguyenX (2016) [20], the Dirichlet distribution is used as a priori distribution, and the number of provinces and cities contained in each category in the clustering results is used a priori information to select prior parameters, namely, M 12,3,2,2,2,1,1,1,1 : , The priori density is , and we obtain the post-test exact distribution of parameter 7 & : and the corresponding post-test density function is By ( 2) we can get the full condition distribution of nine parameters, for example, in the case of 7 & , the corresponding full-condition distribution is  3) is a standard distribution, the other eight full-condition distributions are also beta distribution.We can use Gibbs sampling to obtain all the parameters (a total of days, 9 categories per day, 52 × 9 = 468 parameters) of the post-test sample, and complete Bayesian inference.

Solution of the Model
The solving of the model is completed in the OpenBUGS software environment.After seeding random numbers, the software automatically generates the initial value of parameters.In order to reduce the self-correlation among the post-test samples and to ensure the convergence of the MC chain, the sampling step of the sampling interval is 100.The orderly loosening algorithm [21] is used to eliminate random walking in the MCMC.After 10 iterations, posterrior samples (468 MC chains) of 468 parameters to be evaluated was obtained.After each chain throws away the first 4999 samples of the burnin period, the parameters 7 are inferred by the MC method using the remaining samples.The corresponding parameter estimates are shown in Table 1 (9 categories, 52 values per category).Table includes mean estimates for each parameter (Mean), median estimates (Median), 95% confidence interval CI: (2.5% ql, 97.5%qu), standard deviation (SD), and MC error (MCerror).See table 2 for the MC errors of category I to IX overseas input with greatest risk and the time point of occurrence.Each has a maximum MC error of less than 5.56×10 -4 .The maximum of (max.MCerror/sd)*100%is 0.677% (far smaller than 10%), which means that the model (2) has high precision [22].

Diagnosis of Model
For the MC chain of 468 parameters in model ( 2), the History-strace-plot plot, the density estimation graph, and the Autocorrelation graph are drawn respectively (in the case of 1 , see Figure 1, the remaining 467 plots are omitted, the same below).The History-strace-plots of 468 parameters show that after discarding the first 4999 burnin values, the MC chain of 468 parameters converge and each limit distribution is their own posterior distribution.The autocorred graph of each MC chain shows that after the lag period ≥2, the autocorrelation coefficient is close to 0, and each MC chain can be regarded as a MC chain of the independent and identically distribution.Statistical inferences can be made using the corresponding MC chain as a posterior sample (see Figure 1).the category I 1 &, presents symmetrical distribution characteristics (it is also proved by 52 box diagrams of the first category in Figure 2), and the probability of the other nine categories presents a more severe right-biased distribution, with mean estimates being more affected by extreme values.Their robust estimates (median value) and corresponding 95% confidence interval are taken as the risk probabilities of the nine categories for discussion.

Interpretation of Result
The 52-day 2020-3-3 to 4-23 days are considered to be 52 time points, with 0 new cases of overseas input in nine categories with many time points.The frequency (proportion) of overseas input is calculated with sample data, and the input frequency of the corresponding time point is 0 or 1.However, this does not mean that the risk of overseas input in these points is 0 or that the risk of overseas input is 100%.According to probability theory, the estimation error is large at these time points corresponding to these extreme values [23].The calculation frequency of each of the five time points (2020-3-9, 3-11, 3-12, and 4-20) in Table 2 is 0, which is also explained by the maximum MC error (the second column in Table 2).Using the model (2) calculated by the nine categories of overseas input risk (converted to percentages).The nine categories show the risk change characteristics of four stages.In order to clearly express the change characteristics of each stage, the four categories (G1, G2, G4, G6) with large overseas input risks and the remaining five categories are respectively plotted as point and line graphs.The statistical values of each category are calculated (Table 3).The results show that: (1) In the first stage (2020-3-3 (No. 1) to 3-21 (No. 21)), except for the rapid rise of category G1 and the rapid decline of category G7, other types of input risks decline slowly.(2) In the second stage (2020-3-22 (No. 22) to 4-7 (No. 39)), except for the rapid decline of category G1 and the rapid rise of category G4, other types of input risks rose slowly and gradually declined after seven days.(3)In the third stage (2020-4-8 (No. 40) to 4-18 (No. 49)), the G4 category of overseas input risks declined rapidly, while other types of input risks showed an obvious upward trend.(4) In the fourth stage (after 4-18 (No. 50)), the input risks of each category fluctuate steadily (Figure 3).The overseas input risk probability of nine categories is between 0 and 1, which indicates that the overseas input risk probability obtained by model ( 2) can truly reflect the spatial distribution of overseas input cases, which is more practical than the frequency to describe.
In order to intuitively display these characteristics on the map, to calculate the average of the probability estimates (median values) of overseas input risks in four periods as the representative values of this period (see Figure 3), and use Adobe Illustrator software to draw the time-space change chart of overseas input risks (see Figure 4).The comprehensive analysis combined with the input risk estimates (Figure 3) shows that: (1) the input frequency is 0, and there is still overseas input risk.(2) In the long run, the highest risk area is still the first category of 12 provinces and cities represented by Beijing, Shanghai and Guangdong, including 10 coastal/border provinces and cities.There are 7 days in the study period, and the risk value of the fourth category (Heilongjiang and Shanxi) is higher than that of the first category.(2) Before March 21, 2020 (the first stage), the first type of overseas input risk is the highest (59.613%).( 3) From March 22 to April 18, 2020 (the second and third stages), the top two overseas input risks are both Category I and Category IV.However, the first category gradually decreased (from 60.50% to 37.00%), and the fourth category gradually increased (from 16.1% to 33.8%).( 4) After April 19, 2020, the highest input risks are the first category, the third category (Shaanxi, Jilin), and the fourth category.The overseas input risk of other six categories are less than 5%.For the sixth, seventh, and eighth category, their overseas input risks of are stable around 2%. (5) The last two categories of overseas input risks are the eighth category (Henan) and ninth category (Anhui).( 6) The fourth category has the largest fluctuation in overseas input risk.

Conclusion and Prospect
Using two clustering methods (Average) and similar (Mcquitty), the 25 provinces and cities with overseas inputs were clustered with indicators, resulting in nine distinct categories.The 12 provinces and cities represented by Beijing, Shanghai and Guangdong are the first category, accounting for 57.42% of the total number of imported cases from abroad.The fourth category Heilongjiang and Shanxi have 12 days of new overseas imported cases more than 11 people per day, and the total proportion of overseas imported cases are 27.86%.Shaanxi and Jilin are the third category, Inner Mongolia is the sixth category, and the total proportion of imported cases abroad are 3.07% and 2.95% respectively.there are more than 98.69% of foreign imported cases in 52 days in above-mentioned 4 categories, 17 provinces and cities.Based on the clustering results, a multinomial distribution under the Bayesian framework is established.The model parameters represent the probability of overseas input risk of 9 categories every day within 52 days.The MCMC method is used to obtain the MC chain of 468 parameters, and the limit distribution is their own posterior distribution, which is suitable for statistical inference with the corresponding MC chain as posterior sample.During the study period, 25 provinces and cities have input risks.The highest risk zone of overseas input is still the first category of 12 provinces and cities represented by Beijing, Shanghai and Guangdong, including 10 coastal/border provinces and cities.During the study period, overseas input risk of the fourth category (Heilongjiang, Shanxi) is higher than the first category, and the input risk of this category fluctuates the most.With 2020-3-22, 4-7, 4-18 as the time nodes, the study period is divided into four stages.The higher risk of overseas input in the four stages is as follows: the first category in the first stage (59.613%);In the second and third stages, the first (from 60.51% to 37.06%), and the fourth category (from 16.07% to 33.85%);In the fourth stage, category I (42.62%), category III (17.56%) and category IV (10.06%).The fourth category has the largest fluctuation in overseas input risk.During the study period, overseas input risks of the eighth (Henan) and ninth (Anhui) categories have been in the last two places.
For overseas input case data with zero expansion, regional loss and large difference in data size.Bayesian multinomial distribution model based on clustering results achieves the estimation of overseas input risk, which is superior to the traditional frequency characterization method.The relevant research needs to be continued by colleagues.

Figure 1 .
Figure 1.Cluster tree of daily number of imported cases of 25 provinces or cities from March 3 to April 8, 2020.

Figure 2 .
Figure 2. Convergence diagnosis diagram.The upper is the history strace plot graph of posterior samples of 1 , the lower left is the curve graph of kernel density estimation of 1 , and the lower right is the autocorrelation graph of 1 .

Figure 3 .
Figure 3.The box plot of posterior sample of 1 to 52,1 p

Table 1 .
Summary of Bayesian estimation and some statistics of 468 parameters.
Note: Categories 6-8 are omitted from the table

Table 2 .
Summary of maximum MC error and corresponding information of each MC chain of model (2).nodeMax.