Estimating Population Total Using Machine Learning Logistic Regression: COVID-19 Pandemic Challenges Perspective

Estimating the population total in undeveloped and developing countries has attracted considerable research interest because the estimates are needed to plan resource allocation, personnel training and infrastructure in the social, health, transport, communication and education sectors. A comprehensive census is conducted every ten years in many countries, whereas government administrations change every four to five years because of constitutional term limits, so a change of administration rarely coincides with a census. Further, the COVID-19 pandemic brought ministry of health protocols on social distancing, which make the questionnaires and personal interviews commonly used in a census hard to administer. There is therefore a need for better and more reliable models for estimating the population total, which is the main focus of this study. Existing exponential and logistic models for estimating the population total have been considered and compared. The main problem with the logistic models is estimating the highest possible population that can be attained in each administrative unit. In this study a machine learning logistic regression is proposed that searches for and estimates this constant using a supervised learning process. The performance of the methods was compared using the Root Mean Square Error (RMSE), recorded as 1.062, 1.524, 0.477, 0.819 and 0.286 for the exponential, logistic I, logistic II, logistic III and machine learning logistic (logistic IV) models respectively; the proposed model performed best, with the lowest RMSE of 0.286.
The proposed model was then used to project the population total for all regions as 51.00, 55.02, 62.50, 69.10, 74.65 and 79.14 million in the years 2024, 2029, 2039, 2049, 2059 and 2069 respectively.


Introduction
Determining and estimating the population total in undeveloped and developing countries is important because it enables governments to plan the development projects promised and documented in manifestos during national election campaigns. The United Nations requires a detailed population and housing census every ten years for development purposes [1]. In Kenya, for instance, elections are conducted every five years while the Kenya Population and Housing Census is conducted every ten years. Estimating statistics such as the population total is therefore a priority for departments of economic planning and national bureaus of statistics, given the key role that census statistics play in resource allocation for the economic growth of countries [2,3]. Since most census data are collected using questionnaires and personal interviews, there is a danger of COVID-19 transmission, as maintaining social distance and avoiding the sharing of census materials is hard to achieve. Better strategies and models for estimating the population total will go a long way towards boosting the confidence of governments in obtaining better, more reliable, cheaper and timely statistics, especially during the COVID-19 pandemic, when data collection is limited by the health protocols advised by ministry of health officials and experts.

Statement of Problem
In planning for development, a census needs to be carried out to obtain the statistics that form the basis for allocating resources to government projects. Such statistics are collected every ten years because it is too expensive for most governments to conduct a census more frequently. Planning departments need timely and reliable data, especially when there is a change of government, and this can only be achieved through estimates: a census requires capital, equipment, well-trained personnel and approvals, and is only feasible and affordable after a longer interval that gives ample time to source funds from partners, train personnel and procure equipment. Attempts have recently been made to meet this requirement, but estimating the optimum population that can be attained (the constant k) in the logistic models has been a great challenge because the value is invalid in many populations. This study therefore focuses on developing a machine learning logistic regression technique that searches for a better strategy, estimates the parameters and determines the projected population totals for the Kenya administrative regions.

General Objective
The general objective of the research project is to develop a machine learning nonlinear logistic regression model for estimating the interpolated and projected Kenya population totals for the various administrative regions for development planning.

Specific Objectives
(i) Study the characteristics of the Kenya population using the population pyramid.
(ii) Study the characteristics of the Kenya population using the population life table model.
(iii) Develop a machine learning nonlinear logistic regression model for estimating the population total for various regions in Kenya.
(iv) Determine the parameters of the proposed logistic regression model.
(v) Compare the performance of the developed logistic regression model with other existing logistic regression models.
(vi) Determine the interpolated and projected population totals for existing administrative units using the proposed logistic regression model.

Hypothesis
The hypothesis guides the determination of whether the proposed model for estimating the population total is better than the other models in the same class. It is stated simply as: the machine learning logistic regression model is a better nonlinear model for estimating the population total. However, because of the large number of models being considered, a heuristic approach is taken in which performance is measured using the Root Mean Square Error (RMSE).

Justification of the Study
The research project is important because of its contribution to estimating the population total for planning in undeveloped and developing countries. Under the COVID-19 pandemic challenges, census surveys and other forms of survey are not feasible, and reliable estimation models fill the emerging gap: government projects still need to be budgeted and allocated resources, and these decisions require statistics for a fair distribution of the available resources.

Literature Review
Various researchers have attempted to model the population total. For instance, an investigation of world population growth using exponential and hyperbolic modeling estimated that the population will reach 100 billion in the year 2172 [4]. A study by Gotelli considered the geometric and exponential models and further incorporated deaths and births [5]. Another study by Kabareh et al considered the estimation of the population total using the birth and death process [6]. However, births and deaths data are unavailable in most administrative units because of beliefs, religion and nonresponse on vital statistics, so estimation using such information is possible mainly in developed countries, where the records are available and complete.
In estimating the Kenya population total for the year 2019, the study by Kabareh et al used the Lagrange polynomial and obtained an estimate of 48,533,587, which is close to, but not close enough to, the actual census value of 47,564,296 [7]. Kabareh et al also investigated the population total by comparing the piecewise polynomial approximation with the Newton backward difference polynomial approximation of the finite population total [8]. The results revealed that the piecewise polynomial is a better predicting model than the Newton backward polynomial. The logistic model has been considered for bounded populations by Kabareh et al, who estimated the bounded population total using linear regression in the presence of auxiliary information [9,10]. The results and conclusions of these past studies leave the problem open for further investigation, and this study is intended to explore and develop a better model for estimating the population total.

Methodology
The methodology first considers the existing models for estimating the population total, namely the exponential model and the class of logistic models, which are derived and discussed. The proposed machine learning logistic regression model is then developed and the estimation of its parameters discussed.

Exponential Model
The exponential model, also referred to as the logistic model by some authors, assumes that the population grows at a rate proportional to the population size; that is, in each unit of time a certain percentage of the individuals produce new individuals. If reproduction takes place more or less continuously, this growth is represented by
$$\frac{dp}{dt} = rp$$
where $p$ is the population as a function of time $t$ and $r$ is the proportionality constant [11]. This differential equation can be solved by separating the variables and integrating both sides,
$$\int \frac{dp}{p} = \int r\,dt$$
so that $\ln p = rt + c$ and hence $P_t = e^{rt + c}$, or $P_t = c_1 e^{rt}$ with $c_1 = e^{c}$. Setting $t = 0$ gives $P_0 = c_1$, so that
$$P_t = P_0 e^{rt} \qquad (1)$$
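Equation (1) can be illustrated with a minimal Python sketch. The function names and the input totals are hypothetical and chosen only for illustration; the growth rate is obtained by solving equation (1) for r from two totals, which is one simple way to fix r and is not necessarily the procedure used in the study.

```python
import math

def growth_rate(p0, pt, years):
    """Solve P_t = P_0 * exp(r * t) for r, given two totals `years` apart."""
    return math.log(pt / p0) / years

def project_exponential(p0, r, years):
    """Evaluate equation (1): P_t = P_0 * exp(r * t)."""
    return p0 * math.exp(r * years)

# Hypothetical totals in millions, ten years apart (not census values).
r = growth_rate(10.0, 13.0, 10)
projected = project_exponential(10.0, r, 20)
print(round(r, 4), round(projected, 3))
```

Because the rate is constant, the projection grows without bound, which is the motivation for the bounded logistic models that follow.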

Logistic I model
The logistic I model was originally developed by Verhulst and later studied by Pearl, R. and Reed, L. [12]. The curve in its simplest form takes the form given as [equation missing in the source], where t is the year for which the value has to be interpolated, $y_0$ is the geometric mean of the first three years of the series, $y_1$ is the geometric mean of the middle three years of the series, $y_2$ is the geometric mean of the last three years of the series and n is the duration between successive censuses.

Logistic II Method
The logistic II method is derived from the logistic I method by considering populations $N_1$, $N_2$, $N_3$ recorded at the equidistant times $t_1$, $t_2$ and $2t_2 - t_1$. Taking $N_1 = y_0$, $N_2 = y_1$ and $N_3 = y_2$ gives
$$k = \frac{2 y_0 y_1 y_2 - y_1^2 (y_0 + y_2)}{y_0 y_2 - y_1^2}$$
The constant $a$ is the y-intercept of the plot of $\log(k/p - 1)$ against $t$, that is, at $t = 0$,
$$a = \log\left(\frac{k - p_0}{p_0}\right)$$
The constant $b$ is the gradient (slope) of the same plot of $\log(k/p - 1)$ against $t$.
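The computation of k, a and b described above can be sketched in Python. This is a minimal illustration under stated assumptions: the logarithm is taken to base 10, the slope b is obtained by an ordinary least-squares line rather than read from a plot, and the function names and input series are hypothetical.

```python
import math

def carrying_capacity(y0, y1, y2):
    """Three-point estimate of k from equally spaced totals y0, y1, y2."""
    return (2 * y0 * y1 * y2 - y1 ** 2 * (y0 + y2)) / (y0 * y2 - y1 ** 2)

def fit_a_b(times, pops, k):
    """Least-squares line through (t, log10(k/p - 1)); returns (a, b)."""
    z = [math.log10(k / p - 1) for p in pops]
    n = len(times)
    t_mean = sum(times) / n
    z_mean = sum(z) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxz = sum((t - t_mean) * (zi - z_mean) for t, zi in zip(times, z))
    b = sxz / sxx                 # gradient of the line
    a = z_mean - b * t_mean       # intercept at t = 0
    return a, b

# Hypothetical series generated from k = 100, a = 1, b = -0.05.
times = [0, 10, 20]
pops = [100 / (1 + 10 ** (1 - 0.05 * t)) for t in times]
k = carrying_capacity(*pops)
a, b = fit_a_b(times, pops, k)
print(round(k, 6), round(a, 6), round(b, 6))
```

For totals that lie exactly on a logistic curve the three-point formula recovers k exactly. For real census totals it can return a value below the latest total, or a negative value, and the denominator vanishes when the three totals grow geometrically.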

Logistic III Model
The logistic III model starts from the logistic form used in logistic II. If we let $y = 10$, so that $\ln y = \ln 10$, then $y = e^{\ln 10}$ [13]. The logistic equation can now be written in the form [equation missing in the source]. Since b is the rate of growth, the logistic III method takes the form [equation missing in the source], where r is the growth rate.
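The change of base from 10 to e can be written out explicitly. The following is a sketch, not the author's stated equations: it assumes that the logistic II curve is the one implied by the straight-line plot of $\log_{10}(k/p - 1)$ against $t$, and the identification of $r$ and $t_0$ in the last line is one common parameterisation, introduced here as an assumption.

```latex
\begin{align*}
\log_{10}\!\left(\frac{k}{P_t} - 1\right) = a + bt
  \;&\Longrightarrow\; P_t = \frac{k}{1 + 10^{\,a + bt}}, \\
10^{\,a + bt} = e^{(a + bt)\ln 10}
  \;&\Longrightarrow\; P_t = \frac{k}{1 + e^{(a + bt)\ln 10}}, \\
r = -b\ln 10,\quad t_0 = -\frac{a}{b}
  \;&\Longrightarrow\; P_t = \frac{k}{1 + e^{-r\,(t - t_0)}}.
\end{align*}
```

Under this parameterisation $P_{t_0} = k/2$, so $t_0$ is the time at which the population reaches half of the highest attainable population.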

Machine Learning Logistic Regression (Logistic IV) Model
In the logistic I, logistic II and logistic III models discussed above, the problem has always been the determination of the highest population that can be attained. In quite a number of populations the value is invalid or null, and with such a value in the logistic equation the estimation of the population total is not possible. In this study we propose and develop a machine learning logistic regression method in which the search for the constant, the estimation of the parameters and the determination of the population totals are carried out by an algorithm. The development of the algorithm is illustrated in Figure 1.
The steps depicted in Figure 1 are translated into a complete step-by-step procedure, commonly referred to as an algorithm, written as Algorithm 1 below [14]. The steps for computing the highest attainable population, the population estimates and the population projections are unambiguous and finite in number. Together they represent the proposed machine learning logistic regression model, which is finally written as program code in R or MATLAB and used to compute the parameters and the projected population total estimates for the administrative units.
Algorithm 1: Machine learning logistic regression estimation of parameters
1. Begin
2. Declare variables i = 1, n, Region, N1, N2, N3, k, a, b, r, t0, E1, E2, E3, P0t, P1t, P2t, P3t
3. Input Region, N1, N2, N3, P0t
4. Compute the maximum population that can be reached, k; set i = i + 1
5. If k < N3, update N2 = 0.5 (N1 + N3) and recompute k
6. Compute a, b, r, t0, P1t, P2t, E1, E2
7. If E1 < E2 then D = 1, P3t = P1t, E3 = E1; else D = 0, P3t = P2t, E3 = E2
8. Display Region, k, a, b, E1, P1t, r, t0, E2, P2t, D, E3, P3t
9. If i <= n, repeat from step 3
The projection of the population is then estimated using the newly developed machine learning logistic regression model. The steps agreed on after a series of simulations and tests for the projection estimates are presented pictorially in Figure 2. The steps are then translated into Algorithm 2, which could be used to write R or MATLAB code for determining the projected population total for 2024-2069.
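Steps 4, 5 and 7 of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration and not the R or MATLAB program referred to in the text. It assumes that k is computed with the three-point formula of the logistic II method and that E1 and E2 are squared errors against a known total. The function names and inputs are hypothetical, and the two candidate estimates P1t and P2t are taken as given because their formulas are not reproduced here.

```python
def max_population(n1, n2, n3):
    """Steps 4-5: compute k; if k < N3, reset N2 to the midpoint and recompute."""
    def k_of(y0, y1, y2):
        return (2 * y0 * y1 * y2 - y1 ** 2 * (y0 + y2)) / (y0 * y2 - y1 ** 2)

    k = k_of(n1, n2, n3)
    if k < n3:
        n2 = 0.5 * (n1 + n3)
        k = k_of(n1, n2, n3)
    return k, n2

def select_estimate(p1, p2, actual):
    """Step 7: keep whichever candidate estimate has the smaller squared error."""
    e1 = (p1 - actual) ** 2
    e2 = (p2 - actual) ** 2
    if e1 < e2:
        return 1, p1, e1   # D = 1: first candidate selected
    return 0, p2, e2       # D = 0: second candidate selected

# Hypothetical totals (not census values): the first k is invalid, so N2 is reset.
k, n2 = max_population(2.0, 3.0, 6.0)
d, p3, e3 = select_estimate(4.9, 5.3, 5.0)
print(k, n2, d, p3)
```

Substituting N2 = 0.5 (N1 + N3) into the three-point formula gives k = N1 + N3 algebraically, so after the update in step 5 the value of k always exceeds N3 for positive totals with N1 different from N3.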

Analysis and Discussion of Results
In the analysis and discussion of the results, the logistic models are used to estimate the population total of the smaller units of the Kenya population. To understand the characteristics of the population, which help in explaining population dynamics, the population pyramid and the abridged life table are constructed for the population under consideration.

Population Pyramid
The age-sex pyramid of the population is the classification of the age structure by sex in the form of a histogram, such that the base of the pyramid shows the lowest ages starting from zero while the top represents higher ages up to a maximum of 89 years, with age intervals of five years [15]. The Kenya National Population Census conducted in 2019 recorded the distribution of the population by age and sex as shown in Table 1, and the age-sex pyramid was constructed as in Figure 3 [16]. The pyramid in Figure 3 reflects the age-sex pyramid of developing countries, whose main feature is a wide base. This is interpreted by demographers, economists and statisticians as a marked decline in the death rate because the population is becoming younger, and it is a true description of Kenya, which is classified as a developing country by the United Nations [17].

Life Table
The life table is an important technique in the field of health that characterizes the well-being of a population. Insurance companies use it to predict how long an individual will live when determining premiums for life insurance policies, and to predict the compensation payable to nuclear family members when a breadwinner is disabled or dies and is therefore no longer economically active [18]. Life tables identify the death rates experienced by a population over a given duration of time, and in the recent past have been used to make comparisons and to estimate the biological limit to life. The mortality experience of an initial population of 100,000 newborns in the Kenya population was monitored, and the abridged life table has been prepared and summarized in Table 2 [19]. The life expectancy of the Kenya population has been determined as approximately 57.7 years, with a birth rate or death rate in the stationary population of 0.017, that is, 1,700 per 100,000.
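The reported rate can be checked against the reported life expectancy. In a stationary population the crude birth rate and the crude death rate are equal, and both equal the reciprocal of the life expectancy at birth. The symbols $b$, $d$ and $e_0$ are notation introduced here for illustration.

```latex
\[
b = d = \frac{1}{e_0} = \frac{1}{57.7} \approx 0.0173
\]
```

This is about 1,730 per 100,000, consistent with the rounded figure of 0.017 quoted above.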

Estimation of the Logistic Models Parameters
In estimating the population totals, the exponential, logistic I, logistic II, logistic III and logistic IV models are considered and their performance compared to determine the better performing model. The Kenya population considered in the study comprises ten administrative units, such that each region includes more than one county except Nairobi. The Kenya Population and Housing Census recorded the population totals in the ten regions for the period 1969-2019 as shown in Table 3 [16]. The estimated parameters of the models have been determined and recorded in Table 4 for logistic I and in Table 5 for the exponential, logistic II, logistic III and logistic IV models. The logistic I model recorded a lower estimated attainable population in all the regions except Rift Valley U, Nyanza and Nairobi. In Nyanza the recorded value of 114.54 million is the highest of all the regions and is extremely high, which is an indication of weakness in the model. The logistic I model further recorded a very low value of the constant k in the North Eastern region, which is also an indication of flaws in the model for estimating the population total. However, a further performance test is required to support this claim.
The estimated population totals for the ten regions in 2019 were determined and recorded in Table 6. The exponential model recorded the highest estimated population total, followed by the logistic I, logistic III and logistic II models. The performance of the models is determined using the Root Mean Square Error (RMSE), given as
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{P}_i - P_i\right)^2}$$
where $\hat{P}_i$ is the estimated population total and $P_i$ the census population total of region $i$, and $n$ is the number of regions. The RMSE values for the exponential, logistic I, logistic II and logistic III models are recorded in Table 7. The logistic I model recorded the highest RMSE, followed by the exponential, logistic III and logistic II models, with values of 1.524, 1.062, 0.819 and 0.477 respectively. In the search for the parameters of logistic IV using machine learning, the first step is to take the two models with the better estimates under the RMSE measure, namely the logistic II and logistic III models. The second step is to create a dummy variable that is used to select the parameters through a classification or discriminant criterion; this forms the whole process of developing the logistic IV model [20]. Consider a random variable $d_i$ that takes the value 1 if the logistic II model has the lower square error, in which case its parameters are used for estimating the population total, and takes the value 0 otherwise, in which case the logistic III model is used [21]. It is observed that the logistic II model is used for estimating the population totals of the North Eastern, Central and Rift Valley L regions, while the logistic III model is used for the remaining regions. The estimated population totals for the logistic IV model are recorded in Table 6, while the Square Errors (SE) and Root Mean Square Errors are recorded in Table 7. The logistic IV model records the lowest population total estimate of 49.047 million and the lowest RMSE of 0.286, which indicates that the developed logistic model is the better estimating model.
The Square Errors of all regions for the models are depicted in Figure 4, in which the exponential and logistic I models record high Square Error values compared to the logistic II, logistic III and logistic IV models. Since the logistic IV model performs better than all the other models, it is given higher priority in estimating the population total for the regions. The first step in defining the model is to estimate its parameters, which are recorded in Table 8. The choice of model and parameters is believed to bridge the gap in obtaining better population total estimates for development.
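The RMSE comparison and the dummy-variable selection can be sketched in Python. This is a minimal illustration under stated assumptions: RMSE is taken in its standard form over regions, the function names are hypothetical, and the input values are invented for illustration and are not the Table 6 values.

```python
import math

def rmse(estimates, actuals):
    """Root mean square error over regions."""
    n = len(actuals)
    return math.sqrt(sum((e - a) ** 2 for e, a in zip(estimates, actuals)) / n)

def logistic_iv(est_ii, est_iii, actuals):
    """Per-region choice: d_i = 1 picks logistic II, d_i = 0 picks logistic III."""
    d = [1 if (p2 - y) ** 2 < (p3 - y) ** 2 else 0
         for p2, p3, y in zip(est_ii, est_iii, actuals)]
    combined = [di * p2 + (1 - di) * p3
                for di, p2, p3 in zip(d, est_ii, est_iii)]
    return d, combined

# Hypothetical totals in millions for three regions.
actual = [5.0, 3.0, 8.0]
est_ii = [5.4, 3.1, 8.9]
est_iii = [5.1, 3.6, 8.3]
d, est_iv = logistic_iv(est_ii, est_iii, actual)
print(d, est_iv)
print(rmse(est_ii, actual), rmse(est_iii, actual), rmse(est_iv, actual))
```

When the selection is made against the same totals that are used for evaluation, the combined series cannot have a larger RMSE than either of its two inputs, which is worth keeping in mind when reading the comparison.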

Population Projection for Regions in Kenya 2024-2069
Estimating the future population total is important for economic and infrastructure planning, so that governments can keep economic growth in pace with population growth. The population projections for all regions have been determined using the developed logistic IV model and recorded in Table 9. It is projected that the population total for the nation will be 55.02, 62.50, 69.10, 74.65 and 79.14 million in 2029, 2039, 2049, 2059 and 2069 respectively. The projected population for 2029 is represented on the Kenya map in Figure 5 to show the aggregate population for the ten regions. The Eastern region has the highest projected population, Rift Valley U the second highest and Nyanza the third highest, while Rift Valley L has the lowest projected population. The projected Kenya population for 2029-2069 for the ten regions has been computed and presented in Figure 6. The regions recorded mixed projections: North Eastern, Rift Valley L and Western recorded lower increases in projected population total over the period under consideration, while Coast, Eastern and Rift Valley U recorded higher increases over the same period.

Conclusion
In this research project, a machine learning logistic regression model has been developed that may be used for estimating the population total. The model, referred to as the logistic IV model, performed better than the other models in the same class. Its RMSE of 0.286 is lower than the values of 1.062, 1.524, 0.477 and 0.819 recorded for the exponential, logistic I, logistic II and logistic III models respectively. For the future, the machine learning logistic model projects that the current population total of all ten regions, 47.564 million, will grow by about 57 percent in 40 years, to a projected population total of 74.655 million in 2059.

Recommendation
The developed machine learning logistic regression model is recommended for estimating the population total of smaller administrative units of countries, for instance the forty-seven counties in Kenya, which have not fully developed statistics databases for planning.

Further Research
The machine learning logistic regression model could be improved by re-estimating the parameters before every projection; that is, the parameters (growth rate, time to reach half of the highest possible population, and highest possible population size) would be updated after every interpolation or projection of the population. Further, estimating the population totals for smaller administrative units, for instance counties, districts or divisions, will probably lead to an overall improvement of the estimated population totals.