Kernel-Type Estimators of Divergence Measures and Its Strong Uniform Consistency
Hamza Dhaker1, *, Papa Ngom1, El Hadji Deme2, Pierre Mendy3
1Departement de Mathématiques et Informatique, Faculté des Sciences et Technique, Université Cheikh Anta Diop, Dakar, Sénégal
2Sciences Appliquées et Technologie, Unité de Formation et de Recherche, Université Gaston Berger, Saint-Louis, Sénégal
3Département de Techniques Quantitatives, Faculté des Sciences Economiques et de Gestion, Université Cheikh Anta Diop, Dakar, Sénégal
To cite this article:
Hamza Dhaker, Papa Ngom, El Hadji Deme, Pierre Mendy. Kernel-Type Estimators of Divergence Measures and Its Strong Uniform Consistency. American Journal of Theoretical and Applied Statistics. Vol. 5, No. 1, 2016, pp. 13-22. doi: 10.11648/j.ajtas.20160501.13
Abstract: Nonparametric density estimation, based on kernel-type estimators, is a very popular method in statistical research, especially when we want to model the probabilistic or stochastic structure of a data set. In this paper, we investigate asymptotic confidence bands for the distribution, using kernel estimators of some divergence measures (the Rényi-α and Tsallis-α divergences). Our aim is to use a method based on empirical process techniques in order to derive some asymptotic results. Under different assumptions, we establish a variety of fundamental theoretical properties, such as the uniform-in-bandwidth strong consistency of the divergence estimators. We further apply the previous results to simulated examples, including kernel-type estimators for the Hellinger, Bhattacharyya and Kullback-Leibler divergences, to illustrate this approach, and we show that the method performs competitively.
Keywords: Divergence Measures, Kernel Estimation, Strong Uniform Consistency
1. Introduction
In this paper, we focus on the similarity between two distributions. Given a sample from one distribution, one of the fundamental and classical questions to ask is: how similar is its density to another, known density? First, one must specify what it means for two distributions to be close; many different measures quantifying this degree of similarity have been studied in the past. They are frequently called distance measures, although some of them are not strictly metrics. Divergence measures play an important role in statistical theory, especially in the large theories of estimation and testing. They have been applied to different areas, such as medical image registration, classification and retrieval. There are several important problems in machine learning and statistics that require the estimation of the distance or divergence between distributions. Divergence between distributions also proves to be useful in neuroscience; for example, divergence has been employed to quantify the difference between neural response patterns (see, e.g., ).
Many papers have since appeared in the literature where divergence- or entropy-type measures of information are used in testing statistical hypotheses. For more examples and other possible applications of divergence measures, see the extended technical report [23, 24]. Given the key role of divergence measures in these various applications, it is necessary to estimate these divergences accurately.
Recently, Ngom et al.  introduced the Divergence Indicator method, proposing a test for choosing between a random walk and an AR(1) model based on a divergence measure.
The class of divergence measures is large; it includes the Rényi-α [25, 26], Tsallis-α, Kullback-Leibler (KL), Hellinger, Bhattacharyya and Euclidean divergences, among others. These divergence measures can be related to the Csiszár divergence . The Kullback-Leibler, Hellinger and Bhattacharyya divergences are special cases of the Rényi-α and Tsallis-α divergences, with the Kullback-Leibler divergence being the most popular of them. The estimation of divergences and its applications have been studied using different approaches. For example, Pardo  presented methods and applications in the case of discrete distributions. Exploring a nonparametric method for estimating divergences in the continuous case, Poczos and Schneider  proposed a k-nearest-neighbor estimator and proved its weak consistency for the Rényi-α and Tsallis-α divergences.
Finding nonparametric estimators of divergence measures remains an open issue. Krishnamurthy and Kandasamy  corrected an initial plug-in estimator by estimates of the higher-order terms in the von Mises expansion of the divergence functional. In their framework, they proposed three estimators, for the Rényi-α, Tsallis-α and Euclidean divergences between two continuous distributions, and established the rates of convergence of these estimators. The main purpose of this paper is to analyze estimators for divergence measures between two continuous distributions. Our approach is similar to that of Krishnamurthy and Kandasamy  and is based on a plug-in estimation scheme: first, we apply a consistent density estimator for the underlying densities, and then we plug them into the desired formulas. Unlike their framework, however, we study the strong consistency of estimators for a general class of divergence measures. We emphasize that plug-in estimation techniques are heavily used by [2, 9] in the case of entropy. Bouzebda  proposed a method to establish consistency for kernel-type estimators of the differential entropy. We generalize this method to a large class of divergence measures in order to establish the consistency of kernel-type estimators of divergence measures when the bandwidth is allowed to range in a small interval, which may decrease in length with the sample size. Our results are immediately applicable to proving the strong consistency of kernel-type estimators of this class of divergence measures.
The rest of this paper is organized as follows: in Section 2, we introduce the divergence measures and construct their kernel-type estimators. In Section 3, we study the uniform strong consistency of the proposed estimators. Section 4 is devoted to the proofs. In Section 5, numerical examples are proposed in order to illustrate the performance of our method. Finally, in Section 6, we present our conclusions.
2. Kernel-Type Estimators of Divergence Measures
In this section, we give the notation and present some basic definitions. We are interested in two densities f, g : ℝ^d → ℝ, where d denotes the dimension. The divergence measures of interest, the Rényi-α and Tsallis-α divergences, are defined for α ≠ 1 respectively as follows:

D_α(f, g) = (1 / (α − 1)) log ∫ f^α(x) g^{1−α}(x) dx,   (1)

T_α(f, g) = (1 / (α − 1)) ( ∫ f^α(x) g^{1−α}(x) dx − 1 ).   (2)
These quantities are nonnegative, and equal zero iff f = g almost surely (a.s.). Remark that for the special values α = 1/2 and α → 1, we obtain from (1) and (2) the well-known Hellinger, Kullback-Leibler and Bhattacharyya divergences.
which is related to the Shannon entropy. For some statistical properties of the Shannon entropy, one can refer to .
In what follows, we focus only on the estimation of the Rényi-α and Tsallis-α divergences; the Kullback-Leibler, Hellinger and Bhattacharyya divergences can then be deduced immediately.
We will next provide a consistent estimator for the quantity

∫ f^α(x) g^{1−α}(x) dx,

whenever this integral is meaningful. Plugging its estimate into the appropriate formula immediately leads to consistent estimators for the divergence measures D_α and T_α.
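As a minimal numerical sketch (not from the paper; all names and the 1-d Gaussian choices are illustrative), the integral above can be approximated by a midpoint Riemann sum and plugged into the Rényi-α and Tsallis-α formulas:

```python
import math

# Illustrative sketch: approximate c = ∫ f^α g^(1-α) dx for two known
# 1-d Gaussian densities and plug it into the Rényi-α / Tsallis-α formulas.

def gauss_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def c_alpha(f, g, alpha, lo=-20.0, hi=20.0, n=40_000):
    # midpoint Riemann sum of f^alpha * g^(1-alpha) over [lo, hi]
    dx = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * dx) ** alpha * g(lo + (i + 0.5) * dx) ** (1.0 - alpha)
               for i in range(n)) * dx

f = lambda x: gauss_pdf(x, 0.0, 1.0)   # f = N(0, 1)
g = lambda x: gauss_pdf(x, 1.0, 1.0)   # g = N(1, 1)

c = c_alpha(f, g, 0.5)
d_renyi = math.log(c) / (0.5 - 1.0)    # closed form for equal variances: α(Δμ)²/(2σ²) = 0.25
t_tsallis = (c - 1.0) / (0.5 - 1.0)
```

For two Gaussians with equal variance, the Rényi-α divergence has the closed form α(μ_f − μ_g)²/(2σ²), which makes the sketch easy to check.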
Assume, for the rest of the paper, that the density f is unknown, while the density g is known and satisfies an integrability condition guaranteeing that the above integral is finite. Next, consider a sequence X_1, X_2, … of independent and identically distributed ℝ^d-valued random vectors with cumulative distribution function F and a density function f with respect to the Lebesgue measure on ℝ^d. The following conditions are needed for the remainder of this paper. To construct our divergence estimators, we start with a kernel density estimator for f and then substitute f by its estimator in the divergence-like functional of f and g. For this, we introduce a measurable kernel function K that satisfies the following conditions.
(K.1) K is of bounded variation on ℝ^d;
(K.2) K is right continuous on ℝ^d;
f_{n,h}(x) = (1 / (n h^d)) Σ_{i=1}^{n} K((x − X_i)/h),   x ∈ ℝ^d,

where h > 0 is the bandwidth sequence. Assuming that the density f is continuous, one obtains a strongly consistent estimator of f(x), that is, one has with probability 1, f_{n,h}(x) → f(x) as n → ∞. There are also results concerning uniform convergence and convergence rates. For proving such results, one usually writes the difference f_{n,h}(x) − f(x) as the sum of a probabilistic term f_{n,h}(x) − E[f_{n,h}(x)] and a deterministic term E[f_{n,h}(x)] − f(x), also called the bias. For further explanation, one can refer to [10, 12, 13], among other authors. After having estimated f, we estimate the divergence-like functional by setting
where the threshold appearing above is a sequence of positive constants. Thus, using (5), the associated divergences D_α and T_α can be estimated by:
The approach used to define the plug-in estimators is also developed in  in order to introduce a kernel-type estimator of Shannon's entropy. Some statistical properties of these divergence estimators are related to those of the kernel estimator of the continuous density f. The limiting behavior of f_{n,h}, for appropriate choices of the bandwidth h, has been widely studied in the literature; examples include the work of Devroye [6, 7], Bosq  and Prakasa Rao . In particular, under our assumptions, the condition that h → 0 together with nh^d → ∞ is necessary and sufficient for the convergence in probability of f_{n,h}(x) towards the limit f(x), independently of x and the density f. One can find other results on the uniform consistency of the estimator in [4, 10, 5] and the references therein. In the next section, we will use the methods developed in the previous references to establish convergence results for the estimates and deduce the convergence of the divergence estimators.
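The plug-in scheme described above can be sketched as follows. This is a hedged illustration, not the paper's exact estimator (which also involves a truncating sequence that we omit); it uses the identity ∫ f^α g^{1−α} dx = E_f[(f(X)/g(X))^{α−1}], estimated by a sample mean with the kernel estimate f_n plugged in for f:

```python
import math, random

# Illustrative plug-in sketch: kernel density estimate f_n, then the sample
# mean (1/n) Σ (f_n(X_i)/g(X_i))^(α-1) as an estimate of ∫ f^α g^(1-α) dx,
# plugged into the Rényi-α formula. Names and parameters are assumptions.

def norm_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def kde(sample, h):
    # one-dimensional kernel density estimate with a Gaussian kernel
    n = len(sample)
    return lambda x: sum(math.exp(-0.5 * ((x - xi) / h) ** 2)
                         for xi in sample) / (n * h * math.sqrt(2.0 * math.pi))

def renyi_plugin(sample, g, alpha, h):
    f_n = kde(sample, h)
    c_hat = sum((f_n(x) / g(x)) ** (alpha - 1.0) for x in sample) / len(sample)
    return math.log(c_hat) / (alpha - 1.0)

random.seed(2)
sample = [random.gauss(0.0, 1.0) for _ in range(1000)]  # X_i ~ f = N(0, 1)
g = lambda x: norm_pdf(x, 1.0, 1.0)                     # known g = N(1, 1)
est = renyi_plugin(sample, g, alpha=0.5, h=0.4)
# true value here: D_0.5(N(0,1), N(1,1)) = 0.5 * 1 / 2 = 0.25
```

Any bandwidth in a reasonable range gives a comparable estimate, which is the practical content of the uniform-in-bandwidth results studied below.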
3. Statistical Properties of the Estimators
We first study the strong consistency of the estimator defined in (5). Throughout the remainder of this paper, we will use the following notation, which is delicate to handle. It is given by
Lemma 1 Let K satisfy (K.1-2-3-4) and let f be a continuous bounded density. Then, for each pair of sequences , such that , together with , as n → ∞ and for any , one has with probability 1
The proof of Lemma 1 is postponed until Section 4.
Lemma 2 Let K satisfy (K.3-4) and let f be a uniformly Lipschitz and continuous density. Then, for each pair of sequences , such that , together with , as n → ∞ and for any , we have
The proof of Lemma 2 is postponed until Section 4.
Theorem 1 Let K satisfy (K.1-2-3-4) and let f be a uniformly Lipschitz, bounded and continuous density. Then, for each pair of sequences , such that , together with , as n → ∞ and for any , one has with probability 1
This, in turn, implies that
The proof of Theorem 1 is postponed until Section 4.
The following corollaries handle respectively the uniform deviation of the estimate and with respect to and .
Corollary 1 Assume that the assumptions of Theorem 1 hold. Then, we have
This, in turn, implies that
The proof of Corollary 1 is postponed until Section 4.
Corollary 2 Assume that the assumptions of Theorem 1 hold. Then, we have
This, in turn, implies that
The proof of Corollary 2 is postponed until Section 4.
Note that the divergence estimator (5) also requires an appropriate choice of the smoothing parameter h. The results given in (6), (7) and (8) show that any choice of the bandwidth between the two bounding sequences ensures the strong consistency of the underlying divergence estimators. In other words, fluctuations of the bandwidth in a small interval do not affect the consistency of the nonparametric estimators of these divergences. The work of Bouzebda and Elhattab  is very important for establishing our results; these authors considered a class of compactly supported densities and used the following additional conditions.
(F.1) has a compact support say and is -time continuously differentiable, and there exists a constant such that
(K.5) is of order , i.e., for some constant
Under (F.1) the expression may be written as follows
Theorem 2 Assume that conditions (K.1-2-3-4-5) hold and let f fulfill (F.1). Then for each pair of sequences with , as n → ∞ and for any , we have
The proof of Theorem 2 is postponed until Section 4.
Corollary 3 Assume that the assumptions of Theorem 2 hold. Then,
Corollary 4 Assume that the assumptions of Theorem 2 hold. Then, for any we have
The proofs of Corollaries 3 and 4 are given in Section 4.
Now, assume that there exists a sequence of strictly nondecreasing compact subsets of , such that  For the estimation of the support, we may refer to the references therein. Throughout, we let , where  and  are as in Corollaries 3 and 4. We choose an estimator of  in Corollaries 3 and 4 of the form
Using the techniques developed in  and Corollaries 3 and 4, one can construct asymptotic confidence intervals for the true divergences D_α and T_α.
4. Proofs of Our Results
Proof of Lemma 1. To show the strong consistency of the estimator, we use the following expression
where  is a sequence of positive constants. Define
Since the function in question is 1-Lipschitz, we then have, for ,
Therefore for , we have
where ‖·‖∞ denotes, as usual, the supremum norm, i.e., ‖f‖∞ = sup_x |f(x)|. Hence,
Using the conditions on the kernel K posed by Einmahl , consider the class of functions
For ε > 0, set , where the supremum is taken over all probability measures on (ℝ^d, B), where B represents the σ-field of Borel sets of ℝ^d, i.e., the smallest σ-field containing all the open (and/or closed) balls in ℝ^d. Here,  denotes the -metric and  is the minimal number of balls of -radius needed to cover .
We assume that satisfies the following uniform entropy condition.
(K.6) for some and ,
(K.7) The class is pointwise measurable, that is, there exists a countable sub-class such that we can find, for any function in the class, a sequence of functions in the sub-class for which
This condition is discussed in . It is satisfied whenever is right continuous.
Remark that condition (K.6) is satisfied whenever (K.1) holds, i.e., when K is of bounded variation on ℝ^d; we refer the reader to Van der Vaart and Wellner  for details on entropy conditions (see also Pakes and Pollard , and Nolan and Pollard ). Condition (K.7) is satisfied whenever (K.2) holds, i.e., when K is right continuous; this condition is discussed in  (see also  and ).
From Theorem 1 in , whenever K is measurable and satisfies (K.3-4-6-7), and when f is bounded, we have, for each pair of sequences , such that , together with  and  as n → ∞, with probability 1
Since , in view of (11) and (12), we obtain with probability 1.
This concludes the proof of the lemma.
Proof of Lemma 2.
Let  be the complement of  in ℝ^d. We have
Term . Repeating the arguments above with the formal change of  by , we show that, for any ,
On the other hand, we know (see, e.g., ) that, since the density f is uniformly Lipschitz and continuous, we have, for each sequence , with , as n → ∞,
Term . It is obvious to see that
Thus, in view of (16), we get
Finally, in view of (17) and (20), we get
This completes the proof of the lemma.
Proof of Theorem 1. We have
Combining Lemmas 1 and 2, we obtain
This concludes the proof of the theorem.
Proof of Corollary 1. Remark that
Using Theorem 1, we have
and the Corollary 1 holds.
Proof of Corollary 2. A first-order Taylor expansion of  around  gives
Remark that from Theorem 1,
which in turn, implies that
Thus, for all
and the Corollary 2 holds.
Proof of Theorem 2. Under the stated conditions, and using a Taylor expansion of order , we get, for ,
where  and  Thus, a straightforward application of the Lebesgue dominated convergence theorem gives, for n large enough,
Set, for all ,
By combining (22) and (24),
Let be a sequence of nondecreasing nonempty compact subsets of the interior of such that
Now, from (25), it is straightforward to observe that
The proof of Theorem 2 is completed.
Proof of Corollary 3. A direct application of Theorem 2 leads to Corollary 3.
Proof of Corollary 4. Here again, set, for all ,
A first order Taylor expansion of leads to
Using condition (F.1) (f is compactly supported), f is bounded away from zero on its support; thus, for n large enough, there exists , such that , for all x in the support of f. From (23), we have
By combining the last equation with (22),
The proof of Corollary 4 is completed.
5. Simulation Study
Summarizing the ideas and the results given in the previous sections, we propose to study the performance of the kernel estimators for the Hellinger (DH), Bhattacharyya (DB) and Kullback-Leibler (DK) measures and their uniform-in-bandwidth consistency.
Hellinger, Bhattacharyya and Kullback-Leibler divergences are defined respectively as follows:
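In standard notation, with BC(f, g) = ∫ √(f(x) g(x)) dx the Bhattacharyya coefficient, these are H²(f, g) = 1 − BC, B(f, g) = −log BC, and KL(f, g) = ∫ f(x) log(f(x)/g(x)) dx; up to a factor of 2, B and H² coincide with the Rényi-α and Tsallis-α divergences at α = 1/2. The following minimal numerical sketch (1-d Gaussian examples, illustrative names) checks these quantities against closed forms:

```python
import math

# Illustrative check of the Hellinger, Bhattacharyya and Kullback-Leibler
# quantities for two known 1-d Gaussian densities, via numerical integration.

def npdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def integrate(fun, lo=-25.0, hi=25.0, n=60_000):
    # midpoint Riemann sum over [lo, hi]
    dx = (hi - lo) / n
    return sum(fun(lo + (i + 0.5) * dx) for i in range(n)) * dx

f = lambda x: npdf(x, 0.0, 1.0)   # f = N(0, 1)
g = lambda x: npdf(x, 1.0, 2.0)   # g = N(1, 4) (standard deviation 2)

bc = integrate(lambda x: math.sqrt(f(x) * g(x)))   # Bhattacharyya coefficient
bhat = -math.log(bc)                               # Bhattacharyya divergence
hell2 = 1.0 - bc                                   # squared Hellinger distance
kl = integrate(lambda x: f(x) * math.log(f(x) / g(x)))

# closed form for KL(N(mu_f, s_f^2) || N(mu_g, s_g^2)):
# log(s_g/s_f) + (s_f^2 + (mu_f - mu_g)^2) / (2 s_g^2) - 1/2
kl_closed = math.log(2.0) + (1.0 + 1.0) / 8.0 - 0.5
```

The numerical BC also matches the Gaussian closed form √(2 s_f s_g / (s_f² + s_g²)) · exp(−(Δμ)²/(4(s_f² + s_g²))), a useful sanity check before running the estimators on simulated data.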
The asymptotic behavior of each bandwidth is assessed using the kernel-type estimators of the divergence criteria in Corollaries 3 and 4, respectively.
We compute, for each chosen value of α, the expressions
where the corresponding bounds are defined by
We consider an experiment in which the DGP (Data Generating Process) for the true distribution f is a mixture of two normal distributions,
and the function g is taken to be a normal density with mean 1 and variance 2.
The sample size varies from 10 to 1000, and for each size, the corresponding statistics are evaluated.
In order to plot these statistics against sample size, we need to perform three sets of experiments.
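One such experiment can be sketched as follows. This is a hedged illustration: the mixture weights and components of the DGP below are assumptions (the exact mixture is not reproduced here), g is normal with mean 1 and variance 2 as stated, and we track the error of a plug-in Hellinger-type estimate as the sample size grows from 10 to 1000:

```python
import math, random

# Illustrative simulation sketch: an assumed DGP f = 0.5 N(0,1) + 0.5 N(2,1),
# g = N(1, 2) (variance 2), and the error of a plug-in Hellinger estimate.

def npdf(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def f_true(x):   # assumed mixture DGP
    return 0.5 * npdf(x, 0.0, 1.0) + 0.5 * npdf(x, 2.0, 1.0)

def g(x):        # as in the text: mean 1, variance 2
    return npdf(x, 1.0, 2.0)

def draw_f(rng):
    return rng.gauss(0.0, 1.0) if rng.random() < 0.5 else rng.gauss(2.0, 1.0)

# true value by numerical integration of BC = ∫ sqrt(f g) dx, H² = 1 - BC
dx = 0.001
bc_true = sum(math.sqrt(f_true(i * dx - 15.0) * g(i * dx - 15.0))
              for i in range(30_000)) * dx
h_true = 1.0 - bc_true

def plugin_estimate(sample, h):
    n = len(sample)
    def f_n(x):  # Gaussian-kernel density estimate
        return sum(math.exp(-0.5 * ((x - xi) / h) ** 2)
                   for xi in sample) / (n * h * math.sqrt(2.0 * math.pi))
    # BC = E_f[sqrt(g(X)/f(X))], with f_n plugged in for f
    bc_hat = sum(math.sqrt(g(x) / f_n(x)) for x in sample) / n
    return 1.0 - bc_hat

rng = random.Random(3)
errors = {}
for n in (10, 100, 1000):
    sample = [draw_f(rng) for _ in range(n)]
    errors[n] = abs(plugin_estimate(sample, h=n ** -0.2) - h_true)
```

Plotting the recorded errors against the sample size reproduces the qualitative behavior discussed below: the discrepancy shrinks as n grows.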
The results are presented in Tables 1-3 and Figures 1-3.
Tables 1-3 show that the kernel-type estimators of the divergence measures converge rapidly to their pseudo-true values and confirm our asymptotic results. They all show that the discrepancy between the estimated and the true divergence criterion converges rapidly to zero. Similarly, in Tables 2 and 3, DB and DK converge, as expected, to zero, which is the mean of the asymptotic distribution when the estimated distribution is close to f.
Figures 1-3 show the value plots for the Hellinger, Bhattacharyya and Kullback-Leibler divergences, respectively. The preceding comments on Tables 1-3 also apply to Figures 1-3. For assessing the divergence error, it is much more revealing to graph DH, DB and DK against sample size. These plots also confirm our asymptotic results: as the sample size increases, the divergence error converges, as it should, to zero. The plots provide a great deal of information about how the sample size affects the performance of these informational criteria.
6. Concluding Remarks and Future Works
In this paper, we are concerned with the problem of nonparametric estimation of a class of divergence measures. To this end, many estimators are available; the most recent ones are the estimates developed by Bouzebda . We introduce an estimator that can be seen as a generalization of those previously suggested, in the sense that Bouzebda was only interested in the case of entropy, while we focus on the Rényi-α and Tsallis-α divergence measures. From our study, one can easily deduce Kullback-Leibler, Hellinger and Bhattacharyya nonparametric estimators. The results presented in this work are general, since the required conditions are fulfilled by a large class of densities. We mention that the estimator in (5) can be calculated by using a Monte-Carlo method under a given distribution g, together with a practical choice of the smoothing parameter.
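The Monte-Carlo evaluation under g can be sketched as follows (a hedged illustration with an assumed g = N(0, 1) and illustrative names): since ∫ f^α g^{1−α} dx = E_g[(f(Y)/g(Y))^α] for Y ~ g, the integral can be approximated by averaging over draws from the known density g.

```python
import math, random

# Illustrative Monte-Carlo evaluation of ∫ f^α g^(1-α) dx under g:
# the integral equals E_g[(f(Y)/g(Y))^α] for Y ~ g, so we average over
# draws from the known density g (here assumed to be N(0, 1)).

def npdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def mc_integral(f, alpha, m, rng):
    # average (f(Y_j)/g(Y_j))^alpha over m draws Y_j ~ g = N(0, 1)
    total = 0.0
    for _ in range(m):
        y = rng.gauss(0.0, 1.0)
        total += (f(y) / npdf(y)) ** alpha
    return total / m

rng = random.Random(4)
# sanity check with f = g: the integral equals exactly 1
val = mc_integral(npdf, 0.5, 20_000, rng)
# shifted example f = N(0.5, 1): closed form exp(-α(1-α)(Δμ)²/2) = exp(-1/32)
f_shift = lambda x: math.exp(-0.5 * (x - 0.5) ** 2) / math.sqrt(2.0 * math.pi)
val2 = mc_integral(f_shift, 0.5, 20_000, rng)
```

In practice, f would be replaced by the kernel estimate f_{n,h}, and the same averaging under g evaluates the plug-in integral without numerical quadrature.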
It would be interesting to enrich the results presented here by an additional uniformity in the supremum appearing in all our theorems; this requires nontrivial mathematics and would go well beyond the scope of the present paper. Another direction of research is to obtain results in the case where the continuous distributions f and g are both unknown. The problems and the methods described here are all inherently univariate; a natural and useful multivariate extension appears in the use of copula functions.