Kernel-type estimators of divergence measures and their strong uniform consistency

In this paper, we develop kernel-type estimators of divergence measures for continuous distributions. Using empirical process techniques for consistent kernel-type function estimators, we establish a general result on the strong uniform consistency of the proposed divergence estimators.


Introduction
Given samples from two distributions, a fundamental and classical question is: how close are the two distributions? One must first specify what it means for two distributions to be close, and many different measures quantifying this closeness have been studied in the past. They are frequently called distance measures, although some of them are not strictly metrics. Divergence measures play an important role in statistical theory, especially in the theories of estimation and testing. They have been applied to different areas, such as medical image registration ([28]), classification and retrieval. In machine learning, it is often convenient to view training data as a set of distributions and to use divergence measures to estimate the dissimilarity between examples. This idea has been used in neuroscience, where the neural response pattern of an individual is modeled as a distribution and divergence measures are used to compare responses across subjects (see, e.g., [22]). Many subsequent papers in the literature have used divergence or entropy-type measures of information to test statistical hypotheses. For more examples and other possible applications of divergence measures, see the extended technical report ([30,31]). For these applications and others, it is crucial to estimate divergences accurately. The class of divergence measures is large; it includes the Rényi-α ([32,33]), Tsallis-α ([38]), Kullback-Leibler (KL), Hellinger, Bhattacharyya and Euclidean divergences, among others. These divergence measures can be related to the Csiszár divergence ([5]). The Kullback-Leibler, Hellinger and Bhattacharyya divergences are special cases of the Rényi-α and Tsallis-α divergences, the Kullback-Leibler divergence being the most popular of them. Divergence estimation and its applications have been the subject of many studies using different and specific approaches.
For example, Pardo [27] presented methods and applications in the context of discrete distributions. Exploring a nonparametric method for estimating the divergence in the continuous case, Póczos and Schneider [30] used a k-nearest-neighbor estimator and showed that one does not need a consistent density estimator to consistently estimate the Rényi-α and Tsallis-α divergences.
In the nonparametric setting, a number of authors have proposed various estimators which are provably consistent. Krishnamurthy and Kandasamy [23] corrected an initial plug-in estimator by estimating the higher-order terms in the von Mises expansion of the divergence functional. In their framework, they proposed three estimators for the Rényi-α, Tsallis-α, and Euclidean divergences between two continuous distributions and established the rates of convergence of these estimators.
The main purpose of this paper is to analyze estimators of divergence measures between two continuous distributions. Our approach is similar to that of Krishnamurthy and Kandasamy [23] and is based on a plug-in estimation scheme: first, apply a consistent density estimator for the underlying densities, and then plug them into the desired formulas. Unlike their framework, we study the strong consistency of estimators of a general class of divergence measures. We emphasize that the plug-in estimation technique was heavily used in [3,14] in the case of entropy. Bouzebda [3] proposed a method to establish consistency for kernel-type estimators of the differential entropy. We generalize this method to a large class of divergence measures in order to establish the consistency of kernel-type estimators of divergence measures when the bandwidth is allowed to range in a small interval which may decrease in length with the sample size. Our results are immediately applicable to proving strong consistency of kernel-type estimators of this class of divergence measures.
The rest of this paper is organized as follows. In Section 2, we introduce the divergence measures and construct their kernel-type estimators. In Section 3, we study the strong uniform consistency of the proposed estimators. The last section is devoted to the proofs.

Kernel-type estimators of Divergence Measures
Let us begin by standardizing notation and presenting some basic definitions. We will be concerned with two densities f, g : R^d → [0, ∞), where d ≥ 1 denotes the dimension. The divergence measures of interest, the Rényi-α and the Tsallis-α divergences, are defined respectively by

D_R^α(f, g) = (1/(α − 1)) log ∫_{R^d} f^α(x) g^{1−α}(x) dx,   (1)

D_T^α(f, g) = (1/(α − 1)) ( ∫_{R^d} f^α(x) g^{1−α}(x) dx − 1 ).   (2)

These quantities are nonnegative, and they are zero if and only if f = g almost surely (a.s.). Remark that, in the special cases α = 1/2 and α → 1, we obtain from (1) and (2) the well-known Hellinger, Bhattacharyya and Kullback-Leibler divergences; in particular,

lim_{α→1} D_R^α(f, g) = lim_{α→1} D_T^α(f, g) = ∫_{R^d} f(x) log( f(x)/g(x) ) dx,

which is the Kullback-Leibler divergence and is related to the Shannon entropy. For some statistical properties of the Shannon entropy, one can refer to [3].
In the following, we focus only on the estimation of D_T^α(f, g) and D_R^α(f, g); the Kullback-Leibler, Hellinger and Bhattacharyya divergences can then be deduced immediately. We will next provide a consistent estimator of the quantity

D_α(f, g) := ∫_{R^d} f^α(x) g^{1−α}(x) dx,

whenever this integral is meaningful. Plugging its estimate into the appropriate formulas immediately leads to consistent estimators of the divergence measures D_R^α(f, g) and D_T^α(f, g).
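As a quick numerical illustration of these quantities, the sketch below evaluates D_α(f, g) by a simple midpoint Riemann sum for two univariate Gaussian densities and then forms the Rényi-α and Tsallis-α divergences. The Gaussian example, the integration range and the grid size are our own illustrative choices, not part of the paper.

```python
import math

def d_alpha(f, g, alpha, lo=-10.0, hi=10.0, n=20000):
    """Midpoint Riemann-sum approximation of D_alpha(f, g) = int f^alpha g^(1-alpha) dx."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        total += f(x) ** alpha * g(x) ** (1.0 - alpha) * dx
    return total

def renyi(f, g, alpha):
    # D_R^alpha(f, g) = log(D_alpha(f, g)) / (alpha - 1)
    return math.log(d_alpha(f, g, alpha)) / (alpha - 1.0)

def tsallis(f, g, alpha):
    # D_T^alpha(f, g) = (D_alpha(f, g) - 1) / (alpha - 1)
    return (d_alpha(f, g, alpha) - 1.0) / (alpha - 1.0)

def gaussian(mu, sigma):
    c = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return lambda x: c * math.exp(-0.5 * ((x - mu) / sigma) ** 2)

f = gaussian(0.0, 1.0)
g = gaussian(1.0, 1.0)
# Both divergences vanish when f = g and are positive otherwise; for two
# unit-variance Gaussians and alpha = 1/2, D_R^alpha equals alpha*(mu1-mu2)^2/2 = 0.25.
print(renyi(f, f, 0.5))                    # ≈ 0
print(renyi(f, g, 0.5), tsallis(f, g, 0.5))
```

For α = 1/2 the computed Rényi divergence matches the known closed form for Gaussians, which gives a convenient sanity check of the plug-in formulas used later.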
From now on, assume that the density f is unknown, while the density g is known and satisfies an integrability condition ensuring that D_α(f, g) is finite. Next, consider X_1, ..., X_n, n ≥ 1, a sequence of independent and identically distributed R^d-valued random vectors with cumulative distribution function F and density function f(·) with respect to the Lebesgue measure on R^d. To construct our divergence estimators, we start from a kernel density estimator of f(·) and then substitute it for f(·) in the divergence-type functional of f(·). To this end, we introduce a measurable kernel function K(·) satisfying the usual conditions. Rosenblatt [34] first proposed an estimator of f(·), which Parzen [26] generalized thereafter, eventually leading to the Parzen-Rosenblatt estimator, defined, for any x ∈ R^d, by

f_{n,h_n}(x) = (1/(n h_n^d)) Σ_{i=1}^n K( (x − X_i)/h_n ),

where 0 < h_n < 1 is the bandwidth sequence. Assuming that the density f is continuous, one obtains a strongly consistent estimator f_{n,h_n} of f; that is, one has, with probability 1, f_{n,h_n}(x) → f(x). There are also results concerning uniform convergence and convergence rates. To prove such results, one usually writes the difference f_{n,h_n}(x) − f(x) as the sum of a probabilistic term f_{n,h_n}(x) − E f_{n,h_n}(x) and a deterministic term E f_{n,h_n}(x) − f(x), the so-called bias. One can refer to [15,18,20], among other authors.
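The Parzen-Rosenblatt estimator above can be sketched as follows for d = 1. The Gaussian kernel, the sample size and the bandwidth rate n^(−1/5) are illustrative choices of ours; any kernel satisfying the stated conditions would do.

```python
import math
import random

def parzen_rosenblatt(sample, h):
    """Parzen-Rosenblatt estimator f_{n,h}(x) = (1/(n h)) * sum_i K((x - X_i)/h),
    here for d = 1 with the standard Gaussian kernel K."""
    n = len(sample)
    c = 1.0 / math.sqrt(2.0 * math.pi)
    def f_hat(x):
        return sum(c * math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / (n * h)
    return f_hat

random.seed(0)
data = [random.gauss(0.0, 1.0) for _ in range(2000)]  # X_i i.i.d. with density f = N(0, 1)
h_n = 2000 ** (-1.0 / 5.0)      # h_n -> 0 and n h_n / log n -> infinity, as required
f_hat = parzen_rosenblatt(data, h_n)
print(f_hat(0.0))                # close to the true density value f(0) ≈ 0.3989
```

Evaluating f_hat at a point exposes both error terms discussed above: the fluctuation around E f_{n,h_n}(x) and the smoothing bias of order h_n^2 for this kernel.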
Having estimated f_{n,h_n}(·), we estimate D_α(f, g) by setting

D_α(f_{n,h_n}, g) := ∫_{A_{n,h_n}} f_{n,h_n}^α(x) g^{1−α}(x) dx,   (5)

where A_{n,h_n} = {x ∈ R^d : f_{n,h_n}(x) ≥ γ_n} and γ_n ↓ 0 is a sequence of positive constants. Thus, using (5), the associated divergences D_R^α(f, g) and D_T^α(f, g) can be estimated by

D_R^α(f_{n,h_n}, g) = (1/(α − 1)) log D_α(f_{n,h_n}, g),   D_T^α(f_{n,h_n}, g) = (1/(α − 1)) ( D_α(f_{n,h_n}, g) − 1 ).

The approach used to define these plug-in estimators was also developed in [3] to introduce kernel-type estimators of Shannon's entropy. The statistical properties of these divergence estimators are related to those of the kernel estimator f_{n,h_n}(·) of the continuous density f.
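Since g is known, the truncated integral in (5) can be evaluated by Monte Carlo under g, using the identity ∫ f^α g^{1−α} dx = E_{Y∼g}[(f(Y)/g(Y))^α] restricted to the set A_{n,h}. The sketch below illustrates this under our own illustrative choices (Gaussian kernel, f = g = N(0, 1), γ_n = 10^{-3}); it is not the paper's prescribed implementation.

```python
import math
import random

def kde(sample, h):
    """Parzen-Rosenblatt estimate f_{n,h} with a Gaussian kernel (d = 1)."""
    n = len(sample)
    c = 1.0 / math.sqrt(2.0 * math.pi)
    return lambda x: sum(c * math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in sample) / (n * h)

def d_alpha_hat(f_hat, g_pdf, g_sampler, alpha, gamma_n, m=2000, seed=0):
    """Monte-Carlo estimate of int_{A_{n,h}} f_{n,h}^alpha g^(1-alpha) dx,
    A_{n,h} = {f_{n,h} >= gamma_n}, via E_{Y~g}[(f_{n,h}(Y)/g(Y))^alpha]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(m):
        y = g_sampler(rng)
        fy = f_hat(y)
        if fy >= gamma_n:                 # truncation on the set A_{n,h}
            total += (fy / g_pdf(y)) ** alpha
    return total / m

alpha = 0.5
g_pdf = lambda x: math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)  # known g = N(0, 1)
g_sampler = lambda rng: rng.gauss(0.0, 1.0)

rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(500)]  # X_i ~ f, here also N(0, 1)
f_hat = kde(data, 500 ** (-1.0 / 5.0))
d = d_alpha_hat(f_hat, g_pdf, g_sampler, alpha, gamma_n=1e-3)
renyi_hat = math.log(d) / (alpha - 1.0)
tsallis_hat = (d - 1.0) / (alpha - 1.0)
print(d, renyi_hat, tsallis_hat)   # divergence estimates near 0, since f = g here
```

Because f = g in this toy setting, both plug-in divergence estimates should be close to zero, up to smoothing bias and Monte-Carlo noise.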
The limiting behavior of f_{n,h_n}(·), for appropriate choices of the bandwidth h_n, has been widely studied in the literature; examples include the work of Devroye [10,11], Bosq [2] and Prakasa Rao [29]. In particular, under our assumptions, the condition that h_n ↓ 0 together with nh_n ↑ ∞ is necessary and sufficient for the convergence in probability of f_{n,h_n}(x) towards the limit f(x), independently of x ∈ R^d and of the density f(·). Other results on the uniform consistency of the estimator f_{n,h_n}(x) can be found in [6,15,9] and the references therein. In the next section, we will use their methods to establish convergence results for the estimator D_α(f_{n,h_n}, g) and deduce convergence results for D_R^α(f_{n,h_n}, g) and D_T^α(f_{n,h_n}, g).

Statistical properties of the estimators
We first study the strong consistency of the estimator D_α(f_{n,h_n}, g) defined in (5). We shall consider another, more appropriate and computationally more convenient, centering factor than the expectation E D_α(f_{n,h_n}, g), which is delicate to handle; namely, the quantity D_α(E f_{n,h_n}, g) obtained by plugging the mean E f_{n,h_n}(·) into the divergence functional.

Lemma 1. Let K(·) satisfy the conditions above and let f(·) be a continuous bounded density. Then, for each pair of sequences (h_n)_{n≥1}, (h'_n)_{n≥1} such that 0 < h_n ≤ h'_n < 1, together with h'_n → 0 and n h_n / log n → ∞ as n → ∞, and for any α ∈ (0, 1), one has, with probability 1,

sup_{h_n ≤ h ≤ h'_n} | D_α(f_{n,h}, g) − D_α(E f_{n,h}, g) | −→ 0.

The proof of Lemma 1 is postponed until Section 5.
The following corollaries handle the uniform deviations of the estimators D_T^α(f_{n,h}, g) and D_R^α(f_{n,h}, g) with respect to D_T^α(f, g) and D_R^α(f, g), respectively.

Corollary 1.
Assume that the assumptions of Theorem 1 hold. Then we have, with probability 1,

sup_{h_n ≤ h ≤ h'_n} | D_T^α(f_{n,h}, g) − D_T^α(f, g) | −→ 0.   (7)

This, in turn, implies that lim_{n→∞} D_T^α(f_{n,h_n}, g) = D_T^α(f, g) almost surely. The proof of Corollary 1 is postponed until Section 5.

Corollary 2.
Assume that the assumptions of Theorem 1 hold. Then we have, with probability 1,

sup_{h_n ≤ h ≤ h'_n} | D_R^α(f_{n,h}, g) − D_R^α(f, g) | −→ 0.   (8)

The proof of Corollary 2 is postponed until Section 5. Note that a divergence estimator such as (5) also requires an appropriate choice of the smoothing parameter h_n. The results given in (6), (7) and (8) show that any choice of h between h_n and h'_n ensures the strong consistency of the underlying divergence estimates. In other words, fluctuations of the bandwidth within a small interval do not affect the consistency of the nonparametric estimators of these divergences.
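The bandwidth-robustness point above can be illustrated numerically: evaluating the truncated plug-in functional (5) over a range of bandwidths around a reference rate yields nearly identical values. All concrete choices below (Gaussian kernel, f = N(0, 1), g = N(0.5, 1), the grid, γ = 10^{-3}, the factor-of-two bandwidth range) are our own illustrative assumptions.

```python
import math
import random

rng = random.Random(7)
data = [rng.gauss(0.0, 1.0) for _ in range(400)]      # X_i ~ f = N(0, 1)
g = lambda x: math.exp(-0.5 * (x - 0.5) ** 2) / math.sqrt(2.0 * math.pi)  # known g = N(0.5, 1)

def kde(h):
    """Gaussian-kernel density estimate with bandwidth h (d = 1)."""
    n, c = len(data), 1.0 / math.sqrt(2.0 * math.pi)
    return lambda x: sum(c * math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data) / (n * h)

def d_alpha(h, alpha=0.5, gamma=1e-3, lo=-6.0, hi=6.0, m=1200):
    """Riemann-sum evaluation of int_{f_{n,h} >= gamma} f_{n,h}^alpha g^(1-alpha) dx."""
    f, dx = kde(h), (hi - lo) / m
    total = 0.0
    for i in range(m):
        x = lo + (i + 0.5) * dx
        fx = f(x)
        if fx >= gamma:
            total += fx ** alpha * g(x) ** (1.0 - alpha) * dx
    return total

h0 = 400 ** (-1.0 / 5.0)
vals = [d_alpha(h) for h in (0.5 * h0, h0, 2.0 * h0)]
print(vals)   # nearly identical values across the bandwidth range
```

The spread of the three values is small relative to their common level, consistent with the claim that any h in [h_n, h'_n] yields a consistent estimate.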
The work of Bouzebda and Elhattab [3] is very important for establishing our results; there, the authors introduced a class of compactly supported densities. We need the following additional conditions.

Corollary 4.
Assume that the assumptions of Theorem 2 hold. Then, for any γ > 0, we have the analogous almost-sure convergence. The proofs of Corollaries 3 and 4 are given in Section 5. Now, assume that there exists a sequence {I_n}_{n≥1} of strictly nondecreasing compact subsets of I such that I = ∪_{n≥1} I_n. For the estimation of the support I, we may refer to ([12]) and the references therein. Throughout, we let h ∈ [h_n, h'_n], where h_n and h'_n are as in Corollaries 3 and 4. We choose an estimator of ζ(I) in Corollaries 3 and 4 of the given form. Using the techniques developed in [9] together with Corollaries 3 and 4, one can construct asymptotic 100% certainty intervals for the true divergences D_T^α(f, g) and D_R^α(f, g).

Concluding remarks and future works
In this paper we have been concerned with the problem of nonparametric estimation of a class of divergence measures. For this purpose, many estimators are available; the most recent ones are the estimates developed by Bouzebda [3]. We introduce an estimator that can be seen as a generalization of those previously suggested, which were concerned with the estimation of entropy. We focus on the Rényi-α and the Tsallis-α divergence measures. From our study, one can easily deduce Kullback-Leibler, Hellinger and Bhattacharyya nonparametric estimators. The results presented in this work are general, since the required conditions are fulfilled by a large class of densities. We mention that the estimator D_α(f_{n,h_n}, g) in (5) can be computed by a Monte-Carlo method under the density g, and that a practical choice of γ_n is β(log n)^{−δ}, where β > 0 and δ ≥ 0. It would be interesting to enrich the results presented here with an additional uniformity in γ_n in the supremum appearing in all our theorems; this requires nontrivial mathematics and would go well beyond the scope of the present paper. Another direction of research is to obtain results in the case where the continuous densities f and g are both unknown. The problems and the methods described here are all inherently univariate; a natural and useful multivariate extension is the use of copula functions.

Proofs of our results
Proof of Lemma 1. To prove the strong consistency of D_α(f_{n,h_n}, g), we use the following decomposition, where A_{n,h_n} = {x ∈ R^d : f_{n,h_n}(x) ≥ γ_n} and γ_n ↓ 0 is a sequence of positive constants. Define

Δ_{n,1,h_n} := D_α(f_{n,h_n}, g) − E D_α(f_{n,h_n}, g).
(K.7) K is a pointwise measurable class; that is, there exists a countable subclass K_0 of K such that, for any function ψ ∈ K, we can find a sequence of functions {ψ_m : m ≥ 1} in K_0 with ψ_m(x) → ψ(x) for every x. This condition is discussed in [34]. It is satisfied whenever K(·) is right continuous.
Let A^c_{n,h_n} be the complement of A_{n,h_n} in R^d (i.e., A^c_{n,h_n} = {x ∈ R^d : f_{n,h_n}(x) < γ_n}). We have E D_α(f_{n,h_n}, g) − D_α(f, g) = Δ_{n,2,h_n} + Δ_{n,3,h_n}.

Term Δ_{n,2,h_n}. Repeat the arguments used above for the term Δ_{n,1,h_n}, with the formal change of f_{n,h_n} into f. We show that the analogous bound holds for any n ≥ 1. On the other hand, we know (see, e.g., [16]) that, since the density f(·) is uniformly Lipschitz continuous, the bias term tends to zero uniformly over each sequence h with h_n ≤ h ≤ h'_n < 1 and h'_n → 0 as n → ∞.

Term Δ_{n,3,h_n}. It is obvious that this term is controlled by the behavior of f on the set A^c_{n,h_n}. Thus, in view of (16), we obtain the convergence of the supremum. Finally, in view of (17) and (20), the conclusion follows. This concludes the proof of the lemma.

Proof of Theorem 1. We have

Combining Lemmas 1 and 2, we obtain the announced convergence. This concludes the proof of the theorem.

Proof of Corollary 1. Remark that D_T^α(f_{n,h}, g) − D_T^α(f, g) = ( D_α(f_{n,h}, g) − D_α(f, g) ) / (α − 1). Using Theorem 1, the right-hand side converges to 0 almost surely, uniformly in h ∈ [h_n, h'_n], and Corollary 1 holds.

Proof of Corollary 2. A first-order Taylor expansion of y ↦ log y around y_0 > 0 gives, for y > 0,

log y = log y_0 + (1/y_0)(y − y_0) + o(|y − y_0|).

Thus, for all h ∈ [h_n, h'_n],

D_R^α(f_{n,h}, g) − D_R^α(f, g) = (1/(α − 1)) (1/D_α(f, g)) ( D_α(f_{n,h}, g) − D_α(f, g) ) + o( | D_α(f_{n,h}, g) − D_α(f, g) | ).

Let J be a nonempty compact subset of the interior of I. First, note that we have the required local result from Corollary 3.1.2, p. 62, of Viallon [37] (see also [3], statement (4.16)).
Set, for all n ≥ 1, π_n(J) as the supremum of the deviation over J. One finds, by combining (22) and (24), the desired bound for lim sup_{n→∞} sup_{h_n ≤ h ≤ h'_n} √(nh) π_n(J). Let {J_ℓ}, ℓ = 1, 2, ..., be a sequence of nondecreasing nonempty compact subsets of the interior of I such that I = ∪_{ℓ≥1} J_ℓ. The proof of Theorem 2 is complete.

Proof of Corollary 3. A direct application of Theorem 2 leads to Corollary 3.

Proof of Corollary 4. Here again, set, for all n ≥ 1,