Fitting Models of Vulnerability to Toxicity with Generalized Linear Models

People are often exposed to toxic or hazardous elements and radiation (e.g. radioactive radon and lead) without knowing it. Toxicity often results from an individual's prolonged exposure to toxic substances. A thorough examination of individuals' blood or urine samples for the quantities of hazardous substances or elements yields multivariate data (i.e. a matrix of cases against elements) on toxicity. The pertinent response variable is usually a binary response or a count, and hence Generalized Linear Models (GLM) can be fitted to it using our proposed techniques. This paper identifies models in GLM that can be used to study toxicity when it is 'captured' as count data or as Binary Response Variables (BRV). The techniques are illustrated with a sample of data on some artisans.


Introduction
Pollution happens in various ways; environmental and occupational exposures to pollutants are usually experienced by artisans [4]. Environmental pollution can be due to inappropriate human activities (e.g. the dumping of toxic waste in residential locations) or to natural causes (e.g. the natural emission of radioactive radon in residential buildings (indoor radon)) [5] [6] [7]. In the former case, the activity can be easily stopped, whilst in the latter case, little or nothing can be done. With respect to occupational exposure, some measures can be put in place to avert its effects on humans or, at least, reduce them to the barest minimum. It is because of occupational exposure that technologists and artisans working in radioactive environments are strongly advised to display the 'symbol' for radioactivity in a conspicuous location around their laboratories and workshops respectively, to always use protective gadgets, and to advise their patrons and customers to do the same. Accidental exposure is also possible; it may happen in a mineral mining field or at a nuclear energy station such as those at Chernobyl and Fukushima [4].
When people are in a polluted environment, they are said to be exposed to dangerous (or toxic) substances (e.g. indoor radon, fungi spores, and lead). Hence toxicity in an individual often results from his/her prolonged exposure to toxic substances. The individual will be 'pronounced' toxic, with respect to the toxic substance, if the estimated quantity of the substance found in the samples (e.g. blood or urine) from his/her body is higher than the quantity that a human body can tolerate (i.e. without any associated ailments). Upon a thorough examination of some individuals' blood or urine samples for the quantities of 'hazardous' substances or elements, multivariate data (i.e. a matrix of cases against elements) on toxicity will be obtained [3].
When the response variable ($y_i$, $i = 1, 2, \ldots, n$) is count data or is dichotomous in nature, generalized linear models of toxicity can be fitted [2] [6] [7]. Exploratory Data Analysis (EDA) tools are very 'restricted' in usage, and prone to misinterpretation, in these two cases (i.e. count data and binary response variables) because, for instance, the numerical code of each BRV is either zero (0) or one (1).

Exploratory Data Analysis (EDA) and Binary Response Variables (BRV)
Any BRV (y) is necessarily dichotomous in nature. That is, it takes one of a pair of responses: yes or no, high or low, tall or short, diseased or not diseased, alive or dead, etc. BRVs are usually coded as 1 or 0 at the analyst's discretion. For instance, an analyst may adopt the coding in equations (1) and (2) for a particular work. Equations (1) and (2) are strong indications for the Bernoulli distribution, Ber(p). Because of the relationships existing amongst the Bernoulli, Binomial, Poisson and Normal distributions (i.e. members of the Exponential Family of Distributions (EFD)), it is reasonable to model the BRV with the GLM, which 'toggles' around the EFD easily. An EDA tool, however, may be an inappropriate choice because it often results in non-informative descriptive or pictorial representations. For example, a histogram or a box-plot will contain just two pictorial elements with very 'little' information on y. Likewise, the stem-and-leaf plot often reduces to just two lines of zeros and ones. Cluster analysis retains a 'little' usefulness, but it can only 'split' the responses into two clusters as well. This paper identifies models in GLM that can be used to study toxicity when it is 'captured' as count data or BRV.

Toxicity and GLM in R
GLMs are extensions of traditional regression models that allow the mean to depend on the explanatory variables through a link function (e.g. log, logit, probit, cloglog, identity, sqrt) and the response variable to follow any member of a set of distributions called the EFD. Toxicity can be studied through GLM and the R language in two ways: when the 'experimental units' or organisms are monitored to mortality, and when the 'experimental units' are just 'screened' for vulnerability to toxicity. The R function for fitting a generalized linear model is "glm()". There are many methods (or commands) for 'glm objects'; they include "summary", "coef", "resid", "predict", "anova" and "deviance" [2].
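The methods listed above can be tried on a small example. The following is a minimal sketch on simulated data; the data frame "toy", its columns and the coefficients used to simulate it are hypothetical and are not taken from the artisans' data.

```r
# Simulated count data; every name and value here is hypothetical.
set.seed(1)
n  <- 100
x1 <- runif(n)
x2 <- runif(n)
y  <- rpois(n, lambda = exp(0.5 + 1.0 * x1 - 0.5 * x2))
toy <- data.frame(y, x1, x2)

# Poisson GLM with the log link.
fit <- glm(y ~ x1 + x2, family = poisson(link = "log"), data = toy)

summary(fit)                           # coefficients, deviances and AIC
coef(fit)                              # estimated coefficients
head(resid(fit, type = "deviance"))    # deviance residuals
head(predict(fit, type = "response"))  # fitted means on the response scale
anova(fit, test = "Chisq")             # analysis of deviance table
deviance(fit)                          # residual deviance
```

Changing the "family" argument (e.g. to binomial) is all that is needed to move from count data to a BRV.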

Assumptions on Variables and General Setup for GLM in R
Throughout, we shall assume that:
1. The BRV or count data (Y (n X 1)) and their corresponding multivariate data (X (n X m)) can be represented as in equation (3).
2. Y follows one member of the EFD. We are required to estimate the parameters (β (m X 1)). Let η and µ denote the natural and mean parameterizations of the pertinent member of the EFD.
3. There is a scale parameter φ through which we can estimate over-dispersion. Over-dispersion essentially describes the situation whereby the actual V(Yi), for some i = 1, 2, …, n, exceeds the variance implied by the GLM.
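For orientation, the usual textbook components of a GLM are summarized below. This is the standard formulation and is offered as an assumption about the setup, not as a reproduction of the paper's own displayed equations; here $x_i$ denotes the $i$-th row of X, $g$ the link function and $V$ the variance function, and η is read as the linear predictor (which coincides with the natural parameter under the canonical link).

```latex
\begin{aligned}
\eta_i &= x_i^{\top}\beta, \\
g(\mu_i) &= \eta_i, \qquad \mu_i = E(Y_i), \\
\operatorname{Var}(Y_i) &= \phi\, V(\mu_i), \qquad i = 1, 2, \ldots, n.
\end{aligned}
```

For the Poisson distribution $V(\mu_i) = \mu_i$ and for the Bernoulli distribution $V(\mu_i) = \mu_i(1 - \mu_i)$, both with $\phi = 1$; an estimate of $\phi$ that is well above 1 therefore points to over-dispersion.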

When the Organisms are inspected for Information on Toxicity
Here, the organisms (i.e. experimental units), or simply "units", are assembled and samples (e.g. blood and urine samples) are taken from each of them and used to estimate the quantities of the toxic elements present in each "sampled" organism. An array of quantities in which columns are allocated to toxic elements and rows to cases (i.e. units) is the matrix X in the system of equation (3). The matrix of the response variable Y is usually unknown at the beginning, but with this technique the matrix X is used to data-mine some "hidden" information about Y. Such "data-mined" information on Y is either of BRV type or of count data type, depending on the quantity of hidden information that can be accessed through this technique. If universal "tolerance limits" exist for the units with respect to all the toxic elements, then a count data type Y is achievable; otherwise (i.e. if they exist with respect to only some or none of the toxic elements), a BRV type Y is achievable.

Determination of Matrix Y When Tolerance Limits Exist for all Toxic Elements
The unit of measurement of the quantity of a toxic element is either "ppm" or "µg/dL". The following section (3.5) contains a vivid numerical illustration of this technique. The data for this illustration were obtained through samples from artisans operating in some mechanic villages (along the Abeokuta-Ibadan expressways) around Abeokuta metropolis. The matrices Y and X are "supplied" together to R, such that the matrix Y occupies the "V1" location and the tolerance limits are excluded. The resulting "data-frame" is named "dat1". The R codes for this operation are as contained in the three commands below;
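As an illustration only, a data frame with the layout described above can be assembled as follows. This sketch uses simulated quantities and assumes, as one plausible reading of the technique, that each count in Y is the number of toxic elements whose quantity exceeds its tolerance limit; the values in "X" and "limits" are hypothetical, and these are not the commands used on the artisans' data.

```r
# Hypothetical data: 6 units (rows) by 8 toxic elements (columns).
set.seed(2)
X <- matrix(round(runif(48, 0, 10), 1), nrow = 6, ncol = 8)
limits <- rep(5, 8)  # assumed tolerance limits, one per element

# sweep() compares every column with its own limit;
# rowSums() then counts the exceedances for each unit.
exceed <- sweep(X, 2, limits, FUN = ">")
Y <- rowSums(exceed)

# Y occupies the "V1" location; the tolerance limits are left out.
dat1 <- data.frame(Y, X)
names(dat1) <- paste0("V", 1:9)
dat1
```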

Numerical Illustration on Vulnerability to Toxicity When all Tolerance Limits Exist
If the above three commands are immediately followed by
> outdat1<-glmulti(V1~V2+V3+V4+V5+V6+V7+V8+V9, data=dat1, method="g", maxit=30)
then the genetic algorithm is automatically initialized; it carries out iterations to identify the formula (model) that best "fits" the "contents" of dat1. An "extract" of the immediate response from R begins with the line "Initialization".
To further show that R's choice of "Best model" is "reliable", and to carry on, the researcher needs to give the codes below. The immediate response of R is as contained in figure 3 below. Besides the estimates of the coefficients, there are two noteworthy values in figure 6: the "residual deviance" (26.106) and the "AIC" (411.28). These two values support the claim that our model leads to the best fit. Further, if we had chosen the "simplest" model (as is usually done without the "glmulti" function), then we would have supplied, and received as output, the content of figure 4 below. Notice that:
1. The residual deviance that was 26.106 (with our best model) has now risen to 39.306 (with the usual model), and the AIC that was 411.28 has now risen to 412.48.
2. There is no "over-dispersion"; the evidence for this is contained in figure 5 below.

Figure 5. Showing that There is no Over-dispersion, Since the Variance of Y (i.e. 2.542445) is Less than Its Mean (i.e. 3.517), Which is an Unbiased Estimate of the Variance Under the Poisson Distribution.
Consequently, the fit for "dat1" is given by equation (4).
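The comparison reported in Figure 5 can be reproduced in outline as follows. This is a sketch on simulated Poisson data with hypothetical names; it adds the Pearson-based estimate of the scale parameter φ, a commonly used complementary check that is not taken from the paper.

```r
# Simulated Poisson counts; names and values are hypothetical.
set.seed(3)
n <- 200
x <- runif(n)
y <- rpois(n, lambda = exp(1 + 0.3 * x))
fit <- glm(y ~ x, family = poisson)

# Raw check in the spirit of Figure 5: sample mean against sample variance.
c(mean = mean(y), variance = var(y))

# Pearson estimate of the scale parameter phi;
# values well above 1 suggest over-dispersion.
phi_hat <- sum(resid(fit, type = "pearson")^2) / df.residual(fit)
phi_hat

# The quasi-Poisson family reports the same estimate as its dispersion.
fit_q <- glm(y ~ x, family = quasipoisson)
summary(fit_q)$dispersion
```

The raw mean-variance comparison ignores the explanatory variables, whereas the Pearson estimate is conditional on the fitted model, so the two checks need not agree exactly.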

Further Diagnostic Checks on the Fit for Cases in Which all Tolerance Limits Exist
There are other diagnostic checks that corroborate the claim that equation (4) is the best fit for the data (dat1). Some of them are:
1. The plot of Y against the residuals, which gives figure 3.
2. By comparing the AIC for out1 (figure 6) with the AIC for out2 (figure 7), we can easily see that the fit for out1 is better than that for out2. The pairs of statistics in figure 7 also testify to this fact.
The corresponding numerical values, for the entire data (dat1), are obtained with the command "log10(fitted(out1))". An extract of the probabilities is contained in figure 9 below (i.e. taken to just four decimal places);
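The AIC and residual-deviance comparison described above can be sketched with base R alone. The two models below are fitted to simulated data; "fit_a" and "fit_b" are hypothetical stand-ins for out1 and out2, and the formulas are assumptions chosen only to give one model with an interaction term and one purely additive model.

```r
# Simulated counts with one interaction effect; all names are hypothetical.
set.seed(4)
n  <- 150
x1 <- runif(n)
x2 <- runif(n)
x3 <- runif(n)
y  <- rpois(n, lambda = exp(0.2 + 0.8 * x1 + 0.6 * x1 * x2))
toy <- data.frame(y, x1, x2, x3)

fit_a <- glm(y ~ x1 + x1:x2, family = poisson, data = toy)    # with interaction
fit_b <- glm(y ~ x1 + x2 + x3, family = poisson, data = toy)  # additive model

# Smaller residual deviance and smaller AIC indicate the better fit.
comparison <- data.frame(
  model    = c("fit_a", "fit_b"),
  deviance = c(deviance(fit_a), deviance(fit_b)),
  AIC      = c(AIC(fit_a), AIC(fit_b))
)
comparison

# Residuals for a plot against Y, and the fitted values on the log10 scale.
head(resid(fit_a, type = "deviance"))
head(round(log10(fitted(fit_a)), 4))
```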

Determination of Matrix Y When Tolerance Limits Exist for Some or no Toxic Elements
The response matrix Y (i.e. a BRV) is determined through the matrix X, which is used to data-mine it, using the following technique: the elements whose tolerance limits exist are "assumed" to be the "main" variables, whilst all other variables are "assumed" to be "auxiliary". In the data matrix X (i.e. in equation 3), the first toxic element is actually "lead". The human body does not possess any tolerance for lead (i.e. no matter how small the quantity of lead, it is still hazardous to man). However, Nriagu et al. ... in his/her blood. The mechanic villages from which the data in matrix X were obtained are on the two existing Abeokuta-Ibadan expressways; hence the value 15.1 µg/dL is quite useful in the present work. Suppose that, as the children grow to be adults, they acquire more "environmental" lead, say about 4.9 µg/dL, into their body systems. Then a non-artisan adult in Ibadan and its environs is expected to have, on the average, 20 µg/dL of lead in his/her blood. Consequently, if an artisan has above this quantity (i.e. 20 µg/dL) in his/her blood, we can "safely" assume that the additional quantity is due to occupational toxicity. The value 20 µg/dL was therefore utilized as the lead (i.e. the main element) toxicity limit for the artisans.
By determining toxicity limits for the auxiliary variables (or by using their estimated population means as toxicity limits), the auxiliary variables were used to "fine-tune" Y in the following sense. If four or more of the auxiliary variables are above their toxicity limits, the corresponding y_i, i = 1, 2, …, n, that was formerly 0 becomes 1. Also, a y_i that was 1 before can become 0 if its main element is within +10 over its toxicity limit and only very few, say one, of its auxiliary variable values exceed their toxicity limits. This yields the concatenated matrix whose extract is given in figure 10.
Figure 10. Showing an Extract of the Concatenated Matrix (Y:X) Which is now the "Input" to R (Below the Line are the Toxicity Limits).
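The fine-tuning rule described above can be written down compactly. The sketch below uses a small hypothetical matrix in which column 1 is lead and columns 2 to 8 are auxiliary elements; the auxiliary limits are invented, "within +10" is read as "at most 10 units above the limit", and "very few" is read as "at most one". These readings are assumptions.

```r
lead_limit <- 20        # main-element toxicity limit (µg/dL), as in the text
aux_limits <- rep(5, 7) # hypothetical limits for the 7 auxiliary elements

# Hypothetical cases: column 1 is lead, columns 2-8 are auxiliary elements.
X <- rbind(
  c(25, 1, 1, 1, 1, 1, 1, 1),  # lead 5 over the limit, no auxiliary exceedance
  c(35, 1, 1, 1, 1, 1, 1, 1),  # lead 15 over the limit
  c(10, 9, 9, 9, 9, 1, 1, 1),  # lead below the limit, four auxiliaries exceed
  c(10, 9, 1, 1, 1, 1, 1, 1),  # lead below the limit, one auxiliary exceeds
  c(22, 9, 9, 1, 1, 1, 1, 1)   # lead 2 over the limit, two auxiliaries exceed
)

main_high <- X[, 1] > lead_limit
n_aux <- rowSums(sweep(X[, -1, drop = FALSE], 2, aux_limits, FUN = ">"))

# Initial coding from the main element, then the two fine-tuning steps.
y <- as.integer(main_high)
y[!main_high & n_aux >= 4] <- 1L                             # 0 becomes 1
y[main_high & X[, 1] <= lead_limit + 10 & n_aux <= 1] <- 0L  # 1 becomes 0
y  # 0 1 1 0 1
```

Both adjustments are applied against the initial lead-based coding, so the order of the two assignment lines does not matter.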

Numerical Illustration on Vulnerability to Toxicity When Tolerance Limits Exist for Some or no Toxic Elements
The corresponding data-frame is "dat3". Here Y = y_i = 0 or 1, where y_i = 0 denotes that the vulnerability to toxicity is "relatively" low in that particular case when compared with the other cases in the data, whilst y_i = 1 denotes that it is relatively high. We now proceed as before to obtain the best model that fits the data (figure 12):
> outdat3<-glmulti(V1~V2+V3+V4+V5+V6+V7+V8+V9, data=dat3, method="g", family=binomial)
The response from R again begins with the line "Initialization". We now continue with a couple of commands in figure 12. The result is as contained in figure 13 below.
Figure 13. Showing the Result of the Fit for "dat3".
The vulnerabilities to toxicity of all the artisans are obtained together through the use of the following three commands (figure 14);
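A minimal sketch of how predicted vulnerabilities can be obtained from a binomial GLM is given below. It uses simulated data with hypothetical names and assumes the default logit link; it is not a reproduction of the three commands in figure 14.

```r
# Simulated binary responses; names and coefficients are hypothetical.
set.seed(5)
n  <- 120
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, size = 1, prob = plogis(-0.3 + 1.2 * x1 - 0.8 * x2))
toy <- data.frame(y, x1, x2)

fit <- glm(y ~ x1 + x2, family = binomial, data = toy)

# Linear predictor for every case, then the logistic transformation.
eta   <- predict(fit, type = "link")
p_hat <- exp(eta) / (1 + exp(eta))

head(round(p_hat, 4))  # predicted probabilities to four decimal places
```

The same values are returned directly by fitted(fit) or by predict(fit, type = "response").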

Conclusions
The following conclusions can be reached on the entire work. Issuing the command
> anova(out1, test="Chisq")
generates the "analysis of deviance table" associated with the best fit (frame 11). This gives the researcher some hints about probable interacting toxic elements. Although the response Y, for the case in which toxicity limits exist for some toxic elements, has been coded with 0 and 1, it will still work if it is coded as "FALSE" and "TRUE" (i.e. such that FALSE=0, TRUE=1). These results ought to enhance the effectiveness of awareness campaigns informing artisans of the need to always put on their respective "safety" gadgets whenever they are at work. Artisans with high (i.e. TRUE) vulnerabilities to toxicity will know that they really have to exercise as much caution as possible. The results with respect to this coding technique are as contained in figure 16 and figure 17.
Figure 16. Showing an Extract of "dat4" that was Supplied to R.
The dat4 was followed by the command
> outdat4<-glmulti(V1~V2+V3+V4+V5+V6+V7+V8+V9, data=dat4, method="g", family=binomial)
which initiates the iterations to determine the best model that fits the data (dat4); an extract of the result is contained in figure 20. The predicted vulnerability to toxicity takes the logistic form exp(ci)/(1+exp(ci)). An extract of the result is contained in frame 14.
Figure 19. An Extract of the Result on Predicted Vulnerability to Toxicity.
Either approach, depending on the type of data the researcher has (i.e. BRV or count data), could be adopted for any survey work on vulnerability to toxicity.
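The equivalence of the two codings, and the analysis of deviance table, can be checked on a small example. The sketch below uses simulated data with hypothetical names; "y01" holds the 0/1 coding and "ylog" the FALSE/TRUE coding of the same response.

```r
# Simulated binary responses; names and coefficients are hypothetical.
set.seed(6)
n   <- 120
x1  <- rnorm(n)
x2  <- rnorm(n)
y01 <- rbinom(n, size = 1, prob = plogis(0.4 * x1 + 0.9 * x2))
toy <- data.frame(y01, ylog = (y01 == 1), x1, x2)

fit01  <- glm(y01 ~ x1 + x2, family = binomial, data = toy)   # 0/1 coding
fitlog <- glm(ylog ~ x1 + x2, family = binomial, data = toy)  # FALSE/TRUE coding

# Both codings lead to the same estimated coefficients.
all.equal(coef(fit01), coef(fitlog))

# Sequential analysis of deviance with chi-squared tests.
anova(fit01, test = "Chisq")
```

The table adds terms in the order in which they appear in the formula, so the test for each term is conditional on the terms listed before it.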