Applications of Cluster Analysis Method in Surface Water Quality Assessment: A Case Study in Balihe Lake, China

Analyses of the spatial evolution and distribution of surface water quality are important for the treatment and protection of a lake's water environment. In Balihe Lake, an inland freshwater lake in east China, 7 water environmental factors were monitored at 45 sampling sites and served as the basis of this study. Cluster analysis (CA), a multivariate statistical analysis method, was used to study the spatial variation and grouping of these sampling sites based on the monitored water quality data. The results showed that the water quality characteristics at the 45 sampling sites, which were grouped into upstream, midstream and downstream clusters, depended strongly on their spatial location in the lake. The content of some nutrients was much higher in the upstream area, while the water quality of the downstream area was much better, although some water quality indicators at the outlet still did not meet the standards of the local government. The CA results may give the government some guidance on priority areas for water environment protection or treatment.


Introduction
Since the beginning of the 21st century, China's agricultural water consumption has shown a slight downward trend [1][2][3]. According to the China Water Resources Bulletin, China's total agricultural water use decreased from 378.62 billion m³ in 2000 to 374.35 billion m³ in 2011. The proportion of agricultural water use in total water use decreased from 68.8% in 2000 to 61.3% in 2011, a decrease of 7.5 percentage points. However, the decrease in agricultural water use did not improve the water environment, owing to the extensive use of fertilizers and pesticides. Large amounts of nitrogen and phosphorus entered rivers and lakes with runoff and caused serious eutrophication of water bodies, which in turn promoted algal blooms and a decrease in dissolved oxygen (DO). Normal air-water exchange and mass transfer processes were therefore affected, causing massive death of fish and other organisms in aquatic ecosystems. These dead plants and animals continued to rot in the water, which further degraded water quality [4].
It has been widely recognized that effective, long-term management of rivers requires a fundamental understanding of their hydro-morphological, chemical and biological characteristics. Because the spatial variation of water quality is often difficult to interpret, it is necessary to carry out a comprehensive water quality monitoring program and to evaluate the water environment scientifically [5][6][7].
In this study, the CA method was used to group the sampling sites of a natural freshwater lake by the spatial similarity of their water quality. The major pollutants corresponding to the clusters produced by CA were also analyzed. The application of CA, a multivariate statistical method, to water quality evaluation was thereby explored.

Study Area
Balihe Lake is located in Yingshang county, southeast of Fuyang city, Anhui province, China. Its geographical coordinates are 116.01°-116.38°E and 32.54°-32.57°N. Balihe Lake is a river-like lake with a narrow east-west structure. Its east-west length is about 15 km on average, and it covers a total area of 15.8 km². The lake is divided into upstream and downstream parts by a dike-bridge, which was constructed by the local government near the center of the lake in the east-west direction (Figure 1). As an agricultural watershed, Balihe Lake is surrounded by farmland and villages [8].
The water quality safety of Balihe Lake has been seriously threatened all year round by agricultural non-point source pollution, owing to highly intensive agricultural activities, livestock and poultry breeding, and the production and processing of sweet potato starch.
According to surveys, rural sweet potato starch wastewater contributed 65% of the COD (chemical oxygen demand) pollution load in Balihe Lake, while livestock breeding and domestic sewage contributed 18% and 15%, respectively. Livestock and poultry farming contributed the most to the NH4+-N (ammonia nitrogen) pollution load in the watershed, accounting for 67%, and starch wastewater contributed 20%. Livestock and poultry farming also contributed 45% and 44% to TN (total nitrogen) and phosphorus pollution in the watershed, respectively, while 25% and 8% came from farmland cultivation, and 25% and 8% from starch wastewater. The highest COD and NH4+-N concentrations recorded in Balihe Lake reached 234.60 mg/L and 1.96 mg/L, respectively [8].

Monitored Parameters and Analytical Methods
Water samples were collected at 45 sampling sites in October 2017. Among them, sites 1-3 are near the inflow of Balihe Lake, while sites 43-45 are near the outflow. Sites 4-24 are evenly arranged in the upstream part and sites 25-42 are distributed in the downstream part of the lake. The dike-bridge was taken as the boundary between the upstream and downstream sites (Figure 1).
Seven water quality indicators, namely COD, NH4+-N, TN, TP (total phosphorus), pH (potential of hydrogen), WT (water temperature) and Chl-a (chlorophyll a), were selected to monitor and evaluate the surface water quality of Balihe Lake. Among them, pH and WT were measured directly in situ using a multiparameter water quality monitoring instrument (YSI Pro Plus, USA). For the other indicators (COD, NH4+-N, TN, TP and Chl-a), water samples were collected 1 m below the water surface and stored in plastic bottles, and the relevant laboratory analyses were completed within 48 hours. COD was analyzed ex situ with the fast digestion spectrophotometric method and Chl-a with the spectrophotometric method. TN, NH4+-N and TP were measured with the alkaline potassium persulfate digestion-UV spectrophotometric method, Nessler's reagent colorimetric method and the ammonium molybdate spectrophotometric method, respectively. All analytical methods followed the Monitoring and Analysis Methods of Water and Wastewater (4th Edition) [9].

Statistical Analysis
Systematic (hierarchical) cluster analysis is a process of classifying objects according to their similarity. First, statistics that describe the degree of similarity between groups of data are computed. Then, based on these statistics, objects with a high degree of similarity are merged into one cluster, while objects with a low degree of similarity are placed in different clusters. Finally, a complete taxonomic tree can be drawn according to the similarity between the different groups. The similarity mentioned here is defined by the distance between two clusters. The principle of merging is that the differences between clusters should be large and the differences within a cluster should be small [10][11][12][13].
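The merging procedure described above can be illustrated with a minimal sketch. It assumes average linkage on squared Euclidean distances, uses a small hypothetical data set rather than the monitored data, and the function name `agglomerate` is introduced here only for illustration.

```python
import numpy as np

def agglomerate(points, n_clusters):
    """Naive agglomerative clustering with average linkage (an assumed choice)."""
    points = np.asarray(points, dtype=float)
    # Start with every object in its own cluster.
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the smallest mean squared distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                diffs = points[clusters[a]][:, None, :] - points[clusters[b]][None, :, :]
                dist = (diffs ** 2).sum(axis=2).mean()
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        # Merge the closest pair into a single cluster.
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical two-variable observations (not data from this study).
toy = [[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [10.0, 0.0]]
groups = agglomerate(toy, n_clusters=3)
print(groups)
```

Recording the order and distance of each merge, instead of stopping at a fixed number of clusters, would give the information needed to draw the taxonomic tree.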
CA is one of the most commonly used multivariate statistical methods, and its results are often expressed as dendrograms. Its main advantage is that the exact structure of the objects to be classified need not be known beforehand; only a batch of data is needed. Based on the selected classification statistics, the calculations can be performed step by step and a complete classification dendrogram can finally be obtained [14][15][16][17].
The basis of CA is the difference between data, that is, the calculation of distance. In hierarchical clustering, the methods for calculating the distance between clusters include the single linkage, complete linkage and average linkage methods. The squared Euclidean distance was used as the distance measure in this research.
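As a rough illustration of how the three between-cluster distance rules differ, the sketch below applies each of them to the squared Euclidean distances between two small clusters. The observations are hypothetical, and the helper names `squared_euclidean` and `cluster_distance` are introduced here for illustration only.

```python
import numpy as np

def squared_euclidean(x, y):
    """Squared Euclidean distance between two observations."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.dot(d, d))

def cluster_distance(cluster_a, cluster_b, linkage="average"):
    """Distance between two clusters under a chosen linkage rule."""
    pairwise = [squared_euclidean(x, y) for x in cluster_a for y in cluster_b]
    if linkage == "single":      # nearest pair of members
        return min(pairwise)
    if linkage == "complete":    # farthest pair of members
        return max(pairwise)
    if linkage == "average":     # mean over all cross-cluster pairs
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown linkage: {linkage}")

# Hypothetical standardized observations (not data from this study).
cluster_a = [[0.0, 0.0], [1.0, 0.0]]
cluster_b = [[3.0, 0.0], [4.0, 0.0]]
for rule in ("single", "complete", "average"):
    print(rule, cluster_distance(cluster_a, cluster_b, rule))
```

For these two clusters the cross-cluster squared distances are 9, 16, 4 and 9, so the single, complete and average rules give 4, 16 and 9.5, respectively.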

Status of Surface Water Quality of Balihe Lake
The values of the seven water quality indicators at the 45 sampling sites, obtained from the monitoring and analysis of the water samples collected from Balihe Lake in October 2017, are shown in Table 1.
According to the environmental quality standards for surface water in China (GB3838-2002) [18], grade V is the lowest level of surface water quality identified in the standard. The corresponding limit values of TN, NH4+-N, TP and COD are 2.0 mg/L, 2.0 mg/L, 0.4 mg/L and 40 mg/L, respectively. During the sampling period, however, the maximum values of TN, TP and COD in Balihe Lake were well above these limits and the maximum NH4+-N value was close to its limit, which means that the water quality in Balihe Lake was still facing tremendous challenges. The highest COD concentration in the lake was 234.60 mg/L, found at site 7 in the upstream area. The decline of COD from upstream to downstream demonstrated that Balihe Lake had a certain capacity for self-purification. TN ranged from 1.52 mg/L (site 9) to 3.52 mg/L (site 21), and NH4+-N from 0.74 mg/L (site 2) to 1.98 mg/L (sites 43 and 45). The lowest TP value was 0.42 mg/L at site 33 and the highest was 0.74 mg/L at sites 10 and 11. Generally speaking, the highest nutrient contents (e.g. TN and TP) appeared at the upstream sites, with the exception of NH4+-N, whose highest content appeared downstream (sites 43 and 45). The concentrations of COD and TP decreased gradually in the direction of flow, showing an obvious spatial pattern. The increased NH4+-N content at the outlet indicated that nitrifying bacteria were still active in Balihe Lake.
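A comparison of the reported extremes with the grade V limits can be sketched as follows. The limits and the minimum and maximum concentrations are those quoted in the text; the minimum COD is not reported there, so it is left empty, and the dictionary and function names are introduced here for illustration.

```python
# Grade V limits (mg/L) as quoted in the text from GB3838-2002.
GRADE_V_LIMITS = {"TN": 2.0, "NH4-N": 2.0, "TP": 0.4, "COD": 40.0}

# Minimum and maximum concentrations (mg/L) reported in the text.
# The minimum COD is not reported in the text, so it is left as None.
REPORTED_RANGES = {
    "TN": (1.52, 3.52),
    "NH4-N": (0.74, 1.98),
    "TP": (0.42, 0.74),
    "COD": (None, 234.60),
}

def exceedance(limits, ranges):
    """Report, per indicator, whether the minimum and maximum exceed the limit."""
    result = {}
    for name, limit in limits.items():
        low, high = ranges[name]
        result[name] = {
            "min_exceeds": None if low is None else low > limit,
            "max_exceeds": high > limit,
            "max_ratio": high / limit,  # maximum expressed as a multiple of the limit
        }
    return result

summary = exceedance(GRADE_V_LIMITS, REPORTED_RANGES)
for name, row in summary.items():
    print(f"{name}: max exceeds limit = {row['max_exceeds']}, "
          f"max/limit = {row['max_ratio']:.2f}")
```

With these values, the maxima of TN, TP and COD exceed their limits and even the lowest TP value is above 0.4 mg/L, while the maximum NH4+-N value lies just below 2.0 mg/L.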

Spatial Variations in Surface Water Quality
In this research, cluster analysis was used to analyze the monitored parameters of the 45 sampling sites in Balihe Lake. Because the parameters have different dimensions (units), they were standardized first.
The Z-score method was selected for standardization in the CA process, and the distance measure used in this study was the squared Euclidean distance. As can be seen in Figure 2, the CA results showed that the 45 sampling sites were divided into three main clusters, named cluster 1, cluster 2 and cluster 3. Sites within each cluster should have similar features and pollution source types. In the research of Sun et al., the sampling sites along the studied river were categorized into four different clusters based on the CA results of water environmental factors [19]. Although the site grouping based on the CA results in this study differed from that of other studies such as Sun et al.'s, it should be noted that the sampling sites in each cluster showed a reasonable consistency in their locations, that is, in their spatial distribution. In addition, these clusters, which were determined from water quality, might be primarily influenced by the land use surrounding the study area. As can be seen in Figure 2, the 45 sites are clustered entirely according to their spatial distribution in the lake. The sampling sites in cluster 1 are near the entrance area of Balihe Lake, that is, the upstream area. The dividing line between cluster 1 and cluster 2 is the dike-bridge. The presence of the dike-bridge narrows the water flow in the lake and increases its fluidity, which may further increase the aeration of the water.
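The standardization and clustering steps described in this section can be sketched with SciPy as below. This is only an illustration under stated assumptions: the 45 x 7 matrix is synthetic (it is not the data of Table 1), and average linkage is assumed because the text names the distance measure but not the linkage rule.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from scipy.stats import zscore

rng = np.random.default_rng(0)

# Synthetic stand-in for the 45 x 7 site-by-indicator matrix (NOT the study data):
# three groups of 15 hypothetical sites with different mean levels.
centers = np.array([0.0, 5.0, 10.0])
data = np.vstack([c + rng.normal(scale=0.5, size=(15, 7)) for c in centers])

# Z-score standardization: each column gets zero mean and unit variance.
standardized = zscore(data, axis=0)

# Condensed matrix of squared Euclidean distances between sites.
distances = pdist(standardized, metric="sqeuclidean")

# Average linkage is an assumption; the text does not name the linkage rule.
tree = linkage(distances, method="average")

# Cut the tree into three clusters and count the sites in each.
labels = fcluster(tree, t=3, criterion="maxclust")
print(np.bincount(labels)[1:])
```

The linkage matrix `tree` could be passed to `scipy.cluster.hierarchy.dendrogram` to draw a dendrogram of the kind shown in Figure 2.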
To understand the water quality characteristics of each cluster, box plots of the seven parameters were drawn and are shown in Figure 3. As can be seen, the mean COD content of cluster 1 was the highest among the three clusters, reaching 192.53 mg/L. The average TP content of cluster 1 was also the highest, at 0.64 mg/L. In cluster 2, the mean values of WT, TN and Chl-a reached their maxima, which were 18.11°C, 3.04 mg/L and 28.97 µg/L, respectively. A possible reason for the higher TN at the sites in cluster 2 was the discharge of domestic wastewater on both sides of the bridge. For cluster 3, the average concentrations of TP and COD attained their minimum values, which were 0.48 mg/L and 16.25 mg/L, respectively. The other parameters were also lower than those of cluster 1 and cluster 2, except for NH4+-N: consistent with the site-level results, the maximum mean value of NH4+-N was found in cluster 3, reaching 1.75 mg/L. All these results supported the point made above that the sampling sites in each cluster have similar features and pollution types. Considering the self-purification function of the lake, pollution source control in the upstream area should be emphasized for water environmental protection, although improvement of the structure and function of the lake ecosystem is also needed. In addition, more attention should be paid to water pollution source identification, especially using newer and more reasonable methods, to make water environmental protection work more targeted [20].
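Per-cluster means like those discussed above can be computed from the cluster labels in a few lines. The sketch below uses hypothetical COD and TP values for six sites, not the measurements of this study, and the function name `cluster_means` is introduced here for illustration.

```python
import numpy as np

def cluster_means(values, labels):
    """Mean of each column of `values` within each cluster label."""
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    return {int(k): values[labels == k].mean(axis=0) for k in np.unique(labels)}

# Hypothetical COD and TP values (mg/L) for six sites; not the study's Table 1.
values = [[190.0, 0.60], [200.0, 0.70], [80.0, 0.55],
          [90.0, 0.50], [15.0, 0.45], [17.0, 0.50]]
labels = [1, 1, 2, 2, 3, 3]

means = cluster_means(values, labels)
for cluster, row in means.items():
    print(cluster, row)
```

The same grouping by label is what a box plot per cluster summarizes, with the median and quartiles shown in place of the mean.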

Conclusions
Based on 7 monitored water quality indicators at 45 sampling sites in Balihe Lake, the surface water quality of the lake was evaluated with the CA method. From the CA results, it can be concluded that the 45 sampling sites in Balihe Lake were grouped strictly according to their spatial distribution. Because of agricultural pollution upstream, the water quality in the upstream area was the worst. Owing to the dike-bridge in the middle of the lake, the aeration of the lake water was increased and the water quality downstream was obviously improved. Although further research on water quality evaluation in Balihe Lake is needed, the results of this study revealed that both pollution source control in the upstream area and ecological restoration in the lake are necessary for the water quality to reach the set goal. The CA results of this study may offer the local government some guidance for formulating and implementing water environmental protection and treatment policy.