GIS-Based Analysis of Changing Surface Water in Rajshahi City Corporation Area Using Support Vector Machine (SVM), Decision Tree & Random Forest Technique

Abstract: Water is one of the essential natural resources. All living creatures depend on water and use it for different purposes. A large portion of the Earth is covered by salt water, but very little of its water is fresh. Freshwater is found as groundwater and surface water. Surface water is stored in waterbodies on the Earth's surface. Ponds, canals, rivers


Introduction
Water is an essential element of nature. All living things, including humans, animals and plants, depend on it. We use water for drinking, cooking, washing clothes, irrigation, industry and so on. Surface water can meet people's needs without lifting groundwater. However, many waterbodies are being filled in an unplanned way for various reasons, and many are continuously polluted by harmful human and industrial waste. The earth's hydrosphere contains about 1,386 million cubic kilometers of water. About 97.5% of this is salt water and only 2.5% is fresh water. The greater portion of the fresh water (68.7%) is in the form of ice and permanent snow in the Antarctic, the Arctic and mountain regions. A further 29.9% is stored as groundwater, and only 0.26% is concentrated in lakes, reservoirs and river systems [1].
The main aim of this research is to find the change in waterbodies in the Rajshahi City Corporation (RCC) area and, by building a dataset from that analysis, to test the dataset with different classification techniques. A Geographic Information System (GIS) is used to find the change in waterbodies. ArcGIS is used to classify the images with maximum likelihood classification and to extract features. After the dataset is prepared, Support Vector Machine (SVM), Decision Tree and Random Forest techniques are applied to it.
Because of indiscriminate earth dumping and unplanned urbanization, almost 4,000 ponds have been filled in the past few decades. Rajshahi city had 4,238 ponds, canals and wetlands in 1961; the number fell to 2,271 in 1981 and to 729 in 2000. The number is decreasing rapidly, and the city now has only 214 waterbodies [2].
Bangladesh has a huge population. The area of Rajshahi City Corporation is 95.56 square kilometers, and it has a total population of 388,811, of whom 208,525 are male and 180,286 are female. The Padma river is the main waterbody of the Rajshahi City Corporation area [3]. As the population increases, the demand for water also increases, but waterbodies are being destroyed at an alarming rate. Because the growing population uses surface water so widely, groundwater lifting is also increasing, and the volume of groundwater in storage is falling. Depletion of groundwater causes many wells to dry up. It also reduces the water in streams and lakes, degrades water quality and increases arsenic contamination in drinking water [4]. This has become an important issue. We should take steps to preserve these waterbodies and maintain a healthy environment, not only for ourselves but also for future generations.

Literature Review
Existing studies related to surface water include marking arsenic-contaminated areas with a Geographic Information System (GIS) and classifying hyperspectral remote sensing images with Support Vector Machines (SVMs). The effects of changes in the waterbodies of the Rajshahi City Corporation (RCC) area over the past few decades have been studied by survey, but GIS and other classification techniques have not been used to detect the change in waterbodies. Most of the research concerns the fluctuation of groundwater, groundwater pollution analyzed with GIS, and the quality of surface water and groundwater [5,6].
George, Geeja K., et al. studied groundwater pollution in an industrial area of Chavara Taluk in Kollam district [6]. The research was on groundwater, and surface water around the study area was affected by effluents from the industry. Water samples were analyzed, and affected areas were marked using GIS, because GIS is not only useful for data capture and processing but is also a powerful computational tool that facilitates multimap integration.
Melgani, Farid, and Lorenzo Bruzzone applied Support Vector Machines (SVMs) to hyperspectral remote sensing images. They assessed the effectiveness of SVMs with respect to conventional feature-reduction-based approaches and their performance in hypersubspaces of various dimensionalities, and they applied binary SVMs to multiclass problems in hyperspectral data [7]. Different performance indicators were used in the study. The results were obtained on a real Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) hyperspectral dataset, and they showed that SVMs are a valid and effective alternative to conventional pattern recognition approaches for the classification of hyperspectral remote sensing data.

Methodology
To find the change in surface water in the Rajshahi City Corporation area using GIS, we first collect images from USGS EarthExplorer. Multispectral imagery usually has 3 to 10 bands, and each band is obtained with a remote sensing radiometer. In ArcGIS, the images are classified with maximum likelihood classification after signature files are generated. Areas are calculated from the attribute table, and the values are converted to acres with the field calculator because area in Bangladesh is generally measured in acres. Features are extracted from the maximum likelihood classification. Based on the percentage of waterbodies, the dataset is divided into three groups. This dataset is then classified with Support Vector Machine (SVM), Decision Tree and Random Forest techniques.
Different accuracies are obtained because the techniques differ from one another.

Image Collection
Images of the Rajshahi City Corporation (RCC) area from 1987 to 2016 are downloaded to prepare a dataset. These images are collected from the United States Geological Survey (USGS) EarthExplorer. The shapefile of the study area is collected from the Rajshahi Development Authority (RDA). Landsat 4-5 Thematic Mapper (TM) images and Landsat 8 Operational Land Imager (OLI) images are collected for image classification. Not all images are of the same type because Landsat 4 was terminated on December 14, 1993, and Landsat 5 was terminated on June 5, 2013 [8]. Landsat 4-5 TM images have seven bands, and Landsat 8 images (OLI together with the Thermal Infrared Sensor) have 11 bands.

Image Classification
ArcGIS is used to classify the images. Image classification can be done with three different techniques: supervised classification, unsupervised classification and object-based analysis. After a signature file is generated, the images are classified with maximum likelihood classification. Maximum likelihood is a very popular method in remote sensing. Its advantages include that it can be developed for a wide variety of estimation situations and that it has desirable mathematical and optimality properties [9]. Maximum likelihood classification is a supervised classification technique. When instances are given with known labels, the learning is called supervised [16]. In ArcGIS, this can be done in four basic steps [17].
1. Firstly, enable the image analysis toolbar in ArcMap.
2. Secondly, select training areas by drawing polygons which denote the specific areas. For training areas, we have to multi-select polygons and merge them into a single class.
3. Thirdly, generate a signature file by merging and renaming accordingly.
4. Finally, classify using maximum likelihood classification, iso cluster, class probability or principal component.
Each of these techniques has its own advantages. Of these techniques, maximum likelihood classification is used.
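The idea behind maximum likelihood classification can be sketched in a few lines of Python. This is only an illustration and not the ArcGIS implementation: it assumes that the bands are independent (a diagonal covariance matrix), whereas a full maximum likelihood classifier uses the complete covariance matrix of each class. The function names and the two-band training pixels are hypothetical.

```python
import math

def fit_signatures(training):
    """Compute per-class band means and variances from training pixels.

    `training` maps a class name to a list of pixel vectors (one value per
    band). The result plays the role of a signature file in this sketch.
    """
    signatures = {}
    for name, pixels in training.items():
        n = len(pixels)
        bands = len(pixels[0])
        means = [sum(p[b] for p in pixels) / n for b in range(bands)]
        # A small floor on the variance avoids division by zero.
        variances = [
            max(sum((p[b] - means[b]) ** 2 for p in pixels) / n, 1e-6)
            for b in range(bands)
        ]
        signatures[name] = (means, variances)
    return signatures

def log_likelihood(pixel, means, variances):
    """Gaussian log-likelihood, assuming independent bands (a simplification)."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for x, m, v in zip(pixel, means, variances)
    )

def classify(pixel, signatures):
    """Assign the pixel to the class under which it is most likely."""
    return max(signatures, key=lambda c: log_likelihood(pixel, *signatures[c]))

# Hypothetical two-band training pixels, not taken from the study images.
training = {
    "waterbodies": [(20, 10), (22, 12), (18, 9), (21, 11)],
    "vegetation": [(40, 90), (42, 95), (38, 88), (41, 92)],
    "urban area": [(120, 100), (125, 105), (118, 98), (122, 102)],
}
signatures = fit_signatures(training)
print(classify((19, 10), signatures))
print(classify((121, 101), signatures))
```

Each training polygon contributes pixels to one class, which is why merging polygons into a single class before generating the signature matters: the class statistics are computed over all of its pixels together.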

Feature Extraction
For feature extraction, the classified images are reclassified. Each image has four features: waterbodies, vegetation, urban area and open space. In Figure 3, red is urban area, blue is waterbodies, green is vegetation and yellow is open space. These have been detected by changing the color bands of the images. After each image is reclassified, the values of the attribute table are calculated in square kilometers and then in acres. Values are converted to acres because area in Bangladesh is generally measured in acres.
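The unit conversion can be illustrated with a short Python sketch. The function names are hypothetical, and the field calculator expression actually used in ArcGIS is not given here. The sketch assumes that the attribute table stores a cell count per class and that each cell is 30 m by 30 m, the nominal resolution of the Landsat multispectral bands.

```python
SQ_M_PER_ACRE = 4046.8564224  # international acre in square meters
SQ_M_PER_SQ_KM = 1_000_000

def cell_count_to_sq_km(cell_count, cell_size_m=30.0):
    """Area covered by `cell_count` raster cells (assumed 30 m cells)."""
    return cell_count * cell_size_m ** 2 / SQ_M_PER_SQ_KM

def sq_km_to_acres(area_sq_km):
    """Convert square kilometers to acres."""
    return area_sq_km * SQ_M_PER_SQ_KM / SQ_M_PER_ACRE

# Hypothetical cell count for one class of one classified image.
water_sq_km = cell_count_to_sq_km(10_000)
print(water_sq_km, sq_km_to_acres(water_sq_km))

# The whole RCC area (95.56 square kilometers) expressed in acres.
print(sq_km_to_acres(95.56))
```

One square kilometer is roughly 247.1 acres, so the conversion is a single multiplication once the area in square kilometers is known.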

Dataset Preparation
After all the images from 1987 to 2016 are reclassified, a dataset is generated. This dataset has 25 rows and 6 columns.
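How one row of such a dataset might be assembled is sketched below in Python. Everything specific in it is an assumption made for illustration: the six-column layout (year, four feature areas and a group label), the group names and the cut-off percentages are hypothetical, since the thresholds used to form the three groups are not stated.

```python
FEATURES = ("waterbodies", "vegetation", "urban area", "open space")

def water_percentage(areas):
    """Share of the classified area covered by waterbodies, in percent."""
    total = sum(areas[f] for f in FEATURES)
    return 100.0 * areas["waterbodies"] / total

def label_group(percent, low_cut=5.0, high_cut=10.0):
    """Map a water percentage to one of three groups.

    The cut-offs are placeholders; the paper does not state its thresholds.
    """
    if percent < low_cut:
        return "low"
    if percent < high_cut:
        return "medium"
    return "high"

def build_row(year, areas):
    """Build one six-column row: year, four areas (acres) and the group."""
    row = {"year": year}
    row.update({f: areas[f] for f in FEATURES})
    row["group"] = label_group(water_percentage(areas))
    return row

# Hypothetical areas in acres for a single image, not measured values.
sample_areas = {
    "waterbodies": 1200.0,
    "vegetation": 9000.0,
    "urban area": 8000.0,
    "open space": 5800.0,
}
print(build_row(1987, sample_areas))
```

Repeating this for each of the 25 classified images would give a table of 25 rows that the classifiers can be trained and tested on.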

Classifier
To classify this type of dataset, Support Vector Machine (SVM), Decision Tree and Random Forest techniques are used.

Support Vector Machine (SVM)
A Support Vector Machine (SVM) is a discriminative classifier that is defined by a separating hyperplane. In a two-dimensional space, the hyperplane is a line dividing the plane into two parts, with each class lying on either side of it [11]. Generally, SVMs and neural networks perform better with multidimensional and continuous features, and logic-based systems perform better with discrete values [16].
SVM is a supervised machine learning algorithm that can be used for classification or regression, but it is mostly used for classification. Standard Support Vector Machines are designed for dichotomic (two-class) classification problems, but a multi-class classification problem can be solved by decomposing it into several binary problems to which the standard SVM can be applied [10]. For instance, one-against-all decomposition is normally applied. Different types of kernel tricks are used in SVM.
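Given a training set of instance-label pairs $(x_i, y_i)$, $i = 1, 2, 3, \ldots, l$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{1, -1\}$, the SVM requires the solution of the following optimization problem:

$$\min_{w, b, \xi} \; \frac{1}{2} w^T w + C \sum_{i=1}^{l} \xi_i \quad \text{subject to} \quad y_i \left( w^T \phi(x_i) + b \right) \geq 1 - \xi_i, \quad \xi_i \geq 0 \qquad (1)$$

The training vectors $x_i$ are mapped into a higher dimensional space by the function $\phi$. In this higher dimensional space, the SVM finds a linear separating hyperplane with the maximal margin. $C > 0$ is the penalty parameter of the error term. The kernel function is defined as $K(x_i, x_j) \equiv \phi(x_i)^T \phi(x_j)$. There are four basic kernels:

Linear: $K(x_i, x_j) = x_i^T x_j$ (2)
Polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, $\gamma > 0$ (3)
Radial basis function (RBF): $K(x_i, x_j) = \exp(-\gamma \lVert x_i - x_j \rVert^2)$, $\gamma > 0$ (4)
Sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$ (5)

These are the kernels of SVM, and here $\gamma$, $r$ and $d$ are kernel parameters [12].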
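The four basic kernels commonly listed for SVMs (linear, polynomial, radial basis function and sigmoid) can be written directly as small Python functions. This is a minimal sketch for illustration; the default parameter values are arbitrary assumptions and not the settings used in the experiment.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear_kernel(x_i, x_j):
    return dot(x_i, x_j)

def polynomial_kernel(x_i, x_j, gamma=1.0, r=0.0, d=3):
    return (gamma * dot(x_i, x_j) + r) ** d

def rbf_kernel(x_i, x_j, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(x_i, x_j))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(x_i, x_j, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(x_i, x_j) + r)

u, v = (1.0, 2.0), (3.0, 4.0)
print(linear_kernel(u, v))
print(polynomial_kernel(u, v, gamma=1.0, r=1.0, d=2))
print(rbf_kernel(u, v, gamma=0.1))
print(sigmoid_kernel(u, v, gamma=0.01))
```

Each function returns a similarity value for a pair of samples without ever computing the mapping into the higher dimensional space explicitly, which is the point of the kernel trick.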

Decision Tree [13]
To determine a suitable property for each node of a generated decision tree, the information gain approach is used. The attribute that has the highest information gain is selected as the test attribute of the current node. Partitioning the samples in the current node by this property reduces the degree to which different types are mixed within the generated sample subsets to a minimum.
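$S$ is a set that contains $s$ data samples. Let the samples belong to $m$ categories $C_i$ ($i = 1, 2, \ldots, m$), and let $s_i$ be the number of samples of $S$ in category $C_i$. The expected information is

$$I(s_1, s_2, \ldots, s_m) = -\sum_{i=1}^{m} p_i \log_2 p_i \qquad (6)$$

where $p_i = s_i / |S|$ is the probability that any data sample belongs to category $C_i$.
Consider that $A$ is the property which has $v$ different values $\{a_1, a_2, \ldots, a_v\}$. By using property $A$, $S$ can be divided into $v$ subsets $\{S_1, S_2, \ldots, S_v\}$, where $S_j$ contains the data samples in $S$ whose attribute $A$ equals $a_j$. If property $A$ is selected as the test used to partition the current samples, and $s_{ij}$ is the number of samples of category $C_i$ in the subset $S_j$, the information entropy is

$$E(A) = \sum_{j=1}^{v} \frac{s_{1j} + s_{2j} + \cdots + s_{mj}}{s} \, I(s_{1j}, s_{2j}, \ldots, s_{mj}) \qquad (7)$$

The obtained information gain is

$$Gain(A) = I(s_1, s_2, \ldots, s_m) - E(A) \qquad (8)$$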
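The attribute selection step can be illustrated with a short Python sketch that computes the expected information of a set of labels, the entropy after partitioning by an attribute, and the resulting information gain. The function names and the toy rows are hypothetical and serve only to show the calculation.

```python
import math
from collections import Counter

def expected_information(labels):
    """Expected information I(s_1, ..., s_m) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(rows, labels, attribute):
    """Gain(A) = I(s_1, ..., s_m) - E(A) for one attribute."""
    total = len(rows)
    subsets = {}
    for row, label in zip(rows, labels):
        # Group the labels by the value the attribute takes in each row.
        subsets.setdefault(row[attribute], []).append(label)
    entropy = sum(
        len(subset) / total * expected_information(subset)
        for subset in subsets.values()
    )
    return expected_information(labels) - entropy

def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))

# Toy samples with two discrete attributes (illustrative values only).
rows = [
    {"urban": "high", "season": "dry"},
    {"urban": "high", "season": "wet"},
    {"urban": "low", "season": "dry"},
    {"urban": "low", "season": "wet"},
]
labels = ["low", "low", "high", "high"]
print(information_gain(rows, labels, "urban"))
print(information_gain(rows, labels, "season"))
print(best_attribute(rows, labels, ["urban", "season"]))
```

In this toy example the attribute that splits the samples into pure subsets has the larger gain, so it would be chosen as the test attribute of the node.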

Random Forest
Random Forest is a supervised classification algorithm. Random forests are an effective tool in prediction, and because of the Law of Large Numbers they do not overfit. Injecting the right kind of randomness makes them accurate classifiers and regressors. Random features and random inputs produce excellent results in classification but less so in regression [14].
Instead of one decision tree, a random forest uses a collection of decision trees [15]. $\Theta$ denotes the set of possible attributes, and $h(x, \Theta)$ denotes a tree grown using $\Theta$ to classify a vector $x$. Using this notation, the random forest $f$ can be defined as

$$f = \{ h(x, \Theta_k) \}, \quad k = 1, 2, 3, \ldots, K \qquad (9)$$

where $\Theta_k \subseteq \Theta$. This means that the random forest is a collection of trees in which each tree is grown with a subset of the possible attributes. For the $k$th tree, $\Theta_k$ is randomly selected and is independent of the past random vectors $\Theta_1, \Theta_2, \ldots, \Theta_{k-1}$.
A random forest generally performs better than a single decision tree because it can make use of redundant features and of the independence of the different classifiers (trees).
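The definition of a random forest as a collection of trees, each grown with a randomly selected subset of the attributes, can be sketched in Python as follows. This is a deliberately small illustration under stated assumptions: each tree is only one level deep (a stump), every tree is grown on the full training set without bootstrap sampling, and the function names and toy data are hypothetical.

```python
import random
from collections import Counter

def grow_stump(rows, labels, attributes):
    """Grow a depth-one tree h(x, theta_k) using only the given attributes."""
    best = None
    for a in attributes:
        values = sorted({row[a] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[a] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[a] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, a, threshold, left_label, right_label)
    if best is None:
        # No attribute can split the data: always predict the majority class.
        majority = Counter(labels).most_common(1)[0][0]
        return lambda x: majority
    _, a, threshold, left_label, right_label = best
    return lambda x: left_label if x[a] <= threshold else right_label

def grow_forest(rows, labels, n_trees=15, subset_size=2, seed=0):
    """Grow n_trees stumps, each with its own random attribute subset."""
    rng = random.Random(seed)
    all_attributes = list(range(len(rows[0])))
    forest = []
    for _ in range(n_trees):
        theta_k = rng.sample(all_attributes, subset_size)  # independent draw
        forest.append(grow_stump(rows, labels, theta_k))
    return forest

def predict(forest, x):
    """Classify x by majority vote over all trees in the forest."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy data: attributes 0 and 1 separate the classes, attribute 2 is noise.
rows = [(1, 10, 5), (2, 12, 1), (3, 11, 4), (2, 9, 2),
        (8, 30, 5), (9, 32, 1), (7, 31, 3), (8, 29, 2)]
labels = ["low"] * 4 + ["high"] * 4
forest = grow_forest(rows, labels)
print(predict(forest, (2, 11, 3)))
print(predict(forest, (8, 30, 3)))
```

Because every attribute subset of size two contains at least one informative attribute in this toy data, each stump separates the two classes, and the vote is unanimous. With real data the individual trees disagree, and the majority vote is what makes the ensemble more stable than a single tree.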

Dataset Description
In this work, no prepared dataset was available, so a dataset had to be built by classifying all the images collected from USGS in ArcGIS. The dataset has six attributes. It was possible to collect 25 images, so the dataset has 25 values. This paper aims to apply different classification techniques to this dataset.

Experimental Setup
To conduct this experiment, we use three different classifiers, namely Support Vector Machine (SVM), Decision Tree and Random Forest. The experiment is performed with five-fold cross-validation because the dataset consists of only 25 values. For measuring the performance, we have selected a split ratio of 0.6, which means that 60% of the data is used for training and 40% is used for testing.
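Both evaluation schemes mentioned above can be sketched with index lists in Python. How the five-fold cross-validation and the 60/40 split were combined in the experiment is not stated, so the sketch shows each one separately; the function names and the random seed are assumptions for illustration.

```python
import random

def train_test_split_indices(n_samples, train_ratio=0.6, seed=0):
    """Shuffle sample indices and split them by the given training ratio."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_samples * train_ratio))
    return indices[:cut], indices[cut:]

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test

# With 25 samples, a 0.6 split gives 15 training and 10 test samples.
train, test = train_test_split_indices(25)
print(len(train), len(test))

# Five folds of 25 samples: each fold tests on 5 and trains on 20.
for fold_train, fold_test in k_fold_indices(25):
    print(len(fold_train), len(fold_test))
```

With so few samples, cross-validation lets every sample serve as test data exactly once, which gives a less noisy estimate than a single split.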

Result and Discussion
Precision, recall and f1-score are calculated because accuracy alone is not enough to evaluate the performance of a classification algorithm. The confusion matrix is a performance indicator that gives the number of correctly and incorrectly classified instances for each classifier. As we have three outcomes, each classifier generates a 3×3 confusion matrix. There are four elements in the confusion matrix: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). Classification report for each classifier is given below:
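How precision, recall and f1-score follow from a 3×3 confusion matrix can be shown with a short Python sketch. The matrix below is a made-up example with ten test samples and is not a result of this study; the function names are hypothetical.

```python
def per_class_metrics(matrix, class_index):
    """Compute TP, FP, FN, TN, precision, recall and F1 for one class.

    matrix[i][j] is the number of samples of true class i predicted as j.
    """
    n = len(matrix)
    tp = matrix[class_index][class_index]
    fn = sum(matrix[class_index]) - tp
    fp = sum(matrix[row][class_index] for row in range(n)) - tp
    tn = sum(map(sum, matrix)) - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}

def accuracy(matrix):
    """Fraction of samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(map(sum, matrix))

# Hypothetical confusion matrix for three classes (not the paper's results).
confusion = [
    [3, 1, 0],
    [0, 4, 0],
    [0, 1, 1],
]
for index in range(3):
    print(index, per_class_metrics(confusion, index))
print(accuracy(confusion))
```

For a multiclass problem the four elements are counted one class at a time, treating that class as positive and the other two as negative, which is why a separate precision, recall and f1-score is reported for each class.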

Conclusion
The objective of this research is to classify images in order to find the changes in the waterbodies of the Rajshahi City Corporation (RCC) area and to apply different classification techniques to the resulting dataset. Waterbodies are decreasing drastically, but the area faces floods in some years and droughts in others, so the amount of waterbodies is not constant from year to year. In this research, the dataset has only six attributes with three different classes. Support Vector Machine (SVM), Decision Tree and Random Forest are implemented on this dataset. Most research papers have addressed the quality of surface water and groundwater, the amount of arsenic in tube-well water, the fluctuation of groundwater, or the application of GIS to arsenic-contaminated areas, but generating a dataset from images and applying a data mining algorithm to that dataset has rarely been done. The images used in this research are multispectral images, as they have 3 to 10 bands, and the dataset has more than two classes, so it is a multiclass problem. Precision, recall and f1-score are calculated for the three-class problem. We found 92% accuracy with the Random Forest technique, which indicates that it performs better on this kind of dataset. This type of dataset can be used with different classification techniques. The accuracy may be increased further with a more efficient implementation.