Localization Method Based on Image Processing for Autonomous Driving of Mobile Robot in the Linear Infrastructure

In recent years, the deterioration of infrastructure facilities such as bridges has become a problem. Precautionary measures such as visual inspection and repair by humans are in place as countermeasures for aging; however, there are issues with cost and safety in such inspections. If inspection by robots becomes possible, both these aspects will be improved, which will significantly contribute to the maintenance of infrastructure facilities. In this paper, we propose a complex image processing technique to specify the location of feature points as coordinates through smartphone cameras to obtain the location information of feature points needed for positioning BIREM-IV-P developed to support bridge inspection. The corners located in the bridge inspection environment are used as feature points, and the corners are specified using Harris corner detection, which is a conventional corner detection method, to obtain the position of the feature points. In addition, to compensate for the shortcomings of Harris corner detection, a line segment in the image is detected using the Hough transform, and the intersection points of the line segments are recognized as corners. By combining the results of the two detection methods in this manner, the target feature points can be accurately specified. Then, the position of the feature points of the specified image coordinate system can be changed to the world coordinate system. As a result, it was possible to detect the location of the target feature points in a three-dimensional coordinate system.


Introduction
In recent years, the aging of infrastructure facilities, such as bridges and tunnels, has become a problem. Much of the social infrastructure in Japan has been built after the high economic growth period. Although recently constructed bridges are designed to have a lifespan of 100 years, the general lifespan of bridges is known to be around 50 years. In addition, the proportion of bridges that have passed 50 years since construction is approximately 27% as of 2019, which will surge to approximately 52% after 10 years. In particular, the proportion of bridges shorter than 15 m in length will be approximately 59% in 10 years. In addition, there are about 230,000 road bridges in the country with unknown construction years. At present, preventive measures such as visual inspection and repair by people are being undertaken as measures against aging. However, the bridge inspection method is currently performed by inspectors using scaffoldings and special crane vehicles, which require considerable time and inspection cost, and the lack of inspectors and safety issues pose problems. Therefore, if inspection by robots becomes possible, these problems will be alleviated, thereby significantly contributing to the maintenance and management of infrastructure facilities.
In recent times, various robots have been developed for bridge inspection [1][2][3][4][5][6][7][8][9]. There are various types of robots, such as flying, suspension, and adsorption types, but they are mainly divided into flying and adsorption types [10][11][12][13][14][15][16][17]. Among them, flying robots are difficult to control because their balance is easily affected by disturbances such as wind and contact with the bridge. In addition, they consume power to maintain posture and position not only when moving but also when hovering; thus, they cannot conduct long-term inspections. In contrast, adsorption robots move by adsorbing onto the ceiling surface of the lower part of the bridge and, unlike flying robots, do not use excessive power to control their posture and position even while stationary for bridge inspection. Therefore, in this laboratory, the Bridge Inspection Robot Equipped with Magnets version IV (BIREM-IV) was developed as a robot for bridge inspection [18]. The robot's features include the ability to move freely on steel structures because of the use of magnetic rimless wheels and a high driving capability. However, BIREM-IV and other bridge inspection robots with these characteristics are directly controlled by the inspector; hence, there is a need to develop a fully autonomous bridge inspection robot.
The most important aspect of autonomous driving is the localization of the robot. However, it is difficult to detect the position of the robot as coordinates using the Global Navigation Satellite System (GNSS) because data from satellites cannot be received inside large structures, such as bridge box girders. Therefore, using the characteristics of the light detection and ranging (LiDAR) sensor, we developed the planar LiDAR, which detects the position of the robot as coordinates based on the distance and direction data from the feature points fixed in the experimental environment, and installed it in BIREM-IV [19]. Figure 1, Figure 2, and Table 1 present the appearance, components, and specifications of BIREM-IV-P, respectively. The manufactured planar LiDAR obtains 360° field-of-view information around the robot by rotating a compact, long-range one-dimensional LiDAR sensor with a stepping motor. In addition, the feature points of the inspection site can be set in advance using existing information on the bridge structure (design drawings and photos), and the position of the robot can be simply specified based on the feature point information. From these features, BIREM-IV-P is capable of localization using feature point information obtained from planar LiDAR. However, this method can have problems when few feature points are recognized by the planar LiDAR, and the error grows as the distance between the robot and the feature points increases. Therefore, in this study, landmarks present in the advancing direction of the robot are detected using a camera, recognized as feature points, and three-dimensional position coordinates in millimeter units are derived from the obtained two-dimensional position coordinates in pixel units. Then, the reliability of the three-dimensional position coordinates of the obtained feature points is confirmed.

Specific Feature Point Detection
In this section, we propose a method for sensing corners in the advancing direction of the robot and detecting target feature points among the detected corners. We aimed to recognize the corners in the experimental environment through image processing of the video captured using a smartphone camera (OS: Android, Model: Galaxy S6 Edge Plus, Manufacturer: Samsung Electronics Co., Ltd). The feature points are fixed in the experimental environment, and their positions are specified as coordinates. As a pioneering study in corner detection research, Moravec's corner detection method has been studied [20], based on which numerous methods have been proposed [21][22][23][24]. As shown in Figure 3, corner detection sets a small area (window) of an image, and when that area moves gradually, it is recognized as an edge or corner if the rate of change is large. If the rate of change on the x-axis or y-axis is large, it is an edge, and if both are large, it is a corner.

Traditional Corner Detection
However, Moravec's detection method clearly shows shortcomings at the edge of a diagonal line, because the diagonal line does not exhibit a large change in both the x and y directions. The Harris corner detection method was proposed to solve this problem [25].
Harris corner detection is applied after converting the camera's two-dimensional color images to grayscale. If the image converted to grayscale is denoted as $I$, and the amount of change between the point $(x, y)$ in the window and the point $(x+u, y+v)$ moved by $u$ and $v$ in the x and y directions, respectively, is $E(u, v)$, it can be expressed as
$$E(u, v) = \sum_{x, y} w(x, y)\,\bigl[I(x+u, y+v) - I(x, y)\bigr]^2 \qquad (1)$$
where $w(x, y)$ is a window function that represents a Gaussian filter. Thus, Equation (1) calculates the amount of change smoothed by the Gaussian filter. Furthermore, if this equation is approximated using the Taylor expansion, it becomes
$$E(u, v) \approx \begin{bmatrix} u & v \end{bmatrix} M \begin{bmatrix} u \\ v \end{bmatrix}$$
The matrix $M$ in this equation is
$$M = \sum_{x, y} w(x, y) \begin{bmatrix} I_x^2 & I_x I_y \\ I_x I_y & I_y^2 \end{bmatrix}$$
where $I_x$ and $I_y$ are the image derivatives in the x and y directions, respectively. When $[I_x, I_y]$ is squared, it becomes the matrix part of $M$. This matrix part describes the variation of $E$, and it is possible to determine edges or corners by singular value decomposition. However, because it is difficult to calculate complicated eigenvalues, we use the following equation:
$$R = \det(M) - k\,\bigl(\operatorname{trace}(M)\bigr)^2, \qquad \det(M) = \lambda_1 \lambda_2, \qquad \operatorname{trace}(M) = \lambda_1 + \lambda_2$$
where $\det(M)$ is the determinant of the matrix, $\operatorname{trace}(M)$ is its trace, $k$ is a constant that is generally 0.04 to 0.06, and $\lambda_1$ and $\lambda_2$ are the eigenvalues of $M$. It can be seen that the value of $R$ can be obtained from these. In addition, the eigenvalues of $M$ can be classified as shown in Figure 4. As a result, we were able to detect the corners using the Harris corner detection method. However, it was not possible to identify which of the detected corners was the target feature point.
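As an illustration of the response described above, the following is a minimal NumPy sketch and not the implementation used in the experiments. The function name `harris_response`, the window parameters `sigma` and `radius`, and the use of central differences for the image derivatives are assumptions made for this example; only the formula R = det(M) - k (trace(M))^2 and the range of k come from the text.

```python
import numpy as np

def harris_response(gray, k=0.04, sigma=1.0, radius=2):
    """Per-pixel Harris response R = det(M) - k * trace(M)**2 (illustrative sketch)."""
    gray = np.asarray(gray, dtype=float)
    # Image derivatives; np.gradient returns d/d(row) = I_y first, then d/d(col) = I_x.
    iy, ix = np.gradient(gray)

    # 1-D Gaussian window, applied separably along rows and columns (assumed size).
    offsets = np.arange(-radius, radius + 1)
    w = np.exp(-offsets**2 / (2.0 * sigma**2))
    w /= w.sum()

    def smooth(a):
        a = np.apply_along_axis(np.convolve, 0, a, w, mode="same")
        return np.apply_along_axis(np.convolve, 1, a, w, mode="same")

    # Entries of the matrix M accumulated over the Gaussian window.
    sxx, syy, sxy = smooth(ix * ix), smooth(iy * iy), smooth(ix * iy)
    det = sxx * syy - sxy**2
    trace = sxx + syy
    return det - k * trace**2
```

With this sign convention, a clearly positive response suggests a corner, a negative response suggests an edge, and a response near zero suggests a flat region, which mirrors the eigenvalue classification referred to in Figure 4.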

Corner Detection Method Specialized in Feature Detection of Proposed Target
It was possible to capture the target feature point candidates using the Harris corner detection method described in the previous section. However, it was not possible to reliably identify the landmarks as feature points. Therefore, as shown in Figure 5, the proposed method obtains the target feature points by associating the feature points obtained by Harris corner detection with the feature points obtained by Hough transformation.

Canny Edge Detection
The Canny edge detection method is often used to extract the contour of an image, and it performs noise removal, contour extraction, non-maximum suppression, and hysteresis threshold processing in sequence [26]. As a feature of Canny edge detection, it is possible to reduce the number of edges that cause noise using two thresholds, as well as the number of omissions and false detections of contours.
The input image is first smoothed using a Gaussian filter, and the smoothed image is then differentiated. The contour is detected by calculating the magnitude and direction of the gradient from the differentiated result and then performing non-maximum suppression processing and hysteresis threshold processing.
The Gaussian filter is used to smooth images. Natural smoothing can be performed by weighting the pixel values around the pixel to be processed according to the Gaussian distribution:
$$G = K_\sigma * I$$
where $G$ is the smoothed image, $I$ is the input image, $K_\sigma$ is the kernel of the Gaussian filter, and $*$ denotes convolution. The obtained smoothed image is differentiated using the Sobel filter. The horizontal differential $G_x$ and vertical differential $G_y$ obtained with the Sobel filter are expressed as
$$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * G, \qquad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} * G$$
The magnitude and direction of the gradient are calculated from the differential images. The magnitude of the gradient $|G|$ and the direction of the gradient $\theta$ are expressed as
$$|G| = \sqrt{G_x^2 + G_y^2}, \qquad \theta = \tan^{-1}\!\left(\frac{G_y}{G_x}\right)$$
To simplify the implementation of the thinning process, the contour in the gradient-magnitude image $|G|$ is first subdivided by non-maximum suppression processing. Then, the thinning process is performed to supplement the lines. Finally, hysteresis threshold processing is performed, and reliable and unreliable contours are selected using two thresholds (maximum and minimum). The evaluation of contour reliability is presented in Table 2.
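The gradient step of the Canny pipeline can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the helper names `filter3x3` and `gradient_magnitude_direction` are hypothetical, border pixels are replicated, and the kernels are applied as a correlation (which differs from convolution only in the sign of the result for these antisymmetric kernels). Non-maximum suppression and hysteresis thresholding are not included.

```python
import numpy as np

# Standard 3x3 Sobel kernels for the horizontal and vertical differentials.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """Correlate img with a 3x3 kernel, replicating the border pixels."""
    img = np.asarray(img, dtype=float)
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * padded[dy:dy + h, dx:dx + w]
    return out

def gradient_magnitude_direction(smoothed):
    """Return (|G|, theta) computed from the Sobel differentials of a smoothed image."""
    gx = filter3x3(smoothed, SOBEL_X)
    gy = filter3x3(smoothed, SOBEL_Y)
    # arctan2 keeps the quadrant and avoids division by zero when gx == 0.
    return np.hypot(gx, gy), np.arctan2(gy, gx)
```

For a vertical step edge, the horizontal differential dominates, so the gradient direction is close to zero and the magnitude peaks on the two columns adjacent to the step.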

Hough Transform
The purpose of the Hough transform is to group edge points into object candidates by performing an explicit voting procedure over a set of parameterized image objects [27]. The Hough transform is applied to the image obtained by Canny edge detection to detect the straight lines in the image. As shown in Figure 6, a straight line can be represented by two parameters, $\rho$ and $\theta$, as
$$\rho = x \cos\theta + y \sin\theta$$
In Figure 6, A is a point before the Hough transform. As shown in Figure 7, the straight lines passing through point A can be represented by $\rho$ and $\theta$. Because of the symmetry of the figure, $\theta$ is considered only in the range $0 \le \theta \le \pi$. In addition, $\rho$ is never longer than the diagonal of the input image. There are as many curves in Figure 7 as there are points processed by the Hough transform. In other words, a point B where many of the obtained curves intersect corresponds to a detected straight line, whose pair of $\rho$ and $\theta$ is strongly supported by multiple points of the input image.
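The voting procedure can be sketched in a few lines of NumPy. This is a simplified illustration, not the implementation used in the paper: the function names, the 1-pixel bin width for rho, and the 1-degree bin width for theta are assumptions. Each edge point votes along its curve rho = x cos(theta) + y sin(theta), and the accumulator cell with the most votes corresponds to point B in Figure 7.

```python
import numpy as np

def hough_accumulator(edge_points, width, height, n_theta=180):
    """Vote in (rho, theta) space for every edge point (x, y)."""
    thetas = np.arange(n_theta) * np.pi / n_theta       # sampled angles in [0, pi)
    diag = int(np.ceil(np.hypot(width, height)))        # |rho| never exceeds the diagonal
    acc = np.zeros((2 * diag + 1, n_theta), dtype=int)  # row index = rho + diag
    cols = np.arange(n_theta)
    for x, y in edge_points:
        rhos = np.rint(x * np.cos(thetas) + y * np.sin(thetas)).astype(int)
        acc[rhos + diag, cols] += 1                     # one vote per sampled theta
    return acc, thetas, diag

def strongest_line(acc, thetas, diag):
    """Return (rho, theta) of the accumulator cell with the most votes."""
    rho_idx, theta_idx = np.unravel_index(np.argmax(acc), acc.shape)
    return int(rho_idx) - diag, float(thetas[theta_idx])
```

Because neighbouring bins can tie for short line segments, the recovered angle should be read as accurate only to within roughly one bin.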

Verification Experiment
In the experiment, feature points were detected based on the proposed method using 640 × 480 pixel input images obtained from a smartphone camera. As shown in Figure 8, the input image is converted to grayscale. Harris corner detection is then applied; in the resulting figures, the positions surrounded by the white circles are detected as corners. Harris corner detection enables the detection of corners, but it can be observed that corners other than the target feature points have also been detected. Therefore, we focused on the geometric features that appear near the corners and narrowed down the target feature points. For this reason, as shown in Figure 11, noise is removed from the grayscale input image by Canny edge detection, and the Hough transform is performed to detect straight lines in the image. Then, as shown in Figure 12, the intersections of the lines obtained from the Hough transform are located, and the target feature points are among these intersections. Writing two detected lines as $y = a_1 x + b_1$ and $y = a_2 x + b_2$, their intersection $(x_c, y_c)$ is
$$x_c = \frac{b_2 - b_1}{a_1 - a_2}, \qquad y_c = a_1 x_c + b_1$$
where $b_1$ and $b_2$ are the intercepts of the two lines. In addition, only intersections with an angle $\theta$ ($0° \le \theta \le 90°$) between straight lines of 45° or more are considered to prevent the detection of unnecessary intersections as follows:
$$\theta = \tan^{-1}\left| \frac{a_1 - a_2}{1 + a_1 a_2} \right| \ge 45°$$
where $a_1$ and $a_2$ are the inclinations of the straight lines and are expressed as
$$a_i = \frac{y_{i,2} - y_{i,1}}{x_{i,2} - x_{i,1}}, \qquad i = 1, 2$$
with $(x_{i,1}, y_{i,1})$ and $(x_{i,2}, y_{i,2})$ being two points on line $i$. These conditions are applied to the lines obtained using the Hough transform to identify the intersection points of the lines. The identified intersection points are represented by green circles as shown in Figure 13. As a result, candidates for multiple feature points can be obtained from geometric features. However, it was confirmed that there are many candidates for feature points and that points other than corners are also recognized. The intersections are therefore matched against the results obtained from Harris corner detection, and the results are shown in Figure 14. Based on the above results, it was confirmed that the target corner was recognized as the feature point by combining Harris corner detection with data obtained from the Hough transform.
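The combination step can be sketched as follows, under several assumptions that are not stated in the source: lines are given in slope-intercept form (so exactly vertical lines are not handled), a Harris corner is accepted when it lies within a hypothetical pixel radius `max_dist_px` of a qualifying intersection, and all function names are illustrative.

```python
import numpy as np

def slope_intercept(p, q):
    """Slope a and intercept b of the non-vertical line through points p and q."""
    (x1, y1), (x2, y2) = p, q
    a = (y2 - y1) / (x2 - x1)
    return a, y1 - a * x1

def intersection(line1, line2):
    """Intersection of y = a1*x + b1 and y = a2*x + b2 (requires a1 != a2)."""
    (a1, b1), (a2, b2) = line1, line2
    x = (b2 - b1) / (a1 - a2)
    return x, a1 * x + b1

def angle_between_deg(a1, a2):
    """Acute angle, in degrees, between two lines with slopes a1 and a2."""
    if np.isclose(1.0 + a1 * a2, 0.0):
        return 90.0  # perpendicular lines
    return float(np.degrees(np.arctan(abs((a1 - a2) / (1.0 + a1 * a2)))))

def select_feature_points(harris_corners, lines, min_angle_deg=45.0, max_dist_px=5.0):
    """Keep Harris corners that lie near an intersection of sufficiently angled lines."""
    crossings = []
    for i in range(len(lines)):
        for j in range(i + 1, len(lines)):
            a1, a2 = lines[i][0], lines[j][0]
            if a1 != a2 and angle_between_deg(a1, a2) >= min_angle_deg:
                crossings.append(intersection(lines[i], lines[j]))
    return [c for c in harris_corners
            if any(np.hypot(c[0] - x, c[1] - y) <= max_dist_px for x, y in crossings)]
```

A more robust variant would represent each line by a point and a direction vector so that vertical lines, which are common in structural images, need no special handling.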

Correspondence Between Two-Dimensional and Three-Dimensional Points
As described in the previous section, image processing was performed using the image data obtained from the camera. The experimental results confirmed that the image coordinates of the target feature points were obtained in pixel units. However, the position in the experimental environment cannot be determined from pixel-based image coordinates alone. Therefore, this section describes the process of obtaining three-dimensional position coordinates in millimeter units from these image coordinates. The camera's posture is expressed using Euler angles, considering that the method will later need to cover three-dimensional composite paths. Figure 15 presents the posture expression by Euler angles, and the direction of rotation is clockwise with respect to the positive directions of the X-axis, Y-axis, and Z-axis. Euler angles express the posture of a rigid body using three angles $\alpha$, $\beta$, and $\gamma$. There are various conventions for Euler angles, such as ZXZ Euler angles and XYZ Euler angles, but herein we use ZYX Euler angles. With ZYX Euler angles, the body is rotated once around the Z-axis, as shown in Figure 15 (a), once around the Y-axis, as shown in Figure 15 (b), and once around the X-axis, as shown in Figure 15 (c). These rotations can be associated with roll angles, pitch angles, and yaw angles depending on the camera position.
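The ZYX Euler angles are then expressed as a rotation matrix $R$. The rotation matrices for rotations by the angles $\alpha$, $\beta$, and $\gamma$ around the Z, Y, and X axes, denoted as $R_z$, $R_y$, and $R_x$, respectively, can be expressed as
$$R_z = \begin{bmatrix} C_\alpha & -S_\alpha & 0 \\ S_\alpha & C_\alpha & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad R_y = \begin{bmatrix} C_\beta & 0 & S_\beta \\ 0 & 1 & 0 \\ -S_\beta & 0 & C_\beta \end{bmatrix}, \qquad R_x = \begin{bmatrix} 1 & 0 & 0 \\ 0 & C_\gamma & -S_\gamma \\ 0 & S_\gamma & C_\gamma \end{bmatrix}$$
Then, using the above equations, the rotation matrix of the ZYX Euler angles can be expressed as
$$R = R_z R_y R_x = \begin{bmatrix} C_\alpha C_\beta & C_\alpha S_\beta S_\gamma - S_\alpha C_\gamma & C_\alpha S_\beta C_\gamma + S_\alpha S_\gamma \\ S_\alpha C_\beta & S_\alpha S_\beta S_\gamma + C_\alpha C_\gamma & S_\alpha S_\beta C_\gamma - C_\alpha S_\gamma \\ -S_\beta & C_\beta S_\gamma & C_\beta C_\gamma \end{bmatrix}$$
where $C_\alpha = \cos\alpha$, $S_\alpha = \sin\alpha$, $C_\beta = \cos\beta$, $S_\beta = \sin\beta$, $C_\gamma = \cos\gamma$, and $S_\gamma = \sin\gamma$ for simplification.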
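The composition of the three elementary rotations can be checked numerically with a short sketch. The function names are illustrative, and the standard right-handed sign convention for each elementary rotation is assumed here.

```python
import numpy as np

def rot_z(alpha):
    """Rotation by alpha about the Z-axis."""
    c, s = np.cos(alpha), np.sin(alpha)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def rot_y(beta):
    """Rotation by beta about the Y-axis."""
    c, s = np.cos(beta), np.sin(beta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def rot_x(gamma):
    """Rotation by gamma about the X-axis."""
    c, s = np.cos(gamma), np.sin(gamma)
    return np.array([[1.0, 0.0, 0.0], [0.0, c, -s], [0.0, s, c]])

def rot_zyx(alpha, beta, gamma):
    """ZYX Euler rotation matrix R = R_z(alpha) @ R_y(beta) @ R_x(gamma)."""
    return rot_z(alpha) @ rot_y(beta) @ rot_x(gamma)
```

Setting alpha and gamma to zero reduces `rot_zyx` to `rot_y(beta)`, which is the yaw-only case used later in the real-environment experiment.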

Perspective Projection Transformation
In the perspective projection camera model, the point $P_w$ in the world coordinate system is rendered as the point $P_i$ in the image coordinate system using a smartphone camera, as shown in Figure 16. The point $P_w$ in the world coordinate system is an arbitrary three-dimensional point viewed from the camera located at the origin $O_c$ of the camera space, and the point $P_i$ is the image of point $P_w$ represented in the image coordinate system. The point $P_i$ represents a pixel of the image captured by the camera, formed where the ray from the point $P_w$ toward the optical center of the camera crosses the image plane. When the point $P_w$ in the world coordinate system is represented by the point $P_i$ in the image coordinate system, it is expressed as
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A \begin{bmatrix} R \mid t \end{bmatrix} \begin{bmatrix} X_w \\ Y_w \\ Z_w \\ 1 \end{bmatrix}$$
where $R$ denotes a 3 × 3 rotation matrix and $t$ denotes a 3 × 1 translation vector (external parameters), $A$ denotes the camera matrix (internal parameter), and $s$ denotes a constant. The external parameters represent the conversion from the world coordinate system to the camera coordinate system, and the internal parameters represent the conversion from the camera coordinate system to the image coordinate system.

External Parameters
The external parameter is a matrix that converts the world coordinate system into the camera coordinate system. If the point expressed in the world coordinate system is $P_w = [X_w, Y_w, Z_w]^T$ and the same point expressed in the camera coordinate system is $P_c = [X_c, Y_c, Z_c]^T$, it can be expressed as
$$P_c = R P_w + t$$
where $[R \mid t]$ is used to express the camera's posture and position. In addition, the camera attitude can be expressed using the three degrees of freedom of the Euler angles. Euler angles have singularities, so there are postures that cannot be expressed by a unique set of Euler angles. However, in this study, feature points are detected on the ceiling surface, and only the yaw angle direction is considered; thus, the posture can be expressed uniquely.
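The role of the external parameters can be illustrated with a minimal sketch that applies and inverts the rigid transform for the yaw-only case. The function names and the numerical values used in any call are illustrative assumptions, not values from the paper.

```python
import numpy as np

def rot_y(beta):
    """Rotation about the Y-axis (the yaw direction in the setup described here)."""
    c, s = np.cos(beta), np.sin(beta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def world_to_camera(p_w, R, t):
    """Apply the external parameters: P_c = R P_w + t."""
    return R @ np.asarray(p_w, dtype=float) + np.asarray(t, dtype=float)

def camera_to_world(p_c, R, t):
    """Invert the transform; R is orthonormal, so its inverse is R.T."""
    return R.T @ (np.asarray(p_c, dtype=float) - np.asarray(t, dtype=float))
```

Because a rotation matrix is orthonormal, the inverse mapping needs no matrix inversion, which keeps the conversion from camera coordinates back to world coordinates numerically simple.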

Internal Parameters
The internal parameter is a matrix that converts the camera coordinate system into the image coordinate system. To explain this, it is necessary to first clarify the conversion from the camera coordinate system to the normalized image coordinate system. The normalized image coordinate system considers an image plane placed at a reference position in front of the camera, with its origin at the center of the image plane.
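From Figure 17, the normalized image coordinate point $P_n$ can be expressed as
$$P_n = \begin{bmatrix} x_n \\ y_n \end{bmatrix} = \begin{bmatrix} X_c / Z_c \\ Y_c / Z_c \end{bmatrix}$$
The actual perspective projection point $P_n$ is observed at a shifted point $P_n'$ under the influence of lens distortion, as shown in Figure 18. This change is calculated based on the following lens distortion model:
$$x_n' = x_n (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + 2 p_1 x_n y_n + p_2 (r^2 + 2 x_n^2)$$
$$y_n' = y_n (1 + k_1 r^2 + k_2 r^4 + k_3 r^6) + p_1 (r^2 + 2 y_n^2) + 2 p_2 x_n y_n$$
with $r^2 = x_n^2 + y_n^2$, where $K$ is a parameter that expresses the degree of distortion, $k_1$, $k_2$, and $k_3$ represent the distortion in the radial direction, and $p_1$ and $p_2$ represent the distortion in the tangential direction. The point $P_n'$ in the normalized image coordinate system expresses the position with the center of the image plane as the origin. In contrast, in the image coordinate system, the upper left corner of the image is used as the origin and positions are expressed in pixel units, as shown in Figure 19. The point $P_i$ in the image coordinate system can generally be calculated as follows:
$$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A \begin{bmatrix} x_n' \\ y_n' \\ 1 \end{bmatrix}, \qquad A = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$$
where $A$ is called the camera matrix as described at the beginning of this section, $(f_x, f_y)$ represents the focal length in pixel units, and $(c_x, c_y)$ represents the optical center (principal point) in pixel units. Based on the above, it can be expressed as follows:
$$u = f_x x_n' + c_x, \qquad v = f_y y_n' + c_y$$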
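The chain from camera coordinates to pixel coordinates can be sketched as follows. This assumes the common five-coefficient model with three radial and two tangential terms, which matches the description in the text; the function name, the coefficient ordering `(k1, k2, p1, p2, k3)`, and any numerical values passed in are illustrative assumptions and not the calibrated values of the smartphone camera.

```python
def project_to_pixel(p_c, fx, fy, cx, cy, dist=(0.0, 0.0, 0.0, 0.0, 0.0)):
    """Map a camera-frame point (X_c, Y_c, Z_c) to pixel coordinates (u, v)."""
    k1, k2, p1, p2, k3 = dist
    # Normalized image coordinates.
    x, y = p_c[0] / p_c[2], p_c[1] / p_c[2]
    r2 = x * x + y * y
    # Radial and tangential lens distortion.
    radial = 1.0 + k1 * r2 + k2 * r2**2 + k3 * r2**3
    xd = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    yd = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    # Camera matrix: focal lengths and principal point in pixel units.
    return fx * xd + cx, fy * yd + cy
```

A point on the optical axis always maps to the principal point regardless of the distortion coefficients, which is a convenient sanity check for any implementation of this model.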

Estimation of Internal Parameters Using Camera Calibration
Camera calibration is used to estimate the focal length of the camera, the deviation of the center of the image, and the distortion of the lens. As shown in Figure 20, by using a planar chessboard with known coordinate information, the internal parameters of the camera are obtained from the correspondence between the two-dimensional points and three-dimensional points.
Based on the internal parameters estimated above, namely the camera matrix $A$ and the distortion parameter $K$, the prepared image (640 × 480) is corrected, and the result is shown in Figure 21. From Figure 21 (b), it can be confirmed that the image is corrected.

Experiment in Real Environment
In this study, the goal of the bridge inspection robot's location estimation is to recognize the position of the feature point necessary for location estimation as three-dimensional coordinates, and thereby improve the reliability of location estimation. Therefore, by combining the existing Harris corner detection method with the Hough transform to compensate for its shortcomings, the feature points located in the camera shooting direction are detected, and their positions are specified as three-dimensional coordinates. To investigate whether the results described in the previous section can be reproduced in the actual environment and how reliably they can be obtained, the experiment was conducted in the environment shown in Figures 22 and 23. As shown in Figure 22, this experimental environment is intended for a robot running on the ceiling surface, where the rotation matrices around $X_w$, $Y_w$, and $Z_w$ are defined as $R_x$, $R_y$, and $R_z$ and represent the pitch angle, yaw angle, and roll angle of the camera, respectively. In addition, as shown in Figure 23, the angles $\gamma$ and $\alpha$ in the pitch and roll directions are treated as zero, and thereby, the rotation matrix can be expressed as
$$R = \begin{bmatrix} C_\beta & 0 & S_\beta \\ 0 & 1 & 0 \\ -S_\beta & 0 & C_\beta \end{bmatrix} \qquad (24)$$
In this experiment, by providing initial conditions, the position of the target feature point $P_w$ was converted from the image coordinate system to the world coordinate system and compared with the position of the actual feature point. Regarding the conditions of the experimental environment, the calculation was performed with $\beta = 0$ (rad), $X_w = 200$ (mm), and $Y_w = 35$ (mm). In addition, the coordinate $Z_w$ on the Z-axis of the feature point $P_w$ was considered at five locations: 500, 600, 700, 800, and 900 mm. By image processing using the method proposed in the previous section, the coordinates of feature points in pixel units in the image coordinate system were detected based on the image at each position.
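The source does not spell out how the pixel coordinates are inverted to world coordinates, so the following is only one possible sketch under explicit assumptions: the pixel has already been corrected for lens distortion, the depth of the feature point along the camera's optical axis is known, the rotation is yaw-only as in Equation (24), and the focal lengths and principal point passed in are hypothetical values rather than the calibrated ones.

```python
import numpy as np

def rot_y(beta):
    """Yaw-only rotation matrix, as in Equation (24)."""
    c, s = np.cos(beta), np.sin(beta)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def pixel_to_world(u, v, depth_mm, fx, fy, cx, cy, beta=0.0, t=(0.0, 0.0, 0.0)):
    """Back-project an undistorted pixel with known camera-frame depth to world coordinates."""
    # Ray through the pixel in normalized image coordinates, scaled to the known depth.
    p_c = depth_mm * np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    # Invert P_c = R P_w + t using the orthonormality of R.
    return rot_y(beta).T @ (p_c - np.asarray(t, dtype=float))
```

For example, with hypothetical intrinsics fx = fy = 500 and a principal point of (320, 240), the pixel (520, 275) at a depth of 500 mm maps back to (200, 35, 500) mm when beta is zero and the camera sits at the world origin.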
As described above, experiments were conducted for the detection of each feature point at each position of o i , and the results are shown in Table 3.
The pixel coordinates $(u', v')$ obtained by image processing in the image coordinate system are compared with the pixel coordinates $(u, v)$ of the actual feature points in the image coordinate system, and the results are shown in Figure 24. In the figure, the red circle is the feature point in real coordinates, and the blue circle is the position of the feature point obtained from the image processing. The feature points $P_w'$ in the world coordinate system obtained by image processing are calculated as shown in Table 4. The comparison of the coordinates of the feature points $(X_w', Y_w', Z_w')$ obtained by calculation and the coordinates of the actual feature points $(X_w, Y_w, Z_w)$ is shown in Figure 25. The error distances between the position of the feature point acquired by the image processing and the actual position of the feature point, in both the image coordinate system and the world coordinate system, are shown in Table 5.
The experimental results demonstrate that we were able to identify the corners from the images captured by the smartphone camera and detect the feature points from them. In addition, by converting the pixel coordinates of the detected feature points in the image coordinate system into world coordinates, it was confirmed that the position of the feature points was specified as coordinates. The results of comparing the feature points identified from the respective positions with the feature points of actual coordinates indicated an error of 7.774 pixels on average in the image coordinate system and 31.837 mm on average in the world coordinate system. In the lower part of the bridge, which is the actual experimental environment, the distance between the feature points is within 100.0 mm, and if the error is within 21.3 pixels, the detected point can be recognized as the same feature point. Therefore, the results of this experiment were within the acceptable error range, and it was confirmed that the method proposed in this study could be useful.
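The evaluation described above amounts to a mean Euclidean error and a tolerance check, which can be sketched as follows. The function names are illustrative, and the only value taken from the text is the 21.3-pixel tolerance; any coordinates passed in are placeholders rather than the measurements of Tables 3 to 5.

```python
import numpy as np

def mean_error(estimated, actual):
    """Mean Euclidean distance between paired estimated and actual points."""
    est = np.asarray(estimated, dtype=float)
    act = np.asarray(actual, dtype=float)
    return float(np.mean(np.linalg.norm(est - act, axis=1)))

def is_same_feature(estimated_px, actual_px, tol_px=21.3):
    """True if a detected pixel position lies within the tolerance of the actual one."""
    dx, dy = np.subtract(estimated_px, actual_px)
    return bool(np.hypot(dx, dy) <= tol_px)
```

The same `mean_error` function applies to pixel coordinates and to millimeter coordinates in the world coordinate system, since it only depends on the dimensionality of the paired points.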

Conclusion
In this study, image processing was performed using a smartphone camera, and corners were detected as feature points. Based on the data obtained, and the feature points of the actual coordinates, a quantitative evaluation was performed.
The image processing was performed using Harris corner detection, which is a conventional corner detection method. However, Harris corner detection has problems, including erroneous recognition of other feature points with luminosity changes such as scratches and erroneous recognition of oblique lines. Therefore, to overcome these problems, we focused on geometric features such as the corners of the structures and attempted to capture the target feature points. For this purpose, straight lines were detected by acquiring the contours using the Canny edge detection method and performing the Hough transform.
By applying the perspective projection transformation of the pinhole camera model to the smartphone camera, the relational expression between the world coordinate system and the image coordinate system was derived, and the coordinates obtained by image processing were compared with the actual coordinates. As a result of comparing and evaluating the estimated and actual coordinates, it was found that the error of the proposed method was within the acceptable range.
In the future, the shooting position of the camera will be estimated using the robot position information obtained from the LiDAR mounted on the bridge inspection robot BIREM-IV-P. In addition, the coordinates of the feature point detected from the position will be combined with the feature point information from the LiDAR sensor. Thus, errors in the position coordinates of the feature points can be reduced, and the accuracy of localization can be improved.