Relationship Between Urban Road Traffic Characteristics and Road Grade Based on a Time Series Clustering Model: A Case Study in Nanjing, China

With the increasing number of vehicles in large- and medium-sized cities, urban traffic management, control, and road planning face growing challenges. Taxi GPS trajectory data is a novel data source that can be used to study the potential dynamic traffic characteristics of urban roads, and thus identify locations that show a notable lack of road planning. Considering that road traffic characteristics on their own are insufficient for a comprehensive understanding of urban traffic, we develop a road traffic characteristic time series clustering model to analyze the relationship between urban road traffic characteristics and road grade based on existing taxi trajectory data. We select the main urban area of Nanjing as our study area and use the taxi trajectory data of a single month for evaluating our method. The experiments show that the clustering model exhibits good performance and can be successfully used for road traffic characteristic classification. Moreover, we analyze the correlation between traffic characteristics and road grade to identify road segments with planning designs that do not match the actual traffic demands.


Introduction
In Chinese cities, roads have come under increasing traffic pressure in recent years as the number of private cars has grown rapidly, which causes many traffic problems. This situation is particularly critical in large cities. Studying the traffic characteristics of urban roads is therefore important, because it enables the traffic management department to grasp road traffic conditions, plan road usage scientifically, and relieve traffic pressure on urban roads. This study focuses on the traffic characteristics of urban roads, which are particularly important for urban planning and traffic monitoring. We use the average speed of each road segment over different periods to measure the condition of road traffic, classify road traffic characteristics, and then analyze their correlation with the road grade.

Related studies on taxi trajectory data
Taxis, which comprise a significant proportion of city traffic, can be abundant sources of information, because their GPS devices usually record data such as coordinates, speed, direction, time, and whether passengers are being transported. In addition, the taxi trajectory itself contains massive information that can be mined for additional information related to various aspects. In recent years, research based on taxi data has included exploring the structure of urban areas, mining city traffic information, exploring human activity patterns, and mining taxi driver experience. Liu et al. (2009) examined the feasibility of using taxi dispatch systems as probes for real-time traffic surveillance. Qi et al. (2011) identified the social function of different regions of a city by using taxi tracking data, and found that the number of passengers travelling to and from a region can reveal the dynamics of social activities in that region. Liu et al. (2015) attempted to reveal travel patterns and city structure by using taxi trajectory data. Cui et al. (2016) discussed the accessibility of urban road networks. Zhan et al. (2013) estimated the hourly average of urban link travel times using taxicab origin-destination trip data. Chen et al. (2011) put forward a space-time GIS approach based on time-geographic concepts for exploring activity diary data in space and time. Zhuang et al. (2012) proposed an empirical route-planning framework that explores driver experience in terms of route choices to establish a database of experienced routes. In addition, researchers have explored updating road networks, traffic prediction, and simulation by using taxi trajectory data (Lee et al., 2009).
However, taxi data are affected by the precision of GPS, and it is difficult to directly locate the sampling point accurately. In order to solve this problem, map matching needs to be performed. A summary of 35 different map matching algorithms has been published (Quddus et al., 2007). Each algorithm suits a particular scenario; thus, a universal algorithm capable of solving all types of related problems does not yet exist. Methods based on statistical theory, such as those using Kalman filters, fuzzy logic, and Bayesian inference, improve on the matching accuracy of the classical map matching method, but their complexity makes the overall process too complicated. Thus, we adopt the more common matching method known as 'point-to-line'.
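As a rough illustration of the 'point-to-line' idea, the sketch below snaps a GPS point to the nearest road segment by orthogonal projection. It assumes planar (projected) coordinates and straight two-point segments; the function names and the `segments` dictionary layout are hypothetical, and the matching procedure actually used in the study may apply further checks (for example, on driving direction).

```python
import math

def project_to_segment(p, a, b):
    """Return (closest_point, distance) from point p to segment a-b.

    Coordinates are assumed to be planar (e.g. metres in a projected CRS).
    """
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:
        t = 0.0  # degenerate segment: a and b coincide
    else:
        # Parameter of the orthogonal projection, clamped to the segment ends.
        t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
        t = max(0.0, min(1.0, t))
    cx, cy = ax + t * dx, ay + t * dy
    return (cx, cy), math.hypot(px - cx, py - cy)

def match_point(p, segments):
    """Match p to the nearest segment.

    segments: dict mapping a (hypothetical) road id to a pair of end points.
    Returns (road_id, snapped_point, distance).
    """
    best = None
    for road_id, (a, b) in segments.items():
        snapped, dist = project_to_segment(p, a, b)
        if best is None or dist < best[2]:
            best = (road_id, snapped, dist)
    return best
```

A brute-force scan over all segments is used here for clarity; for a network of a few thousand roads and very many GPS records, a spatial index would normally be placed in front of it.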

Traffic information time series
Many methods for time series clustering have been proposed; these can be categorized into three types: raw-data-based, feature-based, and model-based approaches (Liao, 2005). Methods that cluster raw time series data directly are known as raw-data-based methods, such as those proposed by Košmelj and Batagelj (1990), in which a relocation clustering procedure developed for static data is modified. Feature-based and model-based approaches first convert raw time series data into a feature set or a set of model parameters, to which appropriate clustering methods are then applied. Considering that the dataset used in our study is small, and therefore, building a complex model is unnecessary, we chose the feature-based approach. Compared with the raw-data-based approach, the feature-based approach can select the most appropriate features according to different application scenarios.
Many research studies on traffic network analysis based on time series models have been reported. For example, Zhang et al. (2007a) proved that the average value of historical data considering error is a practical method to estimate travel time and speed state. Stathopoulos and Karlaftis (2011) used spectral and cross-spectral analyses to study urban traffic flows. Ghosh-Dastidar and Adeli (2006) and Vlahogianni et al. (2008) used advanced wavelet technology to analyze the time series of traffic flows. Subsequently, Vlahogianni and Karlaftis (2012) considered the weather as a factor and compared the travel speed time series of a freeway in various weather conditions using a recurrence-based complexity measuring method. Lippi et al. (2013) compared and summarized existing short-term traffic flow forecasting methods and models, and designed a new supervised learning model based on support vector machines for traffic flow forecasting.
At present, the use of taxi trajectory data to analyze short-term traffic flows for forecasting and traffic incident detection has been comprehensively studied; however, only a few of these studies have considered clustering of road speed time series. In some studies on urban traffic, researchers directly use raw time series data or raw features of time series for clustering analysis, without considering redundancy and potential correlations among these features. In some other areas of research, principal component analysis (PCA) combined with K-means clustering was found to produce better clustering results. For example, Ding and He (2004) proved that PCA-based dimension reductions are particularly effective for K-means clustering. Filho and Maia (2010) predicted short-term traffic by using PCA as a dimension-reduction technique and a K-means-based local linear model as the prediction method. However, the average speed time series of urban roads has been insufficiently studied. We apply a two-step clustering approach to classify urban road traffic characteristics. First, PCA is applied to the extracted road speed time series features to remove correlation, that is, to ensure that the newly obtained features are linearly uncorrelated; then K-means clustering is performed.
Some studies have considered urban road grade. Toplak et al. (2010) used functional road class (FRC) as a reference to classify roads to predict road travel time. Through a case study of the road system in Shenzhen City, Zhang et al. (2015) concluded that the road grade has a high impact on the traffic capacity. In addition, the capacity of high-grade roads changes considerably within a day, whereas the capacity of low-grade roads is more stable. In general, research findings on urban traffic analysis, modeling, and prediction using taxi trace data are abundant. Taxi trace data have become an important source for research on transportation and urban geography. However, mainstream research all over the world is mainly focused on the simulation or prediction of real traffic, while static information related to the road itself has been ignored so far. There is even less research on the correlation between road grade and traffic characteristics. Our study attempts to address this lacuna.
We consider the road network within the main urban area of Nanjing as an example. A two-step clustering model (Fig. 1) is applied to classify the road traffic characteristics. Various traffic characteristics of urban roads can be detected at low cost and high speed. Combined with the road grade, the implied correlation rules between traffic characteristics and road grade are revealed, which can provide deeper insights into urban road traffic. They can provide a scientific reference for city planning and optimization of city traffic management. Our research can also provide important reference information that has the potential to improve the efficiency and quality of public travel.

Study area and data acquisition
Nanjing City, the capital of Jiangsu Province, is categorized as a National Central City according to the new National Urban Planning Register. The status of Nanjing is second only to that of Shanghai in the Yangtze River Delta Urban Agglomerations (Fig. 2). Nanjing has a population of approximately 8.23 million (November 2015). There are as many as 2.15 million motor vehicles (June 2015) in Nanjing, an increase of 8.6% over the preceding six months. The rapid growth of population and increasing number of vehicles make it difficult to meet the traffic demands with the existing roads. Therefore, it is necessary to study the characteristics of the changing traffic conditions for making traffic decisions. We chose the main urban area of Nanjing, with an area of 254.75 km² surrounded by the outer ring highway and the Yangtze River, for our case study. The road network in Nanjing is highly complicated; it includes expressways, urban expressways, trunk roads, sub trunk roads, and access roads.
Original taxi trajectory data were collected from 10 013 taxis within one week of trajectory information being recorded. The source data for licensed taxis of the city were obtained from the Nanjing traffic management department, which owns and operates a taxi supervision system for real-time data. Taxi data are recorded every minute and a total of about 10 million records are generated each day. In addition, incremental data are obtained at 5-min and 30-s sampling intervals to produce a total of nearly 1.5 billion records. The data format, as listed in Table 1, contains information about the position, speed, and direction of a vehicle, passenger status, and a time stamp. Furthermore, the navigation road network data of the main urban area in vector format are also used as basic experimental data.

Data preprocessing
The raw data collected from taxis are first subjected to cleaning and filtering. The speed anomaly will affect the calculation accuracy of the average road speed; the direction anomaly will affect map matching. Therefore, speed values and direction angles beyond the normal range should be eliminated. In addition, we only selected records for working days and when passengers were transported in taxis. This was done because when taxis carry passengers, the drivers usually try to reach their destinations as soon as possible (without exceeding the speed limit) in order to increase turnover. Under this condition, the positioning information is more likely to reflect the real state of road traffic (Zhang et al., 2007b).
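The cleaning and filtering rules described above can be sketched as a simple record filter. This is a minimal sketch, not the study's implementation: the field names (`speed_kmh`, `heading_deg`, `occupied`, `timestamp`) and the 120 km/h upper bound are assumptions, since the text does not state the exact thresholds, and "working day" is approximated as Monday to Friday without a public-holiday calendar.

```python
from datetime import datetime

# ASSUMPTION: upper bound for a plausible urban taxi speed; the paper does not
# state the actual "normal range".
MAX_SPEED_KMH = 120.0

def is_valid(record):
    """Keep a record only if it passes every filter described in the text."""
    speed_ok = 0.0 <= record["speed_kmh"] <= MAX_SPEED_KMH
    heading_ok = 0.0 <= record["heading_deg"] < 360.0
    # Monday=0 ... Friday=4; public holidays are NOT handled here.
    working_day = record["timestamp"].weekday() < 5
    return speed_ok and heading_ok and working_day and record["occupied"]

def clean(records):
    """Return the records that survive cleaning, in their original order."""
    return [r for r in records if is_valid(r)]
```

Usage would be along the lines of `clean(raw_records)` before map matching, so that anomalous headings do not distort the matching step.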
The spatial expression of road elements has different forms under different environments and different levels of abstraction. The taxi trajectory coverage of the road network would not be optimal if we were to include all possible roads; thus, in this study, we simplify and abstract the original road network structure by eliminating internal campus roads, commercial pedestrian streets, residential streets, and other streets that have no relevance to the scope of this research. The road format in our model is a node-arc data structure, where a road intersection is represented by a node and some arcs as described in Fig. 1. The simplified local road network is shown in Fig. 3, in which roads are distinguished by their grades. The case study included a total of 2194 roads in the main urban area. The road grade counts are listed in Table 2.

Average road speed extraction
The average speed of one taxi on the road is the arithmetic mean of its sampled speeds:

$\bar{v}_i = \frac{1}{k} \sum_{j=1}^{k} v_{i,j}$

where $\bar{v}_i$ is the average speed of taxi $i$, $k$ is the number of samples in time interval $T$ within road interval $L$, and $v_{i,j}$ represents the $j$th sample speed of taxi $i$. Next, we calculate the mean of the average speeds of all taxis as the average road traffic speed:

$\bar{V} = \frac{1}{n} \sum_{x=1}^{n} \bar{v}_x$

where $n$ is the number of taxis on the road in time interval $T$ and $\bar{v}_x$ represents the mean speed of the $x$th taxi.
Taxi data recorded between 5:00 am and 12:00 midnight (24:00) were selected for our study, and we set time interval T at 30 min in accordance with the large amount of data that was available. However, if the sampling time interval of the original taxi data is longer than 1 min or the amount of original data is smaller, time interval T can be increased suitably. The road sections are divided according to the intersection points along the road. If the number of taxis in a 30-min interval on a road is less than three, this road section is excluded from the average speed calculation and from the clustering as well.
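The two-level averaging (per taxi first, then across taxis) with the minimum of three taxis per interval can be sketched as follows. The tuple layout of `samples` and the function name are hypothetical. Samples from different days that fall into the same half-hour slot are pooled here, which is an assumption. Note also that this sketch only drops the under-sampled (road, slot) pair, whereas the text excludes the whole road section from clustering.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

def average_road_speeds(samples, interval_min=30, min_taxis=3):
    """Average road speed per (road, time slot).

    samples: iterable of (road_id, taxi_id, timestamp, speed_kmh) tuples that
    have already been cleaned and map-matched (hypothetical layout).
    Returns {(road_id, slot_index): average speed}, where slot_index counts
    intervals of `interval_min` minutes from midnight.
    """
    # Step 1: collect the sampled speeds of each taxi on each road and slot.
    per_taxi = defaultdict(list)
    for road_id, taxi_id, ts, speed in samples:
        slot = (ts.hour * 60 + ts.minute) // interval_min
        per_taxi[(road_id, slot, taxi_id)].append(speed)

    # Step 2: arithmetic mean per taxi, grouped by road and slot.
    taxi_means = defaultdict(list)
    for (road_id, slot, _taxi), speeds in per_taxi.items():
        taxi_means[(road_id, slot)].append(mean(speeds))

    # Step 3: mean over taxis, skipping slots observed by too few taxis.
    return {key: mean(vals) for key, vals in taxi_means.items()
            if len(vals) >= min_taxis}
```

Averaging per taxi before averaging across taxis keeps a single taxi that reports many samples in one interval from dominating the road speed.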

Feature extraction
To a certain extent, shape features can reveal different time series characteristics, and such features can usually be observed directly. Local shape-based clustering of time series can find clusters of time series with a similar shape and is usually used to describe short time series (Balasubramaniyan et al., 2005). Fig. 4 shows the speed time series of one road in the main urban area on weekdays; the speed time series of many roads have similar shapes. In Nanjing City, students travel to school and people commute to work generally between 7:00 am and 9:30 am, and people generally return home after work or school between 4:00 pm and about 7:00 pm. The influence of the morning and evening peak hours of a working day on road speed is pronounced, whereas the average speed outside peak hours is relatively stable; moreover, the time series is short. Therefore, it is not appropriate to use too many features, as noise may be mistaken for characteristics of the time series. This model therefore includes the following five segmentation features in the candidate feature set: the rate of change of the road speed in the morning and evening rush hours and the average speed of the remaining three stable periods (i.e., the period preceding the morning rush hour, the period between the morning and evening rush hours, and the period following the evening rush hour, as shown in Fig. 4).
Fig. 4 Average speed time series for one typical road. The five segmentation features include the rate of change of the speed in the morning rush hours (F1, between 7:00 am and 9:30 am), the rate of change of the speed in the evening rush hours (F2, between 4:00 pm and 7:00 pm), the average speed between 5:00 am and 7:00 am (F3), the average speed between 10:00 am and 4:00 pm (F4), and the average speed between 8:00 pm and 12:00 midnight (F5).
The five variables above are chosen as the feature set $\{F1, F2, F3, F4, F5\}$ of one road. The formulas expressing the rate at which the speed changes during the rush hour and the average speed of the stable periods are given below.
(1) speed change rate during rush hours: (2) average speed of stable periods:
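The five features can be computed from a half-hourly speed series as sketched below. The time windows follow the caption of Fig. 4. The exact change-rate formula is not recoverable from the text, so the sketch ASSUMES the rate is the relative drop from the highest to the lowest speed inside the rush-hour window; the function names and the `(slot_start_hour, speed)` layout are likewise hypothetical.

```python
from statistics import mean

# Windows in hours of the day, taken from the Fig. 4 caption.
WINDOWS = {
    "F1": (7.0, 9.5),    # morning rush hours
    "F2": (16.0, 19.0),  # evening rush hours
    "F3": (5.0, 7.0),    # stable period before the morning rush
    "F4": (10.0, 16.0),  # stable period between the two rushes
    "F5": (20.0, 24.0),  # stable period after the evening rush
}

def window(series, start, end):
    """Speeds of the slots whose start hour lies in [start, end)."""
    return [v for h, v in series if start <= h < end]

def change_rate(speeds):
    # ASSUMED definition (not from the paper): relative drop from the
    # maximum to the minimum speed observed in the window.
    return (max(speeds) - min(speeds)) / max(speeds)

def extract_features(series):
    """series: list of (slot_start_hour, avg_speed_kmh). Returns [F1..F5]."""
    return [
        change_rate(window(series, *WINDOWS["F1"])),
        change_rate(window(series, *WINDOWS["F2"])),
        mean(window(series, *WINDOWS["F3"])),
        mean(window(series, *WINDOWS["F4"])),
        mean(window(series, *WINDOWS["F5"])),
    ]
```

Under this assumed definition F1 and F2 are dimensionless fractions while F3 to F5 are speeds in km/h, which is one reason the features need to be standardized before PCA.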

Principal component analysis
PCA is a dimension reduction algorithm, which can remove redundant information, highlight hidden features, and reveal the main relationship between observations (Serrano-Cinca et al., 2005). The PCA algorithm can be used to convert a set of variables, which may be correlated, to a set of linearly uncorrelated variables by orthogonal transformation. After transformation, these variables are known as the principal components. Research has shown that the PCA transform can replace original multidimensional data with fewer components on the premise that information loss is minimal (Townshend et al., 1987). The significance of using PCA in our model is that the principal components are linearly uncorrelated; hence, the distance between clusters is increased, and the distance within each cluster is reduced. Therefore, PCA can find a set of optimal features to express data in a more compact manner, which can further improve the effect of the subsequent clustering. Two points need to be addressed when using the PCA algorithm in our model. First, PCA is very sensitive to data scaling. As the various features have different value ranges, we first need to standardize the feature set, removing the units of the data so that features with different units or orders of magnitude can be compared. Second, PCA requires the desired output dimension d as an input parameter. The dimension should be determined by the explained variance ratio of each principal component. In general, when the cumulative explained variance ratio of the first n principal components reaches a predefined threshold (e.g., 95%), n can be used as the expected dimension d.
The flow of the PCA algorithm used in our model is as follows:
(1) The standard score (also known as the z-score or zero-mean normalization), written as $x_i$, is used to rescale the original characteristic data so that each feature has a mean value of 0 and a standard deviation of 1 (Table 3).
(2) The covariance matrix $XX^T$ of the normalized sample set, which is a 5 × 5 matrix in this model, is computed.
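The listing of PCA steps breaks off after step (2) in this text. The sketch below therefore completes the procedure with the standard textbook steps (eigendecomposition, ordering by explained variance, projection), which is an assumption about what the omitted steps contain. It uses NumPy; `pca_reduce` and its parameters are hypothetical names, and samples are stored as rows, so the covariance is formed as `Z.T @ Z` rather than $XX^T$.

```python
import numpy as np

def pca_reduce(features, threshold=0.95):
    """Standardize, decorrelate and reduce a feature matrix.

    features: array-like of shape (n_roads, n_features), e.g. (2194, 5).
    threshold: cumulative explained-variance ratio used to choose d.
    Returns (scores, explained_ratio, d). Assumes no feature is constant
    (a zero standard deviation would cause a division by zero).
    """
    X = np.asarray(features, dtype=float)

    # Step (1): z-score each feature (column) to mean 0, std 1.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)

    # Step (2): covariance matrix of the standardized data (n_features square).
    cov = Z.T @ Z / Z.shape[0]

    # Assumed remaining steps: eigendecomposition of the symmetric matrix,
    # with components sorted by decreasing eigenvalue.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    # Choose the smallest d whose cumulative ratio reaches the threshold.
    ratio = eigvals / eigvals.sum()
    d = int(np.searchsorted(np.cumsum(ratio), threshold) + 1)
    d = min(d, X.shape[1])

    # Project the standardized data onto the first d components.
    return Z @ eigvecs[:, :d], ratio, d
```

The returned `scores` have uncorrelated columns, which is the property the two-step model relies on before clustering.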

K-means algorithm
In this step, the K-means method is used to cluster the new feature set $\{F'_1, F'_2, \ldots, F'_d\}$ that has been transformed by PCA. K-means clustering is a commonly used unsupervised clustering method, which is a prototype-based clustering algorithm; it assumes that the clustering structure can be characterized by a set of prototypes or representative points in the sample points (Kim and Krishnapuram, 1996). K-means clustering attempts to cluster the dataset into k groups by minimizing a criterion known as inertia or the sum-of-squares distance within a cluster. In other words, we use K-means clustering to ensure that the distance from one point to another point in the same cluster is shorter than the distance to a point in any other cluster by minimizing the square error of the cluster result $C = \{C_1, C_2, \ldots, C_k\}$:

$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$

where $\mu_i$ is the mean vector of cluster $C_i$. The number of clusters k is a positive integer that needs to be set in advance; the method that is often used to determine the value of k is an iterative operation of the K-means algorithm (set k = 2, 3, 4, …). An appropriate value of k is selected according to the change point (possibly more than one) of the rate curve of the sum of the square error of each k value.
In this model, the K-means algorithm is as follows:
(1) The sample set $D$ of PCA-transformed feature vectors is taken as input.
(2) k samples are randomly selected from $D$ as the initial mean vectors $\{\mu_1, \mu_2, \ldots, \mu_k\}$.
(3) The distance from each sample $x$ to the mean vectors $\mu_1, \mu_2, \ldots, \mu_k$ is calculated. The cluster mark of $x$ will be determined according to its nearest mean vector, and $x$ will be classified into the corresponding cluster.
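The steps listed above stop at the assignment step in this text. The sketch below adds the usual update-and-repeat loop of Lloyd's algorithm, which is assumed rather than taken from the source, and returns the sum of squared errors so that several values of k can be compared. The function name, the `seed` parameter and the stopping rule are hypothetical choices.

```python
import math
import random

def kmeans(samples, k, max_iter=100, seed=0):
    """Plain K-means (Lloyd's algorithm) on a list of equal-length tuples.

    Returns (labels, centroids, sse), where sse is the within-cluster sum of
    squared distances.
    """
    rng = random.Random(seed)
    # Step (2): pick k distinct samples as the initial mean vectors.
    centroids = [tuple(s) for s in rng.sample(samples, k)]
    labels = None
    for _ in range(max_iter):
        # Step (3): assign every sample to its nearest mean vector.
        new_labels = [
            min(range(k), key=lambda c: math.dist(x, centroids[c]))
            for x in samples
        ]
        if new_labels == labels:
            break  # assignments are stable, so the means are too
        labels = new_labels
        # ASSUMED continuation: recompute each mean vector from its members.
        for c in range(k):
            members = [x for x, lab in zip(samples, labels) if lab == c]
            if members:  # an empty cluster keeps its previous mean vector
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    sse = sum(math.dist(x, centroids[lab]) ** 2
              for x, lab in zip(samples, labels))
    return labels, centroids, sse
```

Running `kmeans(vectors, k)` for k = 2, 3, 4, … and recording the returned `sse` gives the error curve from which a value of k is chosen. Because the initial mean vectors are random, several seeds are normally tried for each k.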

Model evaluation
The case study presented in this paper focuses on the road network of the main urban area. After the original time series feature set is processed by PCA, the first three principal components are shown to explain more than 98% of the variance of the original data (Table 4). Thus, the value of d is set to 3, and the dimension of the original feature set is reduced to 3 by multiplying the standardized feature set by the first three columns of the projection matrix. The average speed time series of one road segment is represented by one vector consisting of three features, which are linearly uncorrelated. The redundant and linearly related parts of the original feature vectors are removed.
Three new feature vectors are obtained from five vectors after determining the dimension and performing PCA analysis (Table 5). Among them, $x_1, x_2, \ldots, x_5$ represent the standardized vectors composed of the corresponding features of each road, and $x'_1, x'_2, x'_3$ are the resulting principal-component vectors, which are linearly uncorrelated. All 2194 3D vector samples, such as Road_1 (5.2059, −2.7650, −0.1514), are included in the sample set for K-means clustering, and the number of clusters is iterated to select an appropriate value of k. Fig. 5(a) shows how the sum of the distances from each vector to the mean vector of its cluster changes with the number of clusters, whereas Fig. 5(b) shows how the slope of this sum of distances (E) changes with the number of clusters. The boundary value of the slope curve enables the appropriate number of clusters to be found. In this case, the candidate k values are 5, 7, 9, and 11. A comparison indicated that the result would be more interpretable if the roads were to be divided into seven clusters. The clustering results are shown in Fig. 6, where each point represents one road section with three-dimensional coordinates comprising the three feature components after PCA dimensionality reduction. Points with the same color belong to the same cluster. After averaging the speed of road segments within the same cluster at different time intervals, we obtain the road speed time series of the seven clusters, as shown in Fig. 7.
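The selection of candidate k values from the slope curve can be sketched as below. The text does not define the "boundary value" precisely, so the ranking rule used here (the largest flattening of the slope, i.e. the largest second difference of E) is an assumption, and the function names are hypothetical.

```python
def slope_curve(sse_by_k):
    """Slope of the error curve between consecutive tested k values.

    sse_by_k: dict mapping k to the sum of squared errors E for that k.
    Returns {k: slope of E on the step arriving at k}.
    """
    ks = sorted(sse_by_k)
    return {k2: (sse_by_k[k2] - sse_by_k[k1]) / (k2 - k1)
            for k1, k2 in zip(ks, ks[1:])}

def candidate_ks(sse_by_k, top=3):
    """Rank k values by how sharply the slope flattens at k.

    ASSUMED criterion: bend(k) = slope leaving k minus slope arriving at k.
    Slopes are negative, so a large positive bend marks an elbow.
    """
    slopes = slope_curve(sse_by_k)
    ks = sorted(slopes)
    bend = {k1: slopes[k2] - slopes[k1] for k1, k2 in zip(ks, ks[1:])}
    return sorted(bend, key=bend.get, reverse=True)[:top]
```

Such a ranking only proposes candidates. As in the text, the final choice among them still rests on which clustering is easiest to interpret.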
Because the principal components of the dataset obtained after PCA analysis have no physical explanation, we calculated the original feature set of these seven types of road average speed time series, as presented in Table 6. Overall, the rate of change of the road speed in the morning rush hour is generally higher than that in the evening rush hour. This result shows that in the main urban area, resident trips are more concentrated in the morning rush hour. We found that the traffic characteristics of three clusters, types I, II, and VII, are similar. Their speeds are between 20 km/h and 35 km/h, which suggests that their road grade is not high. The differences among these three types are mainly reflected in the changes during rush hour. Type I shows the smoothest change during the two daily rush hours. Type II shows the most significant change during the morning rush hour. Meanwhile, during the evening rush hour, type VII has a notable change rate, and there is an obvious step drop at half past four in the afternoon, which is quite different from the smooth transitions of the other two types. Type V roads exhibit the slowest speed (under 20 km/h for almost the entire day) among all types; the impact of the morning rush hour is the smallest, and the change rate during the evening rush hour is also not pronounced. This suggests that the reason for the bad traffic conditions is not a traffic jam but the low road grade. Type VI roads are the fastest of all, which indicates that they have a high road grade. The influence of the morning rush hour on them is moderate, and the speed change is not obvious during the evening rush hour (7.31%). The traffic speed of type III, which also has a high road grade, is the second highest after type VI. During the two daily rush hours, its speed change rate is the highest of all types. We think that such roads are the most easily congested road sections in urban areas. The reconstruction of this type of road may significantly improve the traffic situation in the city. The speed of type IV is maintained at almost 40 km/h. Compared with other types, its change rate is not obvious, and its speed change rate is the most stable during the evening rush hour. This indicates that its traffic during working days is ideal and stable.
The cluster results for types III and VI (Fig. 7) revealed some characteristics that we had not considered. Originally, noon was considered a stable period; however, there is an obvious step-wise reduction between 12:00 noon and 6:00 pm. Moreover, after the evening rush hour, the speed curves begin to increase.

Correlation analysis of road traffic characteristics and road grade
Road grade is set according to traffic demand and relevant regulations by the urban road design planning department. All roads belonging to a particular grade have similar functions and design speed. Ideally, roads of the same grade should exhibit the same traffic characteristics. The grade should also be able to meet the actual traffic demand. For each road grade, we calculated the shares of the seven types of roads obtained by the clustering model (Table 7). The spatial distribution of the seven types of road sections is superimposed on the road network labelled by grade (Fig. 8). From Table 7, we can see that the actual traffic characteristics of roads are highly correlated with the road grades. Expressways predominantly show type VI characteristics: 83.88% of expressway sections are type VI roads, which can maintain a high speed of around 70 km/h for the entire day, while 11.89% of the sections exhibit type III characteristics, with a speed of under 70 km/h, and are seriously affected by rush hours. Two expressway sections in the south of the study area belong to type III, as shown in Fig. 8(c). They connect some urban expressway sections, which implies that they are important routes during the rush hours on workdays.
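Per-grade shares of the kind reported in Table 7 amount to a cross-tabulation of road grade against cluster type. A minimal sketch follows, assuming each road section carries a grade label and a cluster type and that shares are computed by section count. Whether Table 7 weights sections by count or by length is not stated, and the function name is hypothetical.

```python
from collections import Counter, defaultdict

def type_share_by_grade(roads):
    """Percentage of each cluster type within every road grade.

    roads: iterable of (grade, cluster_type) pairs, one per road section.
    Returns {grade: {cluster_type: percent of that grade's sections}}.
    """
    counts = defaultdict(Counter)
    for grade, cluster_type in roads:
        counts[grade][cluster_type] += 1
    return {
        grade: {t: 100.0 * n / sum(c.values()) for t, n in c.items()}
        for grade, c in counts.items()
    }
```

Reading the result row by row (one grade at a time) shows which traffic type dominates each grade, which is how the shares are interpreted in this section.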
An urban expressway is designed as a fast road for vehicles only, with carriageways divided by an isolation belt. Its function is to connect the main urban districts, the central urban area, and outlying satellite towns. The design speed lies between 60 and 100 km/h. It is mainly composed of type III (41.95%) sections, which allows traffic conditions to meet the design standard. 13.28% of the urban expressway has type VI (expressway) characteristics. As can be seen in Figs. 8(c)(d), most urban expressways show type III characteristics, which implies that the phenomenon of 'tide' traffic is obvious. These sections bear the heaviest traffic pressure during rush hours (with the maximum speed change rate) in the main urban area of Nanjing. At the same time, 29.37% of the sections are of type IV standard and do not meet the design expectations, including some sections of the inner ring (some of the south-north tunnels such as the Jiuhuashan Tunnel in the east).
A trunk road forms the road network framework of a city, because it connects industrial zones, residential areas, and stations. It is a critical road that undertakes the main traffic task of the city, with a design speed of between 40 and 60 km/h. It is surprising that only 39.91% of trunk roads meet the type IV standard (speed above 40 km/h during the day). A large proportion (nearly 44%) of the trunk road sections are of type I, type II, or type VII standards. All three types are characterized by low average speed and poor traffic conditions. Thus, nearly half of the trunk road sections in the main city do not meet the design expectations. As seen in Figs. 8(a)(b)(d), the trunk roads outside the inner ring expressway (Fig. 8(d)) mostly meet the traffic demands. A part of the trunk roads with better traffic conditions is distributed in the southwestern part of the study area, namely the 'Hexi CBD' area in Nanjing. The density of residential buildings in this area is relatively low. Besides, the roads here were planned later than those in the central area. In the internal loop of Nanjing (central area), most trunk road sections present two characteristics, types I and II. In particular, the two trunk roads through the downtown business district of Xinjiekou have poor traffic conditions on weekdays.
Sub trunk roads are ordinary traffic roads in the urban area, whose main functions are regional traffic and service. Together with trunk roads, they form a road network that is widely connected with all districts and is used to disperse trunk road traffic pressure. According to Table 7, we find that 80% of the sub trunk roads (types I, II, V, and VII) show lower speeds (almost 30 km/h all day), failing to achieve the design expectations for sub trunk roads. Sub trunk roads with poor traffic conditions are mainly distributed within the inner ring roads and within the Gulou District on the northwestern side, outside the inner ring.
An access road is the connecting road between a sub trunk road and a residential road, which resolves the traffic in some local areas and performs service functions. The design speed of access roads is mostly between 20 and 40 km/h. According to Table 7, 57.49% of the access roads in the main urban area are type V sections, where the speed remains under 20 km/h after 6:00 am, which does not meet the design speed. It can be observed from Fig. 8(e) that most of the access roads in the central urban area are of type V standard. These sections tend to be distributed between residential areas in the Gulou District and Xuanwu District, where residential buildings were built early and the population density is high. Urban residents have to travel through these routes to go to work. The remaining access roads are composed of types I, II, and VII (accounting for a total of 42.51%) sections, where the actual traffic conditions meet the expected design. These sections of roads are mostly distributed outside the central city (Fig. 8(a)(b)(g)), such as in the Jianye District, on the west side of the inner ring roads, where the population density is relatively low compared with the Gulou District in the central urban area.
In addition to the correlation between the road grade and traffic characteristics, we found that some points of interest (POI) have a significant impact on the road traffic conditions. This is sometimes the key reason for actual road characteristics not matching the road grade. For example, the impact of large hospitals on the trunk and sub trunk roads is relatively large (Fig. 8(b)). The speed of the road sections adjacent to a large hospital is obviously lower than the design speed. The road connected with the Nanjing Railway Station is a part of an urban expressway. However, owing to heavy traffic and frequent activities that involve passengers getting on and off, the actual speed is lower than the design speed for the section. The adjacent sections show type I characteristics (the northeast corner of Xuanwu Lake, shown in Fig. 8(a)). The common features of the POI above are reflected in two aspects: a high number of visitors and a consistent situation throughout the day.

Conclusions
Using large-scale taxi tracking data, this study proposed a simple two-step time series clustering model for classification of city road traffic characteristics within Nanjing, China. This model can classify the characteristics of road traffic effectively through simple steps without requiring heavy computation, and then find the correlation between the characteristics of roads and road grades. The main conclusions drawn are as follows: (1) There is a high correlation between the road grade and the traffic characteristics, which confirms the deduction by Zhang et al. (2015). (2) The speed change rates of roads do not increase as the road grade becomes higher. In fact, sections of the same road grade exhibit varying traffic characteristics. These discrepancies are due to the population density, type of land use, economic level, and even some important POI (such as hospitals and railway stations) in different regions. (3) Except for expressways and urban expressways, more than half of the road sections of all other grades in urban areas have actual traffic conditions that fail to meet the design standards, in particular in the central area of Nanjing; this demonstrates that the road grade structure of urban roads is not appropriate and the branch system is not perfect. In the event of major road construction or emergency situations, the lack of diversion roads may have a significant impact on traffic flow.