Connected Vehicle

Data Science applications for automotive data

Vehicles are increasingly becoming complex digital systems that generate tremendous volumes of high-frequency data. With greater use of sensors, connectivity, GPS and driver aids such as advanced driver assistance systems (ADAS), and with the gradual shift towards autonomous vehicles, the data generated by each vehicle is expected to grow exponentially.

The volume of data from a vehicle can range from a few kilobytes per hour of driving, when the vehicle is tracked at intervals of a few minutes, to tens of megabytes per hour when high-frequency sensor data is transmitted. This presents two challenges: first, transmitting such large volumes of data and, second, analysing them to find patterns. The first can be addressed through some level of edge computing, using pre-trained models deployed on onboard devices capable of processing the data. This economises on transmission by sending only data that requires further attention or that serves as input to other models in later steps of analysis. Analysing and modelling the transmitted data provides tremendous value in addressing problems related to vehicle health, safety, efficiency and utilization at the individual vehicle or fleet level, as well as in providing alternatives to regulation and enforcement at the regional and city levels. This article presents examples of the value that analytics can provide for such data, some of them drawn from our own experience in this area. It is through such insights that efficiency gains can be realized and cascade down the value chain of the automotive sector.

Vehicle health

One compelling use case for data analysis is determining vehicle health status from ECU sensor data. A vehicle’s ECUs report multiple parameters that could indicate potential vehicle issues. The operating ranges of these parameters could serve as thresholds for detecting anomalies, but an out-of-range value is not by itself indicative of poor vehicle health: a certain number of anomalous values can be expected in normal operation, and the parameter may subsequently recover to within its normal operating range. Indeed, the diagnostic trouble codes reported by vehicles are based on such operating ranges together with frequency of occurrence.

Establishing parameter thresholds for specific vehicle models while controlling for different ambient conditions and driving states has been an area of interest in which data mining can play a key role. Here, data from a fleet of vehicles reporting parameters at periodic intervals can be used to define vehicle sub-system operating regimes and outlier thresholds for specific vehicle models in different ambient conditions, once it has been cleaned and filtered to remove invalid values and treated for missing values and outliers. Operating regimes could be defined for individual parameters or for groups of parameters, for multiple ambient conditions (such as ambient temperature, pressure, relative humidity and altitude) as well as for different driving states (idling, acceleration, braking, cruising, etc.). Defining operating regimes by ambient condition and driving state ensures that driving behaviour and extraneous factors are controlled for, so that only the variations due to vehicle health are factored in. The thresholds for these operating regimes could themselves be parameterised using different multivariate statistical techniques: distance based (such as Euclidean or Mahalanobis distance), correlation based, or histogram (distribution) based.
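As an illustration of a distance-based threshold, the sketch below fits a baseline for one operating regime from two parameters and measures how far a new reading lies from it using the Mahalanobis distance. It is a minimal sketch, not our production method: the parameter pair (coolant temperature and engine load), the sample values and the cut-off of 3.0 are all assumptions made for the example.

```python
import math
from statistics import mean


def fit_regime(samples):
    """Estimate the centre and inverse covariance of 2-parameter samples."""
    xs = [s[0] for s in samples]
    ys = [s[1] for s in samples]
    mx, my = mean(xs), mean(ys)
    n = len(samples)
    # Sample variances and covariance (n - 1 in the denominator).
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular; need more varied data")
    # Closed-form inverse of the 2x2 covariance matrix.
    inv = ((syy / det, -sxy / det), (-sxy / det, sxx / det))
    return (mx, my), inv


def mahalanobis(point, centre, inv):
    """Mahalanobis distance of a 2-parameter reading from the regime centre."""
    dx = point[0] - centre[0]
    dy = point[1] - centre[1]
    d2 = (dx * (inv[0][0] * dx + inv[0][1] * dy)
          + dy * (inv[1][0] * dx + inv[1][1] * dy))
    return math.sqrt(max(d2, 0.0))


# Hypothetical fleet readings for one regime: (coolant temp in C, engine load in %).
BASELINE = [(88, 30), (90, 35), (92, 42), (89, 33), (91, 38), (93, 41), (90, 36)]
THRESHOLD = 3.0  # assumed cut-off, not a value from the article

centre, inv = fit_regime(BASELINE)
is_outlier = mahalanobis((105, 30), centre, inv) > THRESHOLD
```

Unlike a fixed range on each parameter, the distance accounts for how the two parameters normally move together, so a reading can be flagged even when each value is individually plausible.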

Since data is pooled from multiple vehicles, the thresholds for each operating regime can be mined or “learned” from historic data for a given vehicle model and continuously refreshed as new data becomes available. Care needs to be taken, of course, to estimate the baselines of each operating regime only after filtering out data from vehicles displaying abnormal performance, which could be achieved through several unsupervised learning methods. The health of a vehicle could then be determined by comparing that vehicle’s parameters within an operating regime to the baseline values, and the vehicle’s overall health score could be arrived at by combining the scores for each operating regime.
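One simple way to combine per-regime results into an overall score is sketched below. The scoring rule (share of observations within the regime threshold) and the observation-weighted average are assumptions chosen for illustration; the article does not prescribe a particular formula, and the regime names and distances are made up.

```python
def regime_score(distances, threshold):
    """Share of observations within the regime threshold, scaled to 0-100."""
    if not distances:
        raise ValueError("no observations for this regime")
    within = sum(1 for d in distances if d <= threshold)
    return 100.0 * within / len(distances)


def overall_health(regimes, threshold=3.0):
    """Average of the per-regime scores, weighted by observation count."""
    total = sum(len(d) for d in regimes.values())
    weighted = sum(regime_score(d, threshold) * len(d) for d in regimes.values())
    return weighted / total


# Hypothetical distances from baseline for one vehicle, grouped by driving state.
vehicle = {
    "idling": [0.5, 1.2, 3.4, 0.9],
    "cruising": [0.7, 0.8, 1.1, 2.9, 4.2, 1.0],
}
score = overall_health(vehicle)
```

Weighting by observation count keeps a regime with only a handful of readings from dominating the overall score.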

Implementing this framework can be quite challenging: it requires a combination of non-trivial data management tasks at the edge and interactions with a backend system that is continuously mining data from a pool of vehicles and refreshing baselines. The health scoring of vehicles could be designed to occur either at the edge, by deploying algorithms on onboard systems, or on backend systems that expose the scores for a vehicle through APIs for display on dashboards. The choice depends on the use cases for which the system is designed. For example, if the need is to provide real-time information on the vehicle’s health on the vehicle’s dashboard, it would be appropriate to deploy the scoring algorithm on the edge and have the vehicle communicate with the backend only to fetch refreshed baseline values at periodic intervals. The scores may also be transmitted to the backend for fleet dashboard displays. On the other hand, if the need is to monitor vehicle health only at the fleet level, the vehicle parameters could undergo some pre-processing, with aggregations and transformations done onboard the vehicle to minimize data transmission. This aggregated data could then be used on the backend for scoring and monitoring.
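The onboard aggregation described for the fleet-level case might look like the following sketch, which collapses fixed-size windows of raw readings into summary records before transmission. The window size, the chosen statistics and the field names are assumptions for the example.

```python
from statistics import mean


def summarise_window(readings):
    """Collapse one window of raw readings into a compact summary record."""
    return {
        "count": len(readings),
        "min": min(readings),
        "max": max(readings),
        "mean": mean(readings),
    }


def aggregate(readings, window_size):
    """Split readings into consecutive windows and summarise each one."""
    return [
        summarise_window(readings[i:i + window_size])
        for i in range(0, len(readings), window_size)
    ]


# Hypothetical coolant temperature samples; each window of 3 becomes one record.
raw = [90, 92, 94, 95, 97, 99, 91]
summaries = aggregate(raw, window_size=3)
```

Here seven raw readings become three records; with longer windows and high-frequency signals the reduction in transmitted data is correspondingly larger.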

Driver behaviour and risk scoring

An analysis of accidents in the United States shows that excessive speeding was a factor in more than 28% of all accidents (National Highway Traffic Safety Administration, 2014). Hard braking, which can indicate distracted driving, is another risk factor.

We have analysed over 85 million miles of driving data in the United States and Canada and modelled the link between speeding, hard braking and rapid acceleration behaviour and the real accident records of these vehicles. This has been the basis for a driver scoring algorithm that makes use of a multi-layered data set comprising behaviour data overlaid with weather, time of day, magnitude and frequency of events, vehicle class, etc. A model of these driver scores and accidents shows that a 10-point increase in the Braking score reduces the risk of an accident by 23.6%. Similarly, a 10-point increase in the Acceleration score reduces risk by 3.3%, and a 10-point increase in the overall Driver Score corresponds to an 11.4% decrease in risk. When Exposure Risk is also considered in the overall score (used as an Insurance Score), a 10-point increase in the overall score corresponds to a 48.6% decrease in the risk of a preventable accident. These models show that telematics data can provide an excellent measure of behavioural and exposure risk, which insurers can use to price products and individual drivers and fleets can use to improve driving behaviour and mitigate risk. Monitoring driver behaviour for fleets and providing feedback can significantly reduce accident risk and improve fleet efficiency, while usage-based insurance programmes that use driver scores to incentivize good driving behaviour can contribute to overall safety improvements in the transportation sector.
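To read these figures for score changes other than 10 points, one has to assume a functional form. The sketch below assumes a log-linear relationship, in which each 10-point increase multiplies risk by the same factor. That is a common form for risk models but an assumption here, since the article does not state the model used. Only the per-10-point reductions come from the text above.

```python
def relative_risk(delta_points, reduction_per_10_points):
    """Risk relative to the starting score after a change of delta_points.

    Assumes each 10-point increase multiplies risk by
    (1 - reduction_per_10_points), so the effects compound.
    """
    return (1.0 - reduction_per_10_points) ** (delta_points / 10.0)


# Per-10-point reductions quoted in the article.
BRAKING = 0.236
ACCELERATION = 0.033
DRIVER_SCORE = 0.114

# Relative risk after a 20-point improvement in the overall Driver Score.
rr_20 = relative_risk(20, DRIVER_SCORE)
```

Under this assumption a 20-point improvement in the Driver Score leaves about 78% of the original risk (0.886 squared), rather than the 77.2% that simply doubling the 11.4% reduction would suggest.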

Analysing historical driving patterns is another powerful way of assessing risk. Machine learning models of individual driving to predict behaviours based on past patterns of driving can be used to pre-empt risky driving behaviour through incentives or appropriate notifications delivered to the vehicle’s dashboard. 

More recently, vision-based Advanced Driver Assistance Systems have become available as retrofits. These alert drivers to speed limits using image recognition, and provide collision warnings or stopping distance estimates using more advanced models. They also offer lane departure warnings as well as distracted driving and drowsy driving detection using pre-trained AI models. Such in-vehicle systems are estimated to reduce fatalities by up to 16 per cent, based on reported crash data from 1999 to 2008 (Source: CASR Report Series, CASR094. Centre for Automotive Safety Research, The University of Adelaide, Australia). Video snippets of violations are sent to a cloud repository where they can be reviewed as and when required.

Fuel economy

Fuel economy is another area in which data analysis can provide useful insights. In India, the Automotive Research Association of India (ARAI) typically tests new vehicle models on a chassis dynamometer that simulates driving in India using the modified Indian Driving Cycle (IDC), for emissions as well as fuel performance. These tests provide a fuel economy value for each vehicle model, but this can differ significantly from actual performance in real-world conditions because of factors such as terrain, weather conditions, driving behaviour and differences in driving patterns. The large volumes of data available from multiple vehicles at a granular level – every second or even at the end of a trip – can be used to estimate fuel economy at the trip level using population-based models, and can offer tremendous value in benchmarking fuel economy for each vehicle make and model.

For example, using fuel consumption data from over 210 million miles, we have been able to develop such fuel economy benchmarks for over 2500 year-model-make-engine types for multiple geographic regions and seasons. We can rate a trip against the benchmark for its average speed, region and season, which can highlight whether the trip’s fuel efficiency was poor because of poor driving behaviour or because of a possible vehicle health issue. Such population-based benchmarks can help regulators assess whether the actual fuel performance of vehicles meets prescribed standards and set standards accordingly. Vehicle manufacturers can also use such data and benchmarks in complying with policy and regulations.
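Rating a trip against a benchmark can be sketched as a lookup followed by a ratio. The benchmark table, its key structure (vehicle type, region, season, speed band), the 10 km/h band width and all the values below are hypothetical; they only illustrate the shape of the comparison.

```python
def speed_band(avg_speed_kmph, width=10):
    """Return the (lower, upper) speed band containing the average trip speed."""
    lower = int(avg_speed_kmph // width) * width
    return (lower, lower + width)


def rate_trip(trip_km_per_litre, benchmark_km_per_litre):
    """Ratio of trip fuel economy to the benchmark; below 1.0 is worse."""
    return trip_km_per_litre / benchmark_km_per_litre


# Hypothetical benchmark table: fuel economy in km per litre.
BENCHMARKS = {
    ("2019-hatch-1.2L", "north", "summer", (40, 50)): 12.5,
    ("2019-hatch-1.2L", "north", "summer", (50, 60)): 14.0,
}

key = ("2019-hatch-1.2L", "north", "summer", speed_band(47.5))
rating = rate_trip(10.0, BENCHMARKS[key])
```

A rating well below 1.0 marks a trip for closer inspection; separating driving behaviour from vehicle health as the cause requires the additional behaviour and health signals discussed elsewhere in this article.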

Modelling fuel consumption with respect to driving behaviour can also establish a quantitative link between individual driving behaviours, such as braking and speeding, and the additional fuel consumed as a result. This can quantify the fuel costs of poor driving behaviour. For example, for a particular medium commercial vehicle model, we have been able to relate a single hard braking event to an increase of 0.16 litres of fuel and speeding for 1 minute above 120 km/h to an increase of 0.05 litres.
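Applied to a trip or a reporting period, these two coefficients give a rough estimate of the extra fuel attributable to behaviour. The sketch below uses the figures quoted above, which hold only for the particular medium commercial vehicle model mentioned. The additive form, the event counts and the fuel price are assumptions for illustration.

```python
# Coefficients quoted in the article for one medium commercial vehicle model.
LITRES_PER_HARD_BRAKE = 0.16
LITRES_PER_SPEEDING_MINUTE = 0.05  # per minute above 120 km/h


def extra_fuel_litres(hard_brakes, speeding_minutes):
    """Estimated additional fuel, assuming the two effects simply add up."""
    return (hard_brakes * LITRES_PER_HARD_BRAKE
            + speeding_minutes * LITRES_PER_SPEEDING_MINUTE)


def extra_fuel_cost(hard_brakes, speeding_minutes, price_per_litre):
    """Cost of the additional fuel at a given (assumed) price per litre."""
    return extra_fuel_litres(hard_brakes, speeding_minutes) * price_per_litre


# Hypothetical day of driving: 5 hard braking events, 10 minutes of speeding.
litres = extra_fuel_litres(hard_brakes=5, speeding_minutes=10)
```

For this hypothetical day the estimate is about 1.3 litres: 0.8 litres from braking and 0.5 litres from speeding.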

Accelerometer data applications

Onboard vehicle devices are typically equipped with accelerometers that can detect accidents when a certain g-force threshold is crossed. This is a form of edge computing that requires the edge deployment of a classification model to distinguish real impact events from false positives, such as those triggered when a vehicle goes over a large pothole at high speed. Classification models are developed using high-frequency data (a minimum of 100 Hz) and use various inputs in addition to the accelerometer. Such accident detection algorithms are useful for providing real-time notifications for emergency response in the event of an incident. The data obtained from accelerometers and other sensors such as gyroscopes, along with other contextual information, is also very useful in accident reconstruction and in injury classification, providing a better understanding of the event that can serve as useful input for vehicle insurers.
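The first stage of such a pipeline, flagging candidate events where the g-force threshold is crossed, can be sketched as follows. This covers only the threshold step; the classification model that separates real impacts from potholes is not shown. The 2.5 g threshold and the sample values are assumptions for the example.

```python
import math


def magnitude_g(ax, ay, az):
    """Magnitude of a 3-axis accelerometer sample, in g."""
    return math.sqrt(ax * ax + ay * ay + az * az)


def candidate_events(samples, threshold_g):
    """Return (start, end) index pairs of runs exceeding the threshold."""
    events = []
    start = None
    for i, (ax, ay, az) in enumerate(samples):
        above = magnitude_g(ax, ay, az) > threshold_g
        if above and start is None:
            start = i  # a new run begins
        elif not above and start is not None:
            events.append((start, i - 1))  # the run ended on the previous sample
            start = None
    if start is not None:
        events.append((start, len(samples) - 1))  # run continues to the end
    return events


# Hypothetical samples in g: normal driving with a short spike in the middle.
samples = [(0, 0, 1), (0.1, 0, 1), (3, 0.5, 1), (4, 1, 1), (0.2, 0, 1), (0, 0, 1)]
events = candidate_events(samples, threshold_g=2.5)
```

Each flagged run, together with surrounding samples and other sensor inputs, would then be passed to the classifier to decide whether it is a genuine impact.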

Accelerometer data can also be used to model and classify the road surfaces that a vehicle has driven on. This has applications in determining additional stresses on tyres as well as vibrations that could affect certain vehicle systems. Classifying roads in terms of their quality can also help local governments identify stretches that need maintenance and can provide transparency in the way funds are allocated to different areas under their purview.
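A very simple stand-in for road surface classification is to compute the root-mean-square variation of vertical acceleration over a road segment and bin it. This is an illustrative heuristic rather than the modelling approach referred to above, and the class boundaries are assumed values that would need calibrating per vehicle and speed.

```python
import math
from statistics import mean


def vertical_rms(az_samples):
    """RMS deviation of vertical acceleration (in g) from its segment mean."""
    centre = mean(az_samples)
    return math.sqrt(mean([(a - centre) ** 2 for a in az_samples]))


def classify_surface(rms, smooth_max=0.05, rough_min=0.2):
    """Bin an RMS value into a surface class using assumed cut-offs."""
    if rms <= smooth_max:
        return "smooth"
    if rms >= rough_min:
        return "rough"
    return "moderate"


# Hypothetical vertical acceleration samples for two road segments.
segment_a = [1.0, 1.0, 1.0, 1.0]
segment_b = [0.5, 1.5, 0.5, 1.5]
labels = [classify_surface(vertical_rms(s)) for s in (segment_a, segment_b)]
```

Subtracting the segment mean removes the constant 1 g contribution of gravity, so the measure reflects vibration rather than the sensor's resting value.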

Electric vehicles

The shift to EVs has been rapid over the past few years and this shift is expected to accelerate even further over the next few years. In addition to EV health monitoring and driving behaviour monitoring described above, vehicle parameters (voltage, current, temperature from multiple sub-systems) can be used along with ambient conditions and driving styles to model and compute more accurate and reliable estimates of driving range at different States of Charge (SOC) than currently available algorithms. Similarly, the relationships between multiple battery parameters can be used to model the State of Health (SOH) of batteries to indicate whether the battery needs to be replaced. At the system level too, optimizing charging networks and utilities (especially integration with renewable sources) and other infrastructure integration can be better informed through data analysis of driving patterns, resource availability and operations. The integration of EVs with smart grids is another emerging use case that is likely to generate more interest in the future with tremendous applications for the use of vehicle batteries to supply power to the grid during peak demand, to detect network deficiencies, maximize renewable energy uptake for vehicle charging, etc.
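For reference, the kind of simple range calculation that data-driven models aim to improve on can be written in a few lines. The sketch below is that naive baseline, not the modelled estimate described above: it treats State of Health as the ratio of measured to rated capacity and assumes a constant energy consumption per kilometre. All numbers are hypothetical.

```python
def state_of_health(measured_capacity_kwh, rated_capacity_kwh):
    """Capacity-based State of Health as a fraction of rated capacity."""
    return measured_capacity_kwh / rated_capacity_kwh


def naive_range_km(rated_capacity_kwh, soh, soc, consumption_kwh_per_km):
    """Range from remaining energy at a fixed, assumed consumption rate."""
    remaining_kwh = rated_capacity_kwh * soh * soc
    return remaining_kwh / consumption_kwh_per_km


# Hypothetical pack: 40 kWh rated, 36 kWh measured, currently at 50% SOC.
soh = state_of_health(36.0, 40.0)
range_km = naive_range_km(40.0, soh, soc=0.5, consumption_kwh_per_km=0.15)
```

The fixed consumption figure is where this baseline is weakest, since actual consumption varies with ambient temperature, driving style and terrain, which is precisely what the parameter-based models described above are meant to capture.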


This article has outlined several applications and use cases of analysing and modelling vehicle data. These range from supporting the vehicle development efforts of OEMs to improving the safety, fuel economy and operational efficiency of commercial fleets as well as passenger vehicles. Integrating contextual data such as weather and traffic conditions, along with behavioural inputs that might be derived from the vehicle data itself or from additional sensors or feeds, provides an opportunity for richer analysis with deeper insights. The EV industry, including EV charging, is in the early stages of development. Integrating advanced data analytics and modelling can add tremendous value for a variety of stakeholders across the value chain and can produce innovative business models to monetize such data. The automotive sector offers great potential within the overall mobility space, and the first movers harnessing the power of data analytics are likely to benefit greatly in the application and monetization of automotive data.


Dr. Ashwin Sabapath

Director, Data Science

Danlaw Inc

Ashwin has over 20 years of research and consulting experience and currently heads the Data Science practice at Danlaw Inc. He is responsible for building innovative solutions involving advanced statistical techniques applied to automotive and telematics data and has been leading the development of an analytics platform for the automotive sector integrating multiple services for vehicle health, risk scoring and other sensor based models.

Published in Telematics Wire
