Connected Vehicle Data to Traffic Measures

Programmed by: Ken Fukutomi

Tools: Python, Dask, Azure SQL Database, ArcGIS

Skills:

Pagination, Big Data Analytics, Optimization

i. Preprocessing

When I first wrote the code for on-road StreetLight data, I found that the raw data was far too large for in-memory libraries like Pandas to run the entire script. So I had to explore more efficient and cost-effective ways to process the data.

I initially thought of using Spark for this task, since I had worked on another project where I wrote Spark code for graph processing. But I found that Spark wasn't the best fit here. I preferred to write native Python code, with a lighter-weight transition from local computing to cluster computing. Dask handles parallel computing and integrates well with existing code, so that was the way to go.

Figure: Example look of initial data processed into the script

As simple as the filtering process may seem, performing a spatial join on over 100 million individual vehicle record points was not feasible: first, with the CPU of my working computer, and second, within the time the deadline allowed. So in that initial paginated script, we added a heuristic prefiltering step based on an H3 cell index, and gathered the filtered data into Apache Parquet format.
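The coarse-then-exact idea behind that prefilter can be sketched as below. This is a minimal sketch under stated assumptions, not the project's code: a plain lat/lon grid stands in for the H3 index so the example has no dependencies, and the names `cell_of`, `cells_covering_bbox`, `prefilter`, and the `CELL_DEG` cell size are all hypothetical.

```python
import math

CELL_DEG = 0.01  # assumed cell size in degrees; a stand-in for an H3 resolution


def cell_of(lat, lon, size=CELL_DEG):
    """Map a point to a coarse grid cell id (stand-in for an H3 cell index)."""
    return (math.floor(lat / size), math.floor(lon / size))


def cells_covering_bbox(min_lat, min_lon, max_lat, max_lon, size=CELL_DEG):
    """Return every grid cell that overlaps a bounding box around the study area."""
    lo_r, lo_c = cell_of(min_lat, min_lon, size)
    hi_r, hi_c = cell_of(max_lat, max_lon, size)
    return {(r, c) for r in range(lo_r, hi_r + 1) for c in range(lo_c, hi_c + 1)}


def prefilter(points, keep_cells, size=CELL_DEG):
    """Keep only (lat, lon) points whose cell is in keep_cells (a cheap set lookup)."""
    return [p for p in points if cell_of(p[0], p[1], size) in keep_cells]


# Hypothetical sample points: two near the study area, one far away.
points = [(42.285, -83.745), (42.5, -83.0), (42.281, -83.749)]
keep = cells_covering_bbox(42.275, -83.755, 42.295, -83.735)
candidates = prefilter(points, keep)
```

Only the surviving `candidates` would then go through the expensive spatial join, which is where the savings come from: the cell lookup is a hash-set membership test per point rather than a geometry test.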

ii. Corridor-Level Analysis

Great, we reached the point where the data was loaded as neat Parquet files. Despite this, each bounding box of the H3 cell still contained a huge amount of information. From here, we filtered further by writing code in Golang to compute whether a vehicle's travel point fell within the shapefile of an actual corridor. For smaller corridors, this could also have been handled in Python with a combination of Dask and Dask-Geopandas.
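The point-in-corridor check itself can be illustrated with a ray-casting test. This is a sketch in Python for readability, not the Golang code used in the project; the function name `point_in_polygon` and the rectangular sample corridor are hypothetical, and coordinates are assumed to be in a planar projection.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    polygon is a list of (x, y) vertices in order; the last vertex is
    implicitly connected back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # each crossing to the right flips the state
    return inside


# Hypothetical corridor: a 10 x 1 rectangle.
corridor = [(0, 0), (10, 0), (10, 1), (0, 1)]
on_corridor = point_in_polygon(5, 0.5, corridor)
off_corridor = point_in_polygon(5, 2, corridor)
```

The test costs one pass over the polygon's edges per point, so it pairs naturally with a coarse prefilter that shrinks the number of points first.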

Figure: Corridor-level filtering across H3 bounding cells

Now we had a well-connected graph of corridors situated near major roads, arterials, highways, etc. Much of the remaining computation dealt with determining and validating trips in a time-efficient manner -- using data structures, algorithms, and vector-based computations wherever possible. At this point, the analysis focused on:

Outlier and anomaly detection, computing corridor-level metrics, and comparing A/B performance measures, all designed to scale from the current data size to larger future datasets.
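As one illustration of the outlier step, the sketch below flags anomalous corridor travel times with a median absolute deviation (MAD) rule. The write-up does not say which detection method was used, so the MAD rule, the 3.5 threshold, the function name `flag_outliers`, and the sample values are all assumptions chosen for the example.

```python
from statistics import median


def flag_outliers(travel_times, threshold=3.5):
    """Return one bool per value: True where the modified z-score exceeds threshold."""
    med = median(travel_times)
    mad = median(abs(t - med) for t in travel_times)
    if mad == 0:
        # More than half the values equal the median; nothing is flagged here.
        return [False] * len(travel_times)
    # 0.6745 rescales the MAD so the score is comparable to a standard z-score.
    return [abs(0.6745 * (t - med) / mad) > threshold for t in travel_times]


# Hypothetical travel times in seconds for one corridor; 600 is a stalled trip.
times = [118, 120, 122, 125, 119, 121, 600]
flags = flag_outliers(times)
clean = [t for t, bad in zip(times, flags) if not bad]
corridor_median = median(clean)
```

A median-based rule is used here because a single extreme trip barely moves the median, whereas it would inflate a mean and standard deviation enough to hide itself.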

Results: This was an iterative process of optimizing so we could streamline everything into one step, without having to run it multiple times. I ran into some minor bugs and a few crashes, but in the end, I was able to reduce the overall processing time from 5+ hours to under 2 hours.

iii. Signal Progression

Figure: Intersection event split per retimed intersection

My final step of the project was getting the signal progression data of each traveling vehicle. How exactly do we determine where the intersection is without manually inputting the shapefile data for each intersection?

I split trips by detecting intersection events: short windows where heading variance and lateral dispersion across vehicles increase near the retimed corridor. I then snap the candidate location to the nearest OSM intersection node using OSMnx and only keep points within a small buffer around that node. For each trip segment in this buffer, I compute the segment bearing and the corridor’s reference direction, then assign a side using the sign of the 2D cross product (equivalently, the sign of sin θ) together with cos θ for magnitude. The split keeps samples that are before the crossing point (the last sample whose distance to the node is above a small threshold). If a trip's captured samples are mostly after the node, the split is invalid, so I invert the labels to correct it. The result matches what you see in the figure: clean pre-intersection splits aligned to the retimed corridor.
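The side assignment described above reduces to a cross product and a dot product between two direction vectors. The sketch below shows only that step, assuming both vectors are `(dx, dy)` pairs in a planar projection; the function name `side_and_alignment` and the sign convention (positive means left of the reference direction) are assumptions for the example, and the OSMnx snapping and buffering are not shown.

```python
import math


def side_and_alignment(segment, reference):
    """Return (side, cos_theta) for a trip segment relative to the corridor.

    side is +1 if the segment points to the left of the reference direction,
    -1 if to the right, and 0 if the two are parallel. cos_theta is 1 when
    the segment runs with the corridor and -1 when it runs against it.
    """
    sx, sy = segment
    rx, ry = reference
    norm = math.hypot(sx, sy) * math.hypot(rx, ry)
    if norm == 0:
        raise ValueError("zero-length vector")
    cross = rx * sy - ry * sx  # |r||s| sin(theta), theta measured from reference to segment
    dot = rx * sx + ry * sy    # |r||s| cos(theta)
    side = (cross > 0) - (cross < 0)  # sign of the cross product
    return side, dot / norm


# Hypothetical corridor running due east, with three sample segments.
east = (1, 0)
left_side, left_cos = side_and_alignment((1, 1), east)
right_side, _ = side_and_alignment((1, -1), east)
reverse_side, reverse_cos = side_and_alignment((-2, 0), east)
```

Working with the raw cross and dot products avoids calling an inverse trigonometric function per segment, and both values vectorize directly if the segments are stored as arrays.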

...

Contact me for Repo Access ➤

Updated on August 25th, 2025 by @kfukutom