How we analysed ride-sharing location data with Databricks H3 and Tableau

Leveraging H3 for spatial analysis.

Special thanks to Kent Marten from the Databricks team for his input and contribution.

Organisations worldwide that move people and goods are looking for ways to be more efficient. Travellers and riders want more availability and shorter travel times, while delivery services want more predictability. Location intelligence, built on geospatial and routing-based insights, is imperative for any delivery, transportation, or service business. Databricks recently announced support for built-in H3 expressions, and H3 provides a great approach for processing mobility data. In this blog, our analytics team shares how we help customers explore complex geospatial mobility questions.

Here, we examine rideshare movement data for one of our clients in London. As London's rideshare and taxi markets are highly competitive, this customer wants to increase their market share using insights from location data. A ridesharing business succeeds against several goals at once: riders want short wait times, drivers want more rides, and the company wants to expand throughout its target region. Numerous factors affect these goals, including driver placement and behavior, driver retention versus attrition, real-time supply and demand information, and more.

To help our client, we started looking for opportunities to use a driver's time more effectively. What are drivers doing when they do not have an active fare? Why do drivers wait in, and return to, airport queues so long and so frequently? Making drivers more productive would improve customer perception of the service and help retain drivers.

For this project, we focused on approximately 2,000 drivers in London and the surrounding areas. We conducted two types of analysis: behavioral and geospatial analysis to understand driver habits (e.g. idle time, breaks, a preference for taking jobs on the way home), and a comparison of unfulfilled demand against idle drivers' proximity to it. This allowed us to understand and categorize the true extent of the opportunity missed through poor geographical positioning while drivers are idle.

A few of the challenges we faced included aggregating and classifying billions of geolocations from pick-up requests and driver pings. From the data collected by the app, we built a time series model, scoring how each driver performed in each 15-minute period across 24 hours for each day of the week. By applying a series of rules to the ping data received from a driver's app with location tracking, we could determine whether a driver was working, idle/on break, or had a passenger on board. By building this data up over hours, days, and months, we had a foundation in place to categorize every single driver's relative efficiency or inefficiency. Other data quality challenges common to log data at this scale also had to be addressed, such as identifying and removing inaccurate ping locations and bots using the app to retrieve pricing.
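The rule set itself is the client's own, but the idea of deriving a driver's state from each ping can be sketched in plain Python. The status names and the ten-minute gap threshold below are illustrative assumptions, not the actual rules:

```python
def classify_ping(app_status, seconds_since_last_ping):
    """Derive a driver's state from the app status and ping cadence.

    Toy rule set (illustrative): a trip in progress means a passenger
    is on board; long gaps between pings while online suggest a break.
    """
    if app_status == "on_trip":
        return "passenger_on_board"
    if app_status == "online":
        return "idle_or_break" if seconds_since_last_ping > 600 else "working"
    return "offline"
```

In the real pipeline, rules like these run over the full ping stream per driver before any aggregation.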

 

The Data  

We worked with the customer to break down the data requirements into the blocks shown below. Around 1.4 billion rows, covering a year's worth of activity, were rapidly ingested into Delta Lake tables with the help of Databricks. We had previously built a data-profiling solution on Databricks that creates 30+ data points to assess structural and data-related information. This allows engineering teams to quickly make decisions about data ingestion, target data models, and quality issues.

 

Three main data quality issues needed to be addressed during data ingestion and transformation: 

  1. When geolocation measurements are taken at a fixed frequency over a long period of time, anomalous readings are inevitable. We addressed this by filtering out ping-level anomalies using administrative boundary data. 
  2. Bots impersonating customers made numerous requests for pick-up in order to scrape pricing. These records were removed by analyzing the bots' behavior, frequency cycle, and route pattern requests. 
  3. Many customers browse a trip over a 5-to-15-minute period, checking the alternative car types available until satisfied, or lose the booking to other providers. Without more detail, the data cannot distinguish browsing from true demand, so a single customer's repeated browses risk being counted as multiple demand events. 
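The third issue, separating browsing from true demand, can be sketched as collapsing repeated requests from the same user inside a time window. The 15-minute window and the one-event-per-window rule are hypothetical simplifications:

```python
def estimate_true_demand(requests, window_s=900):
    """Collapse repeated pick-up requests from one user within a time
    window (default 15 min) into a single demand event.

    requests: iterable of (user_id, unix_timestamp) tuples.
    """
    last_seen = {}
    demand = 0
    for user_id, ts in sorted(requests, key=lambda r: r[1]):
        # Count only if this user hasn't requested recently.
        if user_id not in last_seen or ts - last_seen[user_id] > window_s:
            demand += 1
        last_seen[user_id] = ts
    return demand
```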

The Solution 

We wanted to provide end-users and key decision-makers with a self-service solution. The resulting solution lets them analyze unfulfilled demand versus idle drivers, and the scale of the missed opportunity, at any point in the day (to see peaks) and at any time of the year (to see seasonality). It also provides the means to examine an individual driver's performance geographically, and their behavior patterns, a year at a time.

Data Science Exploration 

We used Databricks to explore the subsets of the original data and build pipelines used in rearranging them into more usable structures. In order to filter through the data at the user end, we needed to densify the time-series data for each driver. Densifying eliminates the impact of periods of inactivity when no data is provided and enables us to account for the complete 24 hours of a driver’s day.
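Densification can be sketched as expanding sparse observations into every 15-minute slot of the day, so periods with no pings become explicit inactive slots rather than missing rows. The 96-slot layout and the "inactive" fill label are assumptions for illustration:

```python
SLOTS_PER_DAY = 24 * 4  # 96 fifteen-minute slots per driver per day

def densify(observed, fill="inactive"):
    """Expand sparse {slot_index: status} observations into a full
    96-slot day, so gaps with no data are represented explicitly."""
    return [observed.get(i, fill) for i in range(SLOTS_PER_DAY)]
```

With the day fully populated, downstream filters can account for all 24 hours of a driver's time, not just the slots that happened to produce pings.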

We then created a classification of the drivers' times based on the GPS ping frequency and app working statuses.

Classification enables our client to filter the times of the drivers based on their working status and productivity during this time. We created 15 different classification codes. To be classified, the driver must be in that category for 95% of a 15-minute period. This threshold reduced data noise and skew caused by rapidly changing statuses.
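The 95% dominance rule can be sketched as follows; the status names are illustrative, and the real solution distinguishes 15 classification codes rather than raw statuses:

```python
from collections import Counter

def label_window(ping_statuses, threshold=0.95):
    """Label a 15-minute window only if one status accounts for at
    least 95% of its pings; otherwise leave it unclassified to avoid
    noise from rapidly changing statuses."""
    if not ping_statuses:
        return "unclassified"
    status, count = Counter(ping_statuses).most_common(1)[0]
    return status if count / len(ping_statuses) >= threshold else "unclassified"
```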

The ability to visualize a driver's journey over the course of a day is enabled by classification by status and location.

 

The number of jobs a driver completes in a day can then be calculated, making trends easier to identify. The graph below depicts the number of jobs a driver completes in a 24-hour period, with the median number of jobs shown by the orange line.

 

 

Data Profiling 

While processing data on the Delta Lake, we performed data profiling activities: identifying data structures, column order, and data types; counting null records per column; flagging special characters and outliers; and automatically partitioning the Delta Lake tables for faster processing and querying. More than thirty data quality checks were run on the data.

Spatial Data Engineering and Analysis with H3

The data was subjected to a number of analyses and transformations. In order to quickly detect and display unmet demand, idle driver supply, and proximity to an opportunity, the most important component of the solution was to spatially aggregate all drivers' ping data and customer pick-up orders.

Databricks’ built-in H3 expressions make it easy for geospatial data engineers and data scientists to aggregate and visualize geolocation data at scale. We indexed all driver and customer data at H3 resolution 8, where each cell covers an area of approximately 0.7 km². This means all driver pings or pick-up requests within the same roughly 0.7 km² area fall under a unique H3 index, which can be used for exploration, analysis, engineering, and visualization.
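On Databricks, this indexing is a single call to h3_longlatash3string. As a self-contained illustration of the aggregation idea, the sketch below snaps points to a coarse lat/long grid instead of real H3 cells; the cell size and sample points are made up:

```python
import math
from collections import Counter

def grid_cell(lat, lng, cell_deg=0.01):
    """Stand-in for an H3 index: snap a point to a coarse lat/long grid
    cell (roughly 1 km north-south). The real pipeline uses Databricks'
    h3_longlatash3string at resolution 8 instead."""
    return (math.floor(lat / cell_deg), math.floor(lng / cell_deg))

def aggregate_pings(pings, cell_deg=0.01):
    """Count pings per cell - the core of the supply/demand heatmap."""
    return Counter(grid_cell(lat, lng, cell_deg) for lat, lng in pings)
```

Once every ping and pick-up request carries a cell index, demand, idle supply, and their proximity reduce to group-bys and joins on that index.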

For more information on what H3 is and how to use it for geospatial analysis, please refer to the documentation [AWS | ADB | GCP]. 

H3 is effective when used for discrete binning. However, many organizations operate within geographical boundaries and typically need the ability to filter by geography (like a neighborhood or city). We ingested OSM administrative boundary data into the Databricks Delta Lake to support this. We joined the boundaries using the center-point (lat/long) of each H3 cell at resolution 8. We achieved this by utilizing h3_centeraswkt and h3_longlatash3string functions available on Databricks Photon Runtimes; also, we used st_geomfromwkt, st_astext and st_intersects functions from Databricks Labs project Mosaic.  The taxi data was subsequently merged with this boundary data so that analysis could be performed at both a county and city level depending on the requirements.
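The centroid-in-boundary join can be illustrated with a plain ray-casting point-in-polygon test, a pure-Python stand-in for the st_intersects call done with Mosaic on Databricks:

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is point (x, y) inside a polygon given as a
    list of (x, y) vertices? Used here to decide whether an H3 cell's
    centre point falls within an administrative boundary."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does the edge cross the horizontal ray to the right of the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside
```

Running this once per cell centroid tags every H3 cell with its county and city, after which all ping and demand aggregates inherit the boundary labels through the cell index.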

We used kepler.gl to visualize the H3-based analysis. The purpose of these maps was to see the demand at any given point in time. The darker the hexagon, the greater the demand.

 

Powering Tableau Self-Service with Databricks SQL Warehouse  

For our client, we used Tableau to build self-service data products powered by Databricks SQL Warehouse. The ability to query large-scale data live from Databricks, combined with Tableau's ability to turn the most complex questions into visual insight, has benefits beyond the scope of this blog.

We were able to query live Delta tables containing hundreds of thousands of rows from Tableau in seconds.

The visualization below displays the H3 cells in Tableau using the live SQL endpoint. The map highlights the key insight: unfulfilled demand versus idle driver supply. The greater the density shown by the heatmap, the more idle supply is available in that area, allowing business users to easily identify areas of excess supply. The complementing charts are shown at a granularity of 15 minutes, making it easy to see trends in demand versus actual bookings, total resources, and idle resources. The heatmap also shows hotspots around airports and major railway stations, which mirrored our intuition.

In the next dashboard, we allow for analysis of a single driver for a month showing working behavior and geographically where the driver spends time vs demand. The Gantt chart at the bottom allows you to track how a driver spent working hours at specific times of interest. This insight allows our client to work at a micro level and educate new drivers on geographical demand areas during different times of the day.

 

Other Applications of this Geospatial Solution

This blog has focused on ride-sharing use cases; however, the solution can be applied across other sectors.

In emergency response situations, for example, when an incident is reported for immediate assistance, the time it takes the nearest deployable resource to arrive at the scene is critical. Using the Databricks-powered solution, we can map the location of the incident(s) and the locations of the available deployable resources in the same visualization. Just as a ride-hailing company wants its drivers positioned to serve customers efficiently, emergency responders want their resources deployed to the most effective locations so they can reach the incident as soon as possible.

While an incident can happen at any time and in any location, there are patterns of accidents and crimes that can be tracked using predictive analytics. If certain areas show a consistently high incidence of calls within certain time periods, this information can be captured and passed on to the relevant response teams' control rooms so that resources can be deployed proactively.

Given the rise in fast-moving consumer goods (FMCG) and delivery services from all retailers (particularly in the food sector), there is a clear need for location intelligence. Data on where and when customers order can be analyzed to improve delivery efficiency. It is critical to understand how many drivers are required at what times, such as during peak demand. Real-time situational awareness can be achieved by effectively collecting massive amounts of data with Databricks and visualizing it in Tableau.
There are many more use cases to consider.

In Summary

Our customer was able to unlock the power of their data through an intuitive, effective, and scalable solution, thanks to the ability to geographically aggregate and query data at scale using Databricks.

Using this solution, decision-makers can be proactive rather than reactive. They can determine the geographic areas of both unfulfilled demand and idle supply at the touch of a button, with the ability to join the two datasets to minimize missed opportunities. It can also improve customer experience by lowering ETAs and increasing driver revenue. This can be accomplished by using exploratory data science and question-driven visualizations to target areas where demand outnumbers supply at any given time.

The behavior of the best drivers can also be analyzed and understood, and this knowledge passed on to new drivers. With a real-time approach, drivers will no longer have to second-guess demand; they can be nurtured and nudged to travel to specific locations at specific times based on predictive demand surfaced in the driver app. In the next phase of the project, we intend to explore streaming data with Databricks.

Furthermore, the analytics provided by the data may not be limited to resource managers. Insights may be made available to drivers so that they can make informed decisions about when they make themselves available for work and where they position themselves in order to be assigned to a job.

To learn more from the code used to classify, explore, and visualize geospatial data in Databricks, refer to these notebooks: 1-Classification, 2-Data-Exploration, and 3-Kepler-H3-Visualization.

About the authors

Kent Marten
Geospatial Staff Product Manager at Databricks
Kent is a Staff Product Manager at Databricks, responsible for crafting the vision and roadmap for all things geospatial.

Lenka Hasova
PhD, Geospatial Data Science and Research Consultant

Mark Balcer
Lead Consultant