This series of posts aims to provide a high-level overview of how machine learning can provide more robust location data products while reducing costs and enhancing privacy.
Most products based on location data serve insights into human mobility and are based on fairly simple technical methods. For example, a common workflow for a product that estimates foot traffic to a retail store or other venue may look like this:
The figure below shows a standard workflow for estimating foot traffic to a location from GPS data: raw GPS data is filtered so that only stay points remain, those stay points are clustered together, and supply correction and extrapolation are then applied to the clusters to derive an estimated person count for the store.
More sophisticated products within the industry bring more context, like home and work locations or area demographics, into the metric. However, the flow is always the same: pre-process the raw data, cluster individual data points into dwelling events, correct for technical problems in the data, and aggregate all dwelling events in an area.
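The steps above can be sketched in a few lines of Python. The device IDs, thresholds, and market share below are illustrative assumptions, not a production pipeline:

```python
from collections import defaultdict
from math import hypot

# Toy GPS pings: (device_id, day, x, y) in arbitrary planar coordinates.
pings = [
    ("dev1", "2023-06-03", 0.0, 0.0),
    ("dev1", "2023-06-03", 0.1, 0.1),  # near the previous ping -> same stay
    ("dev1", "2023-06-03", 5.0, 5.0),  # far away -> in transit, discarded
    ("dev2", "2023-06-03", 0.2, 0.0),  # a single ping, no confirmed stay
]

STAY_RADIUS = 0.5  # assumed clustering threshold

def cluster_stays(points):
    """Greedily group pings into stay points within STAY_RADIUS."""
    stays = []
    for x, y in points:
        for stay in stays:
            if hypot(x - stay[0][0], y - stay[0][1]) <= STAY_RADIUS:
                stay.append((x, y))
                break
        else:
            stays.append([(x, y)])
    # A real pipeline would also require a minimum dwell time.
    return [s for s in stays if len(s) >= 2]

# Group pings per device and day, cluster, and count devices with a stay.
by_device = defaultdict(list)
for dev, day, x, y in pings:
    by_device[(dev, day)].append((x, y))

daily_count = sum(1 for pts in by_device.values() if cluster_stays(pts))

# Extrapolate from the panel to the full population.
MARKET_SHARE = 0.05  # assumed fraction of all people present in the panel
estimated_visits = daily_count / MARKET_SHARE
print(daily_count, estimated_visits)  # 1 observed device -> ~20 visitors
```

Real pipelines add a dwell-time requirement, venue polygons for assigning stays to stores, and far more careful extrapolation, but the shape of the flow is the same.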
This approach is simple but effective. It allows for very accurate estimates of foot traffic, especially when someone is interested in patterns over time. The technical sophistication, and the mostly proprietary part, lies in the supply correction, as a simple aggregation would be highly affected by the underlying issues in supply.
Looking at the chart below, impacted aggregated data (orange) can be corrected with sophisticated supply correction techniques to derive a useful signal (cyan).
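The kind of correction shown in the chart can be illustrated with a toy panel-size normalization. All numbers here are invented: the observed series dips when a data supplier drops out, and scaling each day to a constant reference panel size recovers a stable signal:

```python
# Raw daily visit counts, distorted on days 3-4 by a supplier dropping out.
observed = [100, 102, 55, 58, 105]
panel_devices = [10000, 10200, 5400, 5600, 10300]  # active devices per day

# Scale every day's count to the first day's panel size.
baseline = panel_devices[0]
corrected = [obs * baseline / panel for obs, panel in zip(observed, panel_devices)]

print([round(c, 1) for c in corrected])  # roughly flat around 100
```

Production supply correction is far more involved (suppliers change gradually, panels are biased geographically), but this is the core idea: divide out the panel's size before trusting the trend.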
Location data limitations
Even though the above methodology works, it still comes with significant limitations:
- Supply is constantly changing, which requires continuous improvements and new product versions.
- Acquiring and storing all the device-level data over time comes with high costs.
- The public reputation of companies working with this data is low, and for privacy reasons the volume of available data is decreasing.
The general setup of buying location data in its raw form and re-selling it as some sort of derivative is not a viable path forward, and it will degrade the robustness and quality of existing location data products.
A scalable, long-term quality product requires that aggregation already happens on the first-party data owner's side. That significantly reduces issues around data privacy, supply inconsistencies, manipulated data, and storage costs.
Aggregating data on the first-party side is a win-win for everyone, but it raises new questions: how can we build a product on already-aggregated data? How do we handle deduplication, assign data to locations, or estimate foot traffic to a store? The answer is machine learning!
What is machine learning?
There are various great introductions to the basics of AI and machine learning (like this one), and a simple internet search (or asking ChatGPT or Bard) will provide a better answer than this story will. However, to make it super intuitive and easy:
Machine learning allows an artificial system to learn relationships within data without explicit human instruction.
It is important to note that the number of input features is not limited to just one. In fact, machine learning usually uses a lot of features to train robust relationships. The benefits are manifold. For instance, when we think about our aggregated data problem coming from 1st party data providers, machine learning would allow us to learn relationships between those aggregates and a given target we would like to estimate (e.g., foot traffic to a store).
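As a minimal illustration of "learning a relationship", here is a one-feature fit between an aggregated signal (say, devices observed near the store) and a known visit count. The numbers are made up and the fit is plain ordinary least squares rather than anything a production model would use:

```python
# Hypothetical training data: aggregated devices observed near the store
# (input feature) versus the known number of store visits (target).
nearby = [10, 20, 30, 40, 50]
visits = [52, 98, 151, 205, 249]

n = len(nearby)
mean_x = sum(nearby) / n
mean_y = sum(visits) / n

# Ordinary least squares for visits = a + b * nearby.
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(nearby, visits)) / sum(
    (x - mean_x) ** 2 for x in nearby
)
a = mean_y - b * mean_x

# Once the relationship is learned, the aggregate alone yields an estimate.
print(round(a + b * 35))  # ~176 visits estimated for 35 nearby devices
```

Real models use many features and nonlinear learners, but the principle is the same: given enough (aggregate, target) pairs, the model learns the mapping and no device-level data is needed at prediction time.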
How to use machine learning with location data
There are several ways to use machine learning with location data that fit a range of industries and use cases. To keep things focused, we’ll pick one and use it as an example: estimating foot traffic to a store.
Estimating foot traffic to a store
To make things more intuitive, we use a case study based on GPS data from mobile devices. The aim is to develop a reliable, high-quality product that tells customers how many people visited a specific store each day. This is a very useful insight for companies interested in a competitor's store performance or in site selection.
The current state-of-the-art methodology
As of today, companies that estimate store traffic from GPS data either work directly on raw device-level data or aggregate that raw data and correct for supply fluctuations. Both approaches represent the current state of the art, and both rely on high data volumes; with low data volume, this methodology is limited.
When data volumes are high enough, both product methodologies (device-level and aggregated) do work, and the major concerns are data privacy, supply fluctuations, cost, and trust in the data supply. However, when the data volume is low, or the store is located in an area with a generally low market share, simple aggregation does not allow for a product, since it would always end up with “0” counts. Given the general decrease in available location data, this is already a problem for the industry.
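A quick back-of-the-envelope calculation shows why low volume forces aggregate counts toward zero. The market share and visit numbers are invented, and the Poisson assumption is a simplification:

```python
from math import exp

market_share = 0.001      # assumed panel penetration in the area
true_daily_visits = 300   # hypothetical real visitors per day

# Expected number of panel devices seen at the store on a given day.
expected_observed = true_daily_visits * market_share
print(expected_observed)  # 0.3 devices per day

# Under a Poisson assumption, the chance a day reports zero devices:
p_zero = exp(-expected_observed)
print(round(p_zero, 2))  # 0.74 -> roughly three out of four days show "0"
```

With most days reporting zero, no amount of scaling or supply correction can recover a daily signal from the aggregate alone, which is exactly where a learned model earns its keep.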
The better way: Using a machine-learning model
Keeping in mind the conditioning example from before, a machine-learning model simply learns relationships between conditions. Just as the dog learns that raising a paw leads to a reward, a machine-learning model can learn that when more people are close to a venue, there are most likely also more people inside it.
Machine learning models allow us to learn the relationship between the foot traffic surrounding a venue and the foot traffic inside it. This methodology holds even when the data supply volume is low.
In other words, the purpose of machine learning is to train a relationship (or model) that describes how foot traffic inside a store changes based on fluctuations in traffic outside the store. For example, imagine that on a given Saturday a grand opening brings twice as many people close to the store as on a regular Saturday. In that case, it is very likely that more people will also make their way into the store.
Of course, the relationship between foot traffic outside the store and foot traffic inside it need not be linear. And that is not the only relationship a model can learn. Just think about it: what else affects foot traffic to a store and can be measured? Essentially, any data that relates to store traffic improves the quality of the model. A few datasets that enhance those relationships are precipitation, area population, demographics, day of the week, holidays, and many more.
Machine learning is capable of using all these different datasets and combining them into a single model that describes how foot traffic inside a store changes based on data describing its surroundings.
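To sketch how several signals combine in one model, here is a tiny multi-feature fit. The rows, features, and linear form are all invented for illustration; a real product would use a nonlinear learner (e.g. gradient-boosted trees) on far more data:

```python
import numpy as np

# Each row: [surrounding foot traffic, precipitation (mm), weekend?, holiday?]
X = np.array([
    [120, 0.0, 0, 0],
    [135, 2.5, 0, 0],
    [210, 0.0, 1, 0],
    [190, 5.0, 1, 0],
    [250, 0.0, 1, 1],
    [110, 8.0, 0, 0],
], dtype=float)
y = np.array([60, 58, 130, 110, 170, 48], dtype=float)  # visits inside the store

# Add an intercept column and fit a linear model as a stand-in for a
# more expressive learner.
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# Estimate visits for a rainy, non-holiday Saturday with high nearby traffic.
new_day = np.array([1.0, 220.0, 3.0, 1.0, 0.0])
print(float(new_day @ coef))
```

The point is not the specific model but the interface: every contextual dataset becomes a column, and the model learns how each one moves the in-store count.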
Even small changes in supply volume can have a massive negative impact on an aggregated data product without proper correction. We at Unacast specialize in supply correction of GPS and telco data and have automated functions that correct the data daily.
That said, it is essential to point out that scalable, long-term products require that aggregation already happens on the first-party data owner's side before machine learning is applied.
Our machine-learning models learn the relationship between the foot traffic inside a venue and the foot traffic surrounding it. This methodology holds even when the data supply volume is low. Our technology is safe and secure.
But even though machine learning offers a lot of opportunities, it cannot solve everything, and it comes with limitations that need to be addressed. We will cover that side of things in Part 3: Nothing is perfect, so what are the Pros and Cons?