
Alteryx Designer Desktop Discussions


Geospatial clustering algorithm with constraints on maximum distance and total premises

olly_jones_98
6 - Meteoroid

Hi,

I am currently faced with the following challenge:

  • My data (attached) contains hexagonal grids within a country (based on Uber's H3 system)
  • Each of these hexagons has been allocated a number of premises (i.e., the estimated number of premises within the hexagonal polygon) as well as area and premises density data (calculated using spatial info)
  • What I need to be able to do is form contiguous clusters of these hexagons, with constraints on the total premises a cluster can contain and on the maximum distance of any hexagon's centroid from that of the cluster 'core', which should be the most premises-dense hex in the cluster.

I have written a clustering algorithm which should yield appropriate results; however, it is extremely computationally expensive and will take days to run. Does anyone have experience with this type of problem? I have struggled to find solutions elsewhere.
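For what it's worth, if each row carries its H3 index, the centroid and neighbour lookups can be done purely from the index rather than with spatial objects, which is usually where the time goes. A minimal Python sketch, assuming h3-py v3 naming and made-up file/column names:

```python
import pandas as pd
import h3  # h3-py v3 naming (h3_to_geo / k_ring); v4 renames these functions

# Assumed file and columns: hex_id (H3 index string), premises (count per hex)
df = pd.read_csv("hex_premises.csv")

# Centroid of each hexagon as (lat, lng), computed from the index alone
df[["lat", "lng"]] = df["hex_id"].apply(lambda h: pd.Series(h3.h3_to_geo(h)))

# Contiguity: a hex's immediate neighbours are just its k-ring of radius 1,
# so adjacency can be tested without any polygon-intersection work
neighbours = {h: set(h3.k_ring(h, 1)) - {h} for h in df["hex_id"]}
```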

 

5 REPLIES
Qiu
21 - Polaris

@olly_jones_98 
I think we have spatial experts here, but could you elaborate on your objective in "plain" words? 😁

KGT
8 - Asteroid

OK, this is a fun one, but I don't have time right now to tackle it. Maybe later tonight... I'll say straight out that I'm not a spatial expert, but I have done a bit with spatial data over the years.

 

One of the key facets of this type of problem is setting up the question correctly. Is this a predictive problem, or is it a location optimisation? In this case, as the clusters can take any shape rather than being chosen from a set of candidate clusters, it looks more like a predictive problem. I expect you've got an idea of the number of clusters, whether it's closer to 10 or 1,000. Measuring distance from the most populous hex, rather than from the centre of the cluster, will be another issue to overcome and may make this a multi-stage refinement.

 


  • What I need to be able to do is form contiguous clusters of these hexagons, with constraints on the total premises a cluster can contain and on the maximum distance of any hexagon's centroid from that of the cluster 'core', which should be the most premises-dense hex in the cluster.

So, they have to be contiguous, but the cluster core is determined by the most premises-dense hex. I would need to play with it a bit to see what works, but my initial investigation would include:

  • Sampling: I would use a Map Input to draw a shape around an area (London for instance) and spatially match that to the hexagons to sample a set that makes sense. This will enable playing with the data in a quicker timeframe.
  • K-centroids clustering. Create a centroid (as X,Y) for each hexagon and use K-centroids to cluster them on distance. This will require some testing to play with the number of clusters. As you increase the number of clusters, processing time will increase dramatically. 
    • Putting weighting in based on the premises somehow would be the key. It could be done brute force by creating multiple records to represent multiple premises, but that would definitely require rounding (a lighter alternative is sketched just after this list).
  • The final solution will most likely involve the Summarize tool to group the clusters, as it has lots of spatial options.
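On the weighting bullet above, an alternative to duplicating records is to pass the premises count as a sample weight, which scikit-learn's KMeans accepts directly. A rough sketch (column names and the cluster count are assumptions; this handles the weighting only, not the premises cap or contiguity):

```python
import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv("hex_premises.csv")  # assumed columns: lat, lng, premises

k = 200  # guess for the sampled area (e.g. London); tune by inspection
km = KMeans(n_clusters=k, n_init=10, random_state=0)

# sample_weight lets premises-heavy hexes pull the centroids toward them,
# without creating one record per premise
df["cluster"] = km.fit_predict(df[["lat", "lng"]], sample_weight=df["premises"])
```

Treating lat/long as Euclidean is a simplification; projecting to a metric CRS first (or using the X/Y produced by Create Points) would be better, and the distance and premises caps would still need a post-pass.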

My theory is that if we created something for London, it could then be expanded to the rest. Purely anecdotally, I would expect denser areas to hit the max-premises constraint rather than the max-distance one (and vice versa for sparse areas), so analysing the resulting clusters will be interesting.

 

One last thought: when clustering, the algorithms compare everything with everything, so if you divided the data into X groups with overlapping segments you could get faster processing and a rough answer. There is no point comparing groups that are never going to end up in the same cluster (Scotland and London, for instance).
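One numeric way to get that divide-and-conquer behaviour with H3 data is to group the fine hexes by a coarser parent cell and only ever cluster within a coarse cell plus a ring of its neighbours (the overlap). A sketch, again assuming h3-py v3 naming; the resolution choice is an assumption:

```python
from collections import defaultdict
import h3  # h3-py v3 naming (h3_to_parent / k_ring); v4 renames these functions

COARSE_RES = 4  # assumption: coarse enough that a whole cluster fits in one cell plus its neighbours

def partition_key(hex_id: str) -> str:
    """Coarse parent cell used to split the country into independent work units."""
    return h3.h3_to_parent(hex_id, COARSE_RES)

def build_index(hex_ids) -> dict:
    """Group the fine hexes by their coarse parent cell."""
    idx = defaultdict(list)
    for h in hex_ids:
        idx[partition_key(h)].append(h)
    return idx

def work_unit(coarse_cell: str, hexes_by_parent: dict) -> list:
    """All fine hexes in this coarse cell plus its neighbouring coarse cells,
    giving the overlap so clusters near a boundary are not cut off."""
    return [h for c in h3.k_ring(coarse_cell, 1)
            for h in hexes_by_parent.get(c, [])]
```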

Hopefully, that provides some ideas at least.

olly_jones_98
6 - Meteoroid

The core of the problem is thus:
What is the most efficient way to cluster hexagonal polygons (each containing a population count, i.e., a number) so that the entire grid is split into clusters, where each cluster has a core (the most densely populated hexagon in that cluster), no hexagon in a cluster is more than X km (say 20 km) from its cluster core, and the total number of premises in a cluster does not exceed Y (say 40,000)?

This is the basis for the problem, and I have solved it using an iterative macro (which I will upload as a reply later); however, the iterative macro is not able to do the following two things:
1) Run efficiently - each iteration for a single small country like the UK takes over 5 hours, even when running on smaller subsets such as individual regions and then combining them in post-processing

2) Target a specified number of clusters - I do broadly know the target number of resulting clusters, so if there is a methodology to bring this constraint in that would be ideal; at the moment it is just trial and improvement on the parameters (or crude post-processing) to reach a sensible number of clusters
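Not a replacement for the macro, but for discussion's sake, the logic described above can be written as a greedy seeded region-grow: take the densest unassigned hex as a core, then absorb neighbouring hexes in density order while both caps hold. A rough Python sketch (h3-py v3 naming; file and column names are assumptions, and being greedy it cannot hit an exact target cluster count, which would need tuning of the caps or a merge/split post-pass as noted in point 2):

```python
import heapq
import math
import pandas as pd
import h3  # h3-py v3 naming (h3_to_geo / k_ring); v4 renames these functions

MAX_KM = 20.0          # max centroid distance from the cluster core
MAX_PREMISES = 40_000  # max total premises per cluster

def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lng) pairs."""
    lat1, lng1, lat2, lng2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

df = pd.read_csv("hex_premises.csv")  # assumed columns: hex_id, premises, density
premises = dict(zip(df["hex_id"], df["premises"]))
density = dict(zip(df["hex_id"], df["density"]))
centroid = {h: h3.h3_to_geo(h) for h in premises}  # (lat, lng) per hex

unassigned, cluster_of, cluster_no = set(premises), {}, 0
while unassigned:
    # Core = most premises-dense unassigned hex; it seeds a new cluster
    core = max(unassigned, key=density.get)
    cluster_of[core] = cluster_no
    unassigned.discard(core)
    total = premises[core]
    # Max-heap on density so the densest frontier hex is absorbed first
    frontier = [(-density[n], n) for n in h3.k_ring(core, 1) if n in unassigned]
    heapq.heapify(frontier)
    while frontier:
        _, hx = heapq.heappop(frontier)
        if hx not in unassigned:
            continue
        if total + premises[hx] > MAX_PREMISES:
            continue  # premises cap: skip, the hex stays free for a later cluster
        if haversine_km(centroid[core], centroid[hx]) > MAX_KM:
            continue  # distance-from-core cap
        cluster_of[hx] = cluster_no
        total += premises[hx]
        unassigned.discard(hx)
        for nb in h3.k_ring(hx, 1):  # growing only through neighbours keeps clusters contiguous
            if nb in unassigned:
                heapq.heappush(frontier, (-density[nb], nb))
    cluster_no += 1
```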

olly_jones_98
6 - Meteoroid

Hi KGT, 
Thank you for the reply. I will say up front that I am not a data scientist either, so I may not interpret some terminology correctly at times. However, I am fairly certain this would be classified as a predictive regionalisation problem.

I do know the ideal number of clusters in this instance, yes, but across a country it is more in the region of 5,000–10,000 clusters (hence the heavy processing).

The challenge I have found with k-means-based approaches (and similarly with other clustering algorithms in Python-based solutions) is that they are difficult to constrain correctly by total premises. The brute-force suggestion is interesting, but with the total population count being in the 30-million range, processing that many spatial objects would be extremely computationally expensive.

 

On the final point, one interesting idea I did have: if we kept my current iterative approach but defined the centroid of the hexagon being considered in a given iteration, we could use a lat/long range to filter out, purely numerically, any hexes that fall outside the boundary, and so slim down the computational requirement?
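That pre-filter can indeed be purely numerical: convert the radius to a degree range around the core's centroid and filter on the lat/long columns before any spatial tools run. A small sketch of the idea (the 20 km figure and the lat/lng column names are assumptions carried over from above):

```python
import math
import pandas as pd

def bbox_prefilter(df: pd.DataFrame, core_lat: float, core_lng: float,
                   radius_km: float = 20.0) -> pd.DataFrame:
    """Cheap numeric pre-filter: keep only hexes whose centroid lies inside a
    lat/long bounding box around the core, before any exact distance work."""
    dlat = radius_km / 111.0                                       # ~111 km per degree of latitude
    dlng = radius_km / (111.0 * math.cos(math.radians(core_lat)))  # degrees of longitude shrink with latitude
    return df[df["lat"].between(core_lat - dlat, core_lat + dlat)
              & df["lng"].between(core_lng - dlng, core_lng + dlng)]
```

In Designer terms this amounts to a Filter tool expression on the centroid fields rather than a Spatial Match, which is where the saving would come from.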

olly_jones_98
6 - Meteoroid

This is the output of my current iterative process for a single region.
