This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
There was a recent question on the Alteryx forum: How to use the percentile in summarize. The question misunderstands the percentile function in the summarize and is looking for something slightly different, although with similar math. So what does the percentile in the Summarize tool do? From the help:
Percentile: Calculates the specified percentile value for the group. The percentile is calculated by sorting the data and returning the row value relative to the specified percentile and its position in the sorted array – the largest value is the 100th percentile, lowest value is the 0 percentile, median is the 50th percentile, the 25th percentile is the value in the middle of the median and minimum, etc.
So in short, in this case, you specify the percentile you want, like the 80th percentile, and it returns you the value that 80% of the other values are below.
What this commentator is asking for is the reverse. They have a column of values and want to know the percentile of each value. Mostly this is a matter of sorting and then numbering all the records appropriately. I built a quick macro to make this process simple – if you want to skip to the answer, you can download it here.
This is a pretty simple macro. Since calculating the percentile involves sorting the records by the field in question, I added some complexity to avoid changing the resulting sort order from the macro. Before sorting I add a recordID to capture the original sort order so that I can re-sort of it at the end. I also add a __ to the beginning of all the fields I use internally. The purpose of that is to make it less likely that I will stomp on a consumer’s fields in the output.
There are a few subtleties of the algorithm. Percentile is defined as the percentage of records not including the current one that are lower than the current record. This means that the lowest record has a percentile of 0 and the highest 100. For each record, the basic formula is:
The reason for the -1 is that the current record does not count in the total. If the current record counted in the total, then the highest value would no longer have a percentile of 100.
The other complexity is that values that are the same should result in the same percentile. For this reason I apply the formula in a multi-row formula so I can look at the previous row and, if the same, assign the same percentile.
This is a great example of how flexible the macro system is in Alteryx. Almost any function that isn’t there can easily be added as a macro. Once again, if you just want to use the macro, you can download it here.