I started working through this on 3 Jan 2022, and I have come to believe that the data is not sufficient to answer all of the questions.
The input spreadsheet is in a very strange format, but I don't think that is much of a problem. The main issue, I believe, is that the data is already aggregated, which makes it unfit to answer the questions posed.
The values in the data are the probability of a particular resolution given a particular demographic category. That is, if we restrict ourselves to people in one demographic category, what is the probability that a person chose a particular resolution? This is evident because the values for any one demographic category, summed over all the resolutions, come to 1 (rounding means this isn't always exactly true, but that's the idea).
The questions to be answered are:
1. What were the top 3 New Year's resolutions for 2019?
2. What percentage of fitness-related resolutions (exercising more, losing weight, eating healthier and improving health) were made by suburban men and women?
3. Which group of people were most likely to keep their resolutions in 2018?
1 - "Top" in what sense? Putting aside that everyone only got to choose one resolution (which is a mite strange if you ask me), the data doesn't tell us the actual number of people in each group. We can't work out what percentage of all the people chose a particular resolution when all we know is what percentage of each demographic chose it. The answer depends on how much of the population was in each demographic category. For example, a population spread evenly across the geographic regions can give a different answer than one that is highly skewed.
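To make the point about question 1 concrete, here is a small Python sketch. Every number in it is made up for illustration (none come from the challenge spreadsheet), and the region names, resolution names, and function names are my own assumptions. It just shows that the same conditional percentages can produce a different "top" resolution depending on the assumed population split.

```python
# Hypothetical values only: P(resolution | region). Each region sums to 1.
p_res_given_region = {
    "urban": {"exercise more": 0.50, "save money": 0.30, "read more": 0.20},
    "rural": {"exercise more": 0.20, "save money": 0.30, "read more": 0.50},
}


def overall_shares(region_weights):
    """Weight each region's conditional shares by an assumed P(region)."""
    totals = {}
    for region, weight in region_weights.items():
        for resolution, p in p_res_given_region[region].items():
            totals[resolution] = totals.get(resolution, 0.0) + weight * p
    return totals


def top_resolution(region_weights):
    """Return the resolution with the largest overall share."""
    shares = overall_shares(region_weights)
    return max(shares, key=shares.get)


# Two assumed population splits. The spreadsheet gives us neither.
mostly_urban = {"urban": 0.8, "rural": 0.2}
mostly_rural = {"urban": 0.2, "rural": 0.8}

print(top_resolution(mostly_urban))
print(top_resolution(mostly_rural))
```

With these invented numbers the two splits disagree on the top resolution, even though the per-region percentages never changed.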
2 - The question asks "What is the probability of being a suburban man/woman given that the person already made a fitness-related resolution?". The information provided is the probability of having made a fitness-related resolution given that the person is a suburban man/woman. Those two probabilities are the reverse of each other, and getting from one to the other is a classic application of Bayes' Theorem. Unfortunately, Bayes' Theorem requires knowing the overall probability of being a suburban man or woman in this context (and the overall probability of a fitness-related resolution, which itself depends on the demographic mix), and we are not given that information. We might assume a particular value, but without more information about the data we can't know how close our assumption is to the reality of the sample.
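Here is a rough Python sketch of the Bayes' Theorem step for question 2. Again, all values are hypothetical and the names are my own. The function takes the conditional probabilities we do have, P(fitness | location), plus an assumed P(location) that the spreadsheet does not supply, and returns P(suburban | fitness).

```python
# Hypothetical values only: P(fitness-related resolution | location).
p_fit_given_loc = {"urban": 0.40, "suburban": 0.50, "rural": 0.30}


def p_suburban_given_fitness(p_loc):
    """Bayes' Theorem: P(sub | fit) = P(fit | sub) * P(sub) / P(fit).

    p_loc is an ASSUMED distribution P(location); it must sum to 1.
    """
    # Law of total probability gives the overall P(fit).
    p_fit = sum(p_fit_given_loc[loc] * p_loc[loc] for loc in p_loc)
    return p_fit_given_loc["suburban"] * p_loc["suburban"] / p_fit


# Two assumed population splits, neither known from the data.
even = {"urban": 1 / 3, "suburban": 1 / 3, "rural": 1 / 3}
few_suburban = {"urban": 0.6, "suburban": 0.1, "rural": 0.3}

print(round(p_suburban_given_fitness(even), 3))          # about 0.417
print(round(p_suburban_given_fitness(few_suburban), 3))  # about 0.132
```

The answer moves from roughly 42% to roughly 13% purely because of the assumed split, which is why I don't think the question can be answered from the spreadsheet alone.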
3 - The question, restated, is: given the probability of "keep" (or "yes") for each demographic group, which group has the greatest value? This can be read straight off the spreadsheet because it is exactly what the spreadsheet gives us.
I've mostly enjoyed the Alteryx community challenges so far. This one, though, doesn't seem doable to me without more information.
I'm interested in other perspectives. What do you think?