I am convinced one of my main problems is….I don’t know the right vocabulary, so when I search for a solution, I am not asking the right question.
So forgive me.
CHALLENGE:
I have a list of businesses with various data attached to each. In this case, Walmarts in Wyoming.
I want each row to represent ONE physical location. However, the data I have breaks down the subsegments of Walmart (Pharmacy, bakery, etc) I want to agg them up in a certain way.
full_walmart_list is an example of the complete file. You will see there are 51 records, but analyzing the zip/lat/long reveals there are only 7 walmarts.
Final_walmart_list is what I want as the result of my workflow.
Here are some ideas/assumptions/issues:
- I think I need to distinguish walmarts by looking at zip/lat/long. I can not just rely on lat/long (and yes that is unfortunately the level of detail on lat/long I have). If I did just lat/long, the first 14 records would be considered 1 walmart – but they are in different zips. And that’s true. In Cheyenne, there is a Walmart 5 miles away from a Walmart.
- Likewise, I am guessing I can’t rely on JUST zips – as there are some big ip codes in the U.S.
- I want to add up all the EMPLOYEES, REVENUE and COUNTS of Walmart departments. This is represented in columns G,H,I in final_walmart_list.
- I want to collect the NAICS code (business code) of all the departments I did add up. This is represented in columns J-R in final_walmart_list.
- Now the trick here is……what ONE record do I choose to “represent” the business in its entirety? Here’s what I came up with – use the record that has the most employees. Doesn’t always work – but don’t know what else to do.
- Remember – I am trying to do this with EVERY business in the US – so it’s not just walmarts in Wyoming.
I wouldn’t even mind anyone pointing me in the right direction (which tools to use)
Thanks!
Corey