When you have a brand new dataset, it can feel like you’re exploring the wilderness. You might have all the tools you need, but… where do you start?
Source: GIPHY
Today, I’d like to take you on an adventure to explore the unknown with Alteryx Designer Cloud, Designer Experience. From the wilderness to the cloud, we will examine our dataset and delve into the Designer Cloud features that enable speedy data cleaning and investigation.
Uploading the data
To start, I uploaded the CSV file I want to explore in the Data tab on the Analytics Cloud Platform.
Then, I created a new workflow in Designer Cloud, Designer Experience. I pulled in an Input Data tool and selected my 2023 Car Dataset.
Data Cleansing
Designer Cloud showed me a sample of my data instantly in the Sample Results window. That's a lot of red! After hovering, I could see this meant there were leading and trailing spaces.
Source: GIPHY
Time to pull in a Data Cleansing tool and select “Leading and Trailing Whitespace” to remove.
If I wasn’t familiar with that tool, I could have clicked the column header in the Sample Results (see red arrow). When suggestions are turned on (circled below), this panel contains helpful options for transforming my data. Hovering over each suggestion will give you a preview of what the transformation will do to your data.
I also noticed some fields I wanted to treat as numbers with letters in them, like the Acceleration field. I used the Suggestion window to configure my tool correctly.
Then, I changed the datatype from string to float in the Sample Results. When I clicked on “Show Profiling,” I could see the distribution of this numerical variable and hover over values to see how many fall in that range!
The customer rating field is another one I want to make numeric. I saw that the first few ratings in the dataset were out of 5 (i.e., 4.7/5, 4.5/5, etc). I wondered… is this true for the whole dataset?
After I added a Text To Columns tool to my workflow and split the column on the “/” delimiter, I could see that, in fact, 99% of the ratings were out of 5. The other 1% had null values.
Note: the dataset I used for this analysis was small (only 164 rows), but if it was much larger, only the first 10MB would show in the Sample Results window.
The resampling functionality is also helpful after you’ve added more tools to your workflow. If you have a Join or Filter tool that outputs little data to the Sample Results, resampling can help you see more results and understand how the transformations affect the whole dataset.
Data distributions and outliers
After cleaning the data and fixing data types, I looked at the distributions of my numeric variables and identified outliers.
When you see an outlier during your expedition (Source: GIPHY)
When looking at the rating column, I noticed an outlier on the low end of the ratings. I clicked on the small bar on the column profiling and added the suggested filter to examine this record. When I clicked “apply,” the filter tool was added to the workflow, and I could see that it was the Mazda CX-5 that was rated the lowest in this dataset.
Following a similar approach, I identified a high outlier in the mileage column. Turns out, it was a Tesla Model 3. Depending on the goals of my analysis after this, I might exclude that value since comparing electric cars to gasoline cars is like comparing apples and oranges for some metrics.
There was also an outlier on the high end for torque, which was (no surprise) a Camaro.
Once my data types were correct and my data was clean, the profiling feature in the Sample Results window made my data exploration a breeze. I could spot outliers, investigate the records, and filter them out quickly if I wanted to.
Conclusion
In a short period, I learned a lot about my dataset using Designer Cloud. I created a workflow for the EDA through a combination of bringing in tools and using suggestions in the Sample Results to insert tools for me. When I got stuck, the profiling features helped me identify my next step. Finally, one of the most exciting parts—I did all of this analysis in my browser on my Mac!
If you would like to start your data exploration adventure in the cloud, you can sign up for a 30-day free trial here.
If you get lost on your expedition, visit the Designer Cloud discussion forum to get your questions answered. And one more tip—the Designer Cloud interactive lessons are a great way to accelerate your learning!
See you in the cloud! (Source: GIPHY)