HI newbie here. Hello, team members. I am working on a file that should be easy to clean up but I am trying to find the most efficient way to do so. The file has multiple columns but this are the critical ones (in this format):
Employee id | last name | first name | Course A title- status | Course A title- completion date | Course B title - stat | Course B title- completion date | Course C title- st | Course C- completion date |
1234 | blue | ivy | completed | 12/8/19 | In progress | |||
1234 | blue | ivy | completed | 12/8/19 | ||||
1234 | blue | ivy | In progress |
I want it cleaned like this:
Employee id | last name | first name | Course title | status | completion date |
The biggest issue are these:
a) there are about 10-15 columns with the completion status and an additional 10-15 columns with the title of the course and the completion date. The course title columns are not uniform. For example, sometimes the status column is named "Course title" stat, sometimes "Course title" status", sometimes the column has the course title but no "status" in the column's name. Same with the "completion" heading. I tried using Regex to first cleaning up the titles so they would be uniform and then join every thing using the column name. But that did not work because I could not figure out how to use regex to remove all unwanted characters and have all columns the same name.
b) The other issue is, there are several records for the one person. So, sometimes we end up with having about 30 records for the person to indicate their status for the 10 courses, because each record indicates the completion of the course.
I hope this is clear. I tried using transpose and formula using regex but I am a bit stuck now.
Thanks
Solved! Go to Solution.
Would it be safe to assume that
Thanks, I think the last approach may work. It is similar to what I started. Let me try it out.
Hi @irmb
OK, so with the similar columns, I wanted to create a workflow that can easily adjust based on number of Courses. So transpose is definitely the way to go. You can also leverage the dynamic tools to help you. I separated the course status from the completion dates to easily summarize, and remove where there are too many rows per course respondent.
In the dynamic rename tool, I used a Regex Replace to take only the "Course X" string from the header. Assuming that the pattern is always course title status column, followed by the completion date, this will rename the status column as Column X, and the completion date column as Column X2.
Dynamic Select separates the fields representing status from completion date, so that we can transpose and summarize, and return only one row for each course for each respondent (if someone was in progress and then completed, the formula tool changes the value to Completed. I threw in the Data Cleansing tool to standardize case in the status, and clean up extra whitespace.
Join everything back together, rename your columns, and hopefully this all works for you.
Let me know if this helps. I tried to make it as dynamic as possible.
Cheers!
Esther
HI guys, thanks for your great ideas. I was testing some of the ideas out and realized something. The column names were not standardized "course name - status" "course name - completion date" because when the Excel is imported into Alteryx, some of the column names are truncated. How do I adjust the input tool so that it does not truncate the column names? I searched on the group and saw that I need to change the field length. Looks pretty straightforward but I don't get that option when I see the input tool's options. Screenshot below.
I don't think this is possible with Excel files. Also, the field length applies to the actual values rather than Header names. Could you try moving the headers into the first row of data by selecting the option below? This will hopefully prevent them from being truncated after which you can process them and move them back into the original position by Dynamic Rename
@MichalM I'm testing out your solution and it seems to work pretty well except using the - delimiter. Unfortunately, some of the course names use the - several times and not only to separate the status or completion names from the course name. Is there a way to set the delimiter to be - followed by the initial letters of the words status and completion?