I'm trying to find the best way to clean up my input data in Alteryx, as each original text file is 1.2+ GB. Essentially, I have input data that's broken up as follows:
The "Field #" columns are left out after import. However, some entries are missing a field, which shifts everything to the left and corrupts the remaining data:
So, in my example, Field 2's data was missing, so zip code was thrown into field 2, and the remaining data was shifted to the left as well. There aren't too many entries like this, so I don't need to restore their data to working order if there isn't a way to do that here. But is there a way to have Alteryx find this inconsistency in the data so I can clean up these entries? I've incorporated a couple filters that only keep the correct values so far, but I don't have a good way of knowing whether or not I got everything.
When Alteryx ingests a data set, every row will have the same number of fields, so in this case, missing columns will result in null values in the last column. By filtering on IsNull([Last Column]), you will catch all the rows that have dropped fields, although you will also get the rows that legitimately have a null value in the last field. This assumes, of course, that the row values stay intact and that a row value can't be corrupted and split into two columns.
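Outside of Alteryx, the same check can be sketched in a few lines of Python. The field names and sample data here are hypothetical, just to show the idea: a row that lost a field shifts left and ends with an empty last column.

```python
import csv
import io

# Hypothetical sample: the second row lost its "field2" value, so the
# remaining values shifted left and the final column arrived empty.
raw = """name,field2,zip,city
Alice,ABC,12345,Springfield
Bob,54321,Shelbyville,
"""

reader = csv.DictReader(io.StringIO(raw))

# Mirror the Filter tool's IsNull([Last Column]) test: flag rows whose
# last column is empty. Note this also catches rows that genuinely
# have a null in the last field, just as the Alteryx filter would.
suspect = [row for row in reader if not row["city"]]
```

As in Alteryx, this catches every shifted row at the cost of a few false positives that then need a manual look.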
I can't think of any "one size fits all" solution that will actually fix the bad rows. If you're going to do that, you are going to have to make some additional assumptions about the data and what kinds of corruption are possible.
If columns have unique characteristics, then you might have a chance of putting REGEX rules in place to check each value and determine if the value is in the correct column. If we can establish those rules, I would first number the rows with a Record ID tool, then I would use a filter to grab the rows with null last columns and transpose those rows. Depending on how complex the row rules are, I might use a batch macro for the next step which is to process the values grouped by the Record ID. If each row has unique patterns, then I would test each value for each pattern and assign column numbers accordingly. Then I would crosstab using the new column numbers and finally use Dynamic Rename to restore the original Column names. Instead of assigning column numbers, you could also create new columns and populate them.
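The "test each value for each pattern" step could be sketched like this in Python. The column names and patterns below are made up for illustration; the real rules would come from whatever unique characteristics your columns have.

```python
import re

# Hypothetical per-column patterns: assumes a 5-digit zip code and a
# 2-letter state code are distinctive enough to identify on sight.
patterns = {
    "zip":   re.compile(r"^\d{5}$"),
    "state": re.compile(r"^[A-Z]{2}$"),
}

def classify(value):
    """Return the first column whose pattern matches, else 'text'."""
    for column, pattern in patterns.items():
        if pattern.match(value):
            return column
    # Nothing matched: treat the value as free text, which will need
    # extra cross-checks against the other columns.
    return "text"
```

After classifying each transposed value, you would carry the assigned column name through the Cross Tab and restore the original headers with Dynamic Rename, exactly as described above.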
In your data set, each column except for Field 2 has a distinct number of characters. If that is a hard and fast rule, then we should be able to sort out the correct values with some degree of success with a few exceptions. The random text is going to be a problem unless you can define some kind of pattern (for example, length of at least two and at least one non-numeric). Otherwise you're going to have to run a bunch of tests against other columns to make sure you aren't throwing content in that field that belongs somewhere else.
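If the fixed-width assumption really holds, the length rule itself is trivial to express. The widths below are invented examples, not taken from your data:

```python
# Hypothetical length rule: assumes each column (except the free-text
# field) has a fixed, distinct width, e.g. zip = 5, state = 2, id = 9.
length_to_column = {5: "zip", 2: "state", 9: "id"}

def column_for(value):
    # Values whose length matches no known width fall through to the
    # ambiguous free-text bucket, which needs the extra cross-checks
    # described above before it can be trusted.
    return length_to_column.get(len(value), "text")
```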
I created a very simple example where each field has unique characteristics. I decided to create new columns instead of adding column numbers and doing a crosstab, but either method should work. Hopefully this can be a jumping-off point for you.
(By the way, I broke out each formula tool separately for clarity; you could combine the formulas into one tool to shorten the workflow.)
Thanks for your help! Oddly enough, the majority of broken records still populated the correct value in the last column (not sure how, must've depended on the length of certain "field #" columns). However, there was one column near the "breaking point" that consistently populated null when the entry broke, so I was able to use that instead. I haven't gotten any conversion errors and haven't been able to find any other broken rows with other filtered values, so it looks like this worked!
Also, thanks for the very detailed solution concerning trying to restore the data. Sadly, when the break happened, there were fields close by that didn't have a set number of characters (think like a company name or address), so I was unable to use this method. It was very interesting to go through though, I'm still fairly new to Alteryx and haven't seen any of the REGEX functions in action, so this was a nice learning experience.
Is this data originally coming from a text document? From your last posting it sounds like that may be the case and some other application is parsing the document before Alteryx sees it. If that is the case, you might want to start with the original document and use Alteryx to parse it.