We have historical reports from Vertex L Series, an ancient sales tax calculation software, that we need converted to Excel. Essentially I want Alteryx to read the headers that are embedded in the text, and take the string of (non-space) characters that follow. If a header is missing from the text, the table should create a null cell and skip to the next available header. Once it gets to the end of the header list, it should create a new row and start again.
Here's a shot of the TXT file (attached) with sample data, headers in yellow and data in blue:
Here is the FLAT file I'm using after my manual edits:
<flatfile version="1">
<file
path="G:\Tax\Alteryx\Alteryx Workflows\Indirect Workflows\Vertex Reports\Vertex test.asc"
eoltype="crlf"
trimWhiteSpace="t"
allowShortLines="t"
allowLongLines="t"
/>
<fields>
<field name="CITY-NAME:" type="V_String" length="64"/>
<field name="COUNTY-NAME:" type="V_String" length="64"/>
<field name="GEOCODE:" type="String" length="11"/>
<field name="COMPANY CD:" type="String" length="5"/>
<field name="DIVISION CD:" type="String" length="5"/>
<field name="PRODUCT CD:" type="String" length="3"/>
<field name="CUST CD:" type="String" length="12"/>
<field name="TRANS CD:" type="String" length="1"/>
<field name="TYPE:" type="String" length="4"/>
<field name="SUBTYPE:" type="String" length="1"/>
<field name="INVOICE NO:" type="String" length="6"/>
<field name="INVOICE DATE:" type="String" length="10"/>
<field name="USER INFO: I" type="String" length="11"/>
<field name="IND/TYPE/NSC/RATE-" type="V_String" length="112"/>
<field name="GROSS SALES-------" type="V_String" length="112"/>
<field name="EXEMPT AMT/CODE---" type="V_String" length="112"/>
<field name="NON-TAXABLE AMT/CD" type="V_String" length="112"/>
<field name="ZERO RATE AMOUNT--" type="V_String" length="112"/>
<field name="TAXABLE AMOUNT----" type="V_String" length="112"/>
<field name="TAX AMOUNT--------" type="V_String" length="112"/>
</fields>
</flatfile>
(I have the string data after "USER INFO: I" set to 11 because any characters after 11 are junk data.)
Unfortunately, when use this as a data input in Alteryx, the result is completely wrong:
Any insight is appreciated.
This forum discussion got me as far as I am, so my thanks to you:
https://community.alteryx.com/t5/Alteryx-Designer-Discussions/4gb-txt-file-without-delimiter-to-open...
There's no easy way to do this. The flat file format has no way to define different formats for different lines and your file has several differing row formats and several different groups of rows that have to be processed individually. To handle these kinds of files, I've found that it's best to read the entire file in as a series lines with no internal delimiters and then build streams within the workflow to deal with separate line formats and row group.
A row group is group of rows that are some how related to each other. In your file a State group is all the records for a given state. These rows are related because the value "Georgia" which appears lines 5, 30, etc., needs to be applied to all the rows that follow it until we get to another state. The CityCounty group of rows is a group that has 3 lines of header that needs to be applied to the following 5 lines of data. CityTotal, CountyTotal and Statetotal groups are similar with a number of header rows that apply to number of data rows.
The generic process is this.
1. Identify the various group types in your file. These types of files general have unique line formats for the various groups so that humans can tell what the current section of the report deals with. Look for these markers to identify where each group starts
2. Mark each group with a unique ID. The ids have to be unique within each group type but can overlap across group types.
3. Process each of the group types in turn.
a. Process all the header rows and join all the header fields into one record per group
b. Process all the data rows and union all the resulting datarows
4. join the header rows to the data rows on groupID which prepends the header field info to each data row
The first container uses a series of multirow tools to mark out the various groups in your file. Each uses a similar formula
If startsWith([Field_1],"CITY-NAME") then
[Row-1:CityCountyGroup]+1
else
[Row-1:CityCountyGroup]
endif
If this row is the first row in a new group then increment the previous row's group number. Otherwise use the previous row's group number. The criteria is change based on the specific group you're marking.
The last field in the container fills the most recently found State value to all following rows.
The next container defines the process for the CityCounty group. The initial Sample tool takes the first 8 rows from each CityCounty group which gives us our entire group. The key thing here is that the all the Sample tools in this container group by the same CityCountyGroup field. The filter removes all the records with CityCounty=0, which correspond to the records before the first CityCounty row, which is line 7 in your case. The process now splits in to a series of streams, one for each of the unique row type in the group. The 3 header row types are processed in the Header Rows container and all the fields are combined into one row per CityCounty group. The Skip 1st 1 sample tools remove the previous header row so the next sample tool can work on the following header group. I used Regex Parse tools to extract the relevant info from the rows, but a formula tool, similar to what is used in the Data rows container would work as well.
The Data rows container does a similar process for each unique set of data rows. First it handles the row with the tax rates and flags and then the following 4 data rows
At the end of the process your data will look something like this
with another 15 or so header fields off to the left.
The example I've attached only deals with the CityCounty group type. To parse the other types, copy the CityCounty container connect it the output of the Fill State tool, like the small Process CityTotal group records container. You'll have to modify the contents of the container to deal with the specifics of the group that you're processing. It will be a fair amount of work, but since your sample file mentions Page No. 13185, it looks like you have a lot of data to process
Dan
User | Count |
---|---|
18 | |
14 | |
13 | |
9 | |
8 |