
Alteryx Designer Desktop Discussions

SOLVED

Alteryx skipping input files because they don't match the schema...but they do!

SergeantJack
8 - Asteroid

Hi.  I have a workflow set up and connected to a macro.  The macro makes the workflow cycle through every .xlsx file in a directory and pull data from five different tabs in each Excel file.  These tabs have different schemas.  Each tab's data is deposited into a separate workstream, which is then united with the others later in the workflow.

 

My problem is this.  Currently I have four Excel files in the directory.  The first three Excel files have no data in Column X on one tab, and this is perfectly acceptable; however, the fourth file does have data in Column X.  When Alteryx reads in the data files, it determines that the fourth file has a different schema than the first three, and it skips that fourth file.  From what I've read in other threads, Alteryx is assigning a different data type to the fields with nulls than it is to the fields in the fourth file that have actual data, which presents itself as a different schema, but I can't figure out a way to verify this or correct it.
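The type-inference behavior described here can be reproduced outside Alteryx. Below is a small pandas stand-in (not Alteryx itself) showing how an all-null column gets inferred as a numeric ("double") type while the same column with text in it gets inferred as a string type, which is exactly the kind of mismatch that reads as a "different schema":

```python
# Pandas stand-in for the inference problem: an all-null column is typed
# as float64 ("double"), while the same column with text is typed as object.
import numpy as np
import pandas as pd

# First three files: Column X is entirely empty -> inferred as float64.
empty = pd.DataFrame({"X": [np.nan, np.nan, np.nan]})

# Fourth file: Column X holds text -> inferred as object (string).
filled = pd.DataFrame({"X": ["a", "b", "c"]})

print(empty["X"].dtype)   # float64
print(filled["X"].dtype)  # object
```

The two dtypes disagree even though the column name is identical, which is the same situation the union step chokes on.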

 

How do I get it to stop doing this?  I have a fairly complicated (to me) setup for reading in multiple files and pulling data from multiple tabs, and I haven't been able to figure out a way to make Alteryx handle this problem correctly.

19 REPLIES
DataNath
17 - Castor

@SergeantJack in that case, I think the warning is due to using a wildcard here. The usual batch macro approach runs every iteration of the control parameter first, applies the fix, and only then unions all of the results at the macro output. Using a wildcard, by contrast, runs the iterations and tries to union them as they happen, which in this case it can't because of the schema change. I'm not too familiar with your approach, but is there any way you could distinguish between your streams and feed something more concrete into your control parameter rather than a wildcard?

SergeantJack
8 - Asteroid

@DataNath The wildcard is necessary.  The idea here is that customers will drop files that need to be processed into a folder, and this workflow will run once per night, gather up all the data available from all files in that folder and then output all of them as a single text file.  I'm using the wildcard because the customers will use their own naming conventions for the files.

 

I don't believe the wildcard is causing the problem.  If I were to show you the results of the first data stream (marked 'Data Items'), you would see that all four files are picked up by the workflow, including my dummy file.  If the wildcard were preventing this from happening, then that first work stream wouldn't run.

 

I may be misunderstanding what you're saying, though.  But if I can't use a wildcard, then I don't know how to get Alteryx to process all the files successfully.

DataNath
17 - Castor

Ah, I get what you're saying now @SergeantJack. In that case I'm a little stumped, and would be curious to see how this ends up getting solved, so I'll keep an eye on it. I've never used a wildcard in other scenarios, so it may just be something simple I haven't come across before. The only final thing I'd say to check is whether your dummy file differs in any other way from the other files in that stream, though I'd imagine you haven't overlooked that; it's just a simple check.

SergeantJack
8 - Asteroid

@DataNath Thanks for your help. I am completely stumped by this, and it's wrecking my workflow.  I found it was a problem yesterday, and it was supposed to go into production last night, so the pressure to get this fixed is high.  Hopefully this will make sense to somebody else and I can get moving on the solution.

Luke_C
17 - Castor

Hi @SergeantJack 

 

To follow on from what others are saying, the way around the schema warning is to use a batch macro. If I understand correctly, the file names are variable but the tab names are not. Is that right? If so, I don't see why a wildcard is necessary. What I think would work is:

 

  1. Directory tool to read all files, filtering to the ones you want
  2. Batch Macro 1: read each file to get its list of sheet names
     - The Input tool within the macro should be set to output the full path as well
  3. Update the full path of each record to append the tab names
  4. Pass each type of file to a batch macro to read in the data; since you have the full paths, you should not need the wildcard
  5. Make sure the macro is set to configure by name, which is what allows the schema error to be avoided

Please let me know if I misunderstood your fact pattern.
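The flow above can be sketched roughly in Python, with pandas standing in for the Directory/Input tools (the folder name is a placeholder, and "union by column name" plays the role of configuring the macro by name):

```python
# Rough sketch of the batch flow: list files, enumerate each file's sheets,
# read every tab by name, and union the results per tab by column name.
from pathlib import Path
import pandas as pd

def gather_streams(folder: Path) -> dict:
    """Return one combined DataFrame per tab name across all .xlsx files."""
    frames_by_tab = {}
    for xlsx in sorted(folder.glob("*.xlsx")):   # step 1: list the files
        book = pd.ExcelFile(xlsx)
        for tab in book.sheet_names:             # step 2: sheet names per file
            df = book.parse(tab)                 # step 4: read each tab by name
            df["FileName"] = xlsx.name           # keep the source file name
            frames_by_tab.setdefault(tab, []).append(df)
    # step 5: concat aligns columns by name, so the streams still line up
    return {tab: pd.concat(dfs, ignore_index=True)
            for tab, dfs in frames_by_tab.items()}
```

Called on the drop folder, this yields one stream per tab (A, B, C, ...) keyed by sheet name, each carrying a FileName column for the later join.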

[screenshot attachment: Luke_C_0-1665611023592.png]

SergeantJack
8 - Asteroid

Hi @Luke_C 

 

I'm perfectly willing to accept that I don't have my workflow or macros set up to handle this well.  I'll break down the file structure to make sure we're on the same page.

 

- File 1

  -- Tab A

  -- Tab B

  -- Tab C

  -- Tab D

  -- Tab E

- File 2

  -- Tab A

  -- Tab B

  -- Tab C

  -- Tab D

  -- Tab E, etc.

 

Tab A in any file should be named exactly the same and have the same schema as Tab A in any other file, and the same goes for Tab B, Tab C, etc.  Within each file, all tabs differ from each other, with different schemas and data types.

 

My workflow needs to:

- Iterate through every file in the specified directory

- Generate the FileName as a distinct column

- Separate the data and schema from all the Tabs A and send it into workstream A

- Separate the data and schema from all the Tabs B and send it into workstream B, etc.

- Process each workstream differently (I don't have a problem with the workflow from this point on)

- Join all the workstreams together based on the FileName column

- Write the results to a single text file

 

My problem is that in some of the tabs, the first file processed has nulls in some columns instead of data, so Alteryx sets the data type of those columns to 'double'.  If any subsequent file actually has data in those columns and its data type is not double, Alteryx decides that the later file has a different schema, and skips over it.

 

I need Alteryx to do one of two things:

1. After a tab's data has been disseminated into one of the workstreams, skip any rows/records that have nulls in a key column.  This would be the ideal solution, but there doesn't seem to be any way to configure the Input Data tools to accomplish this.

2. Stop coercing the data type of null records to be doubles.

Luke_C
17 - Castor

Thanks @SergeantJack. I think what I described would achieve that; filtering out records with nulls would need to be done after the data is input, though.

SergeantJack
8 - Asteroid

@Luke_C Thanks.  I'm trying to figure out how to do that, and how it differs from what I already have.  Currently I have my main workflow set up as a batch macro, with the Directory tool in a separate .yxwz file.  What you described sounds very different.  I set this workflow up over a year ago, and I'm trying to figure out why I set it up this way and what I need to do to fix it.

SergeantJack
8 - Asteroid

@Luke_C can you help me understand what steps 1 and 2 from your list above should look like?  I'm having a tough time trying to figure this out.

SergeantJack
8 - Asteroid

@Luke_C never mind.  I found a workaround.  I selected "First Row has Data", so the header rows forced Alteryx to read everything as string data, and then I eliminated those rows later in the workstream.
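For anyone landing here later, the same workaround can be sketched in pandas (a stand-in, not Alteryx): reading with no header and everything forced to string keeps the schema identical across files regardless of nulls, and the embedded header rows and null-key rows are dropped downstream. The column index used as the key is a placeholder:

```python
# Sketch of the workaround downstream of the input: every column arrives as
# string (e.g. pd.read_excel(..., header=None, dtype=str)), so we drop the
# embedded header row and any row whose key column is null afterwards.
import pandas as pd

def drop_header_and_null_keys(df: pd.DataFrame, key: int = 0) -> pd.DataFrame:
    """Remove the header-read-as-data row, then rows with a null key."""
    body = df.iloc[1:]                  # first row is the embedded header
    return body[body[key].notna()].reset_index(drop=True)

# Simulated "first row has data" input: header row plus three data rows.
raw = pd.DataFrame([["ID", "X"], ["1", "a"], [None, "b"], ["3", None]])
clean = drop_header_and_null_keys(raw)
```

Because every column is text, a file whose Column X is empty and a file whose Column X holds data now present the exact same schema, so nothing gets skipped.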
