Let’s talk Alteryx Copilot. Join the live AMA event to connect with the Alteryx team, ask questions, and hear how others are exploring what Copilot can do. Have Copilot questions? Ask here!
Start Free Trial

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Search Across Fields for a Specific Word Despite Schema Variability

hellyars
13 - Pulsar

A normal question / issue / challenge...

 

How can you search for a specific value in a column when the column position and name might change across input files passed through a batch macro? 

Macro reads in multiple files to union large table. Schema variability due to PDF to Text variability  Details below...

 

 

CONTEXT:

 

I have 230 input .csv files (each a page of a very long table).

I used AWS Textract to extract the table data from PDF files.

 

When using Textract to extract table data the output includes both the table rows and its confidence scores for each table row.

The table data rows will be at the top and the confidence scores at the bottom of each file with a few null() rows between.

The start of the confidence data will also have the phrase "Confidence Scores % (Table Cell)" in the first column in the row before the scores (see image example below).

Importantly, "Confidence" will only ever appear in this context (and only once per file).

 

Normally, I use a formula tool to create a KEEP field and use an expression if

 

 

if startswith([Field_1],'Confidence') then 'No' else null() endif  

 

 

I then use a Multi-Row Tool with to fill-down "No" to tag all the rows that contain the confidence scores and not table data. I can then filter out the rows containing the confidence data with a Filter Tool.

 

if isnull([KEEP]) then [Row-1:KEEP] else [KEEP] endif 

 

 

PROBLEM:

 

I use a batch macro to union the 230 or so files together.

Due to PDF to Text challenges, the schema is not the same.

The batch macro is configured to Auto Configure by Name setting to allow for these subtle differences.

 

That means that the value Confidence Scores % (Table Cell) will not always appear in the first column.  It may appear in the second (and possibly the third).

It also means that the name of the first column may not always be the same (again due to the challenges of PDF to Text).

 

IMPORTANT NOTE: 

 

I get rid of the confidence data rows in the batch macro so that the union output only includes table row data and no confidence data rows.

 

 

QUESTION:

 

How can I search for startswidth "Confidence" across multiple fields when I do not know the potential field name?

Can you do something similar to if FieldNumber <3 and startswidth "Confidence" then TEST = No 

 

 

something along the lines of (not literally) 

if FieldNumber <3 and startswidth "Confidence" then TEST = No  

 

 

 

 

Screenshot 2024-03-30 143542.png

 

5 REPLIES 5
hellyars
13 - Pulsar

I should add. 

I can't always be sure/trust that the null rows that exist between the table data and confidence scores are a reliable means to distinguish between the two sections to the degree that Confidence provides. 

 

A variation of the question might be.  

 

How could I implement a test that says if all fields for two rows are null then delete everything below those two rows?

Note:  Confidence appears in the row immediately after the second null row and all other fields in that third Confidence row are null.

 

Screenshot 2024-03-30 151252.png

hellyars
13 - Pulsar

I have a bolt-together solution that involves within the macro changing the field names to generic names, filtering out non table rows, and then restoring field names for that input file. 

It works, but you have to turn AMP off in both the Macro and the worksheet where the Macro will be used. 

Is there still a way to search across multiple fields in a row for a value or criteria (without turning AMP off etc.)??

 

Screenshot 2024-03-30 155612.png

 

mzak89
7 - Meteor

Hello!

I'm not sure if I understood the problem correctly, but maybe transposing data into one column and the do the filtering should allow you to do the operation on multiple column at once.

1. I numbered my records with record ID
2. Transpose all data into one column
3. Do your keep on every column with regular formula tool
4. Group by ID and concatenate the keep flag. This way I make sure that every column of every filed is filteround if any of them contains "yes" im my flag
5. Now I got ID to be filtered out
6. Joining the original data to the ID's that left

Sample workflow in the attachment :)

Kind regards
Mateusz







mzak89
7 - Meteor

Hello!

Sorry, I've missed your previous post and misunderstood the problem.

But I think you can use my trick, to remove null rows at one go instead of using data cleansing.

hellyars
13 - Pulsar

Curious about how to search across fields etc.  But, the flipping field names was sufficient to account for schema and field name variability

Labels
Top Solution Authors