A normal question / issue / challenge...
How can you search for a specific value in a column when the column position and name might change across input files passed through a batch macro?
Macro reads in multiple files to union large table. Schema variability due to PDF to Text variability. Details below...
CONTEXT:
I have 230 input .csv files (each a page of a very long table).
I used AWS Textract to extract the table data from PDF files.
When using Textract to extract table data the output includes both the table rows and its confidence scores for each table row.
The table data rows will be at the top and the confidence scores at the bottom of each file with a few null() rows between.
The start of the confidence data will also have the phrase "Confidence Scores % (Table Cell)" in the first column in the row before the scores (see image example below).
Importantly, "Confidence" will only ever appear in this context (and only once per file).
Normally, I use a formula tool to create a KEEP field and use an expression if
if startswith([Field_1],'Confidence') then 'No' else null() endif
I then use a Multi-Row Tool with to fill-down "No" to tag all the rows that contain the confidence scores and not table data. I can then filter out the rows containing the confidence data with a Filter Tool.
if isnull([KEEP]) then [Row-1:KEEP] else [KEEP] endif
PROBLEM:
I use a batch macro to union the 230 or so files together.
Due to PDF to Text challenges, the schema is not the same.
The batch macro is configured to Auto Configure by Name setting to allow for these subtle differences.
That means that the value Confidence Scores % (Table Cell) will not always appear in the first column. It may appear in the second (and possibly the third).
It also means that the name of the first column may not always be the same (again due to the challenges of PDF to Text).
IMPORTANT NOTE:
I get rid of the confidence data rows in the batch macro so that the union output only includes table row data and no confidence data rows.
QUESTION:
How can I search for startswidth "Confidence" across multiple fields when I do not know the potential field name?
Can you do something similar to if FieldNumber <3 and startswidth "Confidence" then TEST = No
something along the lines of (not literally)
if FieldNumber <3 and startswidth "Confidence" then TEST = No
Solved! Go to Solution.
I should add.
I can't always be sure/trust that the null rows that exist between the table data and confidence scores are a reliable means to distinguish between the two sections to the degree that Confidence provides.
A variation of the question might be.
How could I implement a test that says if all fields for two rows are null then delete everything below those two rows?
Note: Confidence appears in the row immediately after the second null row and all other fields in that third Confidence row are null.
I have a bolt-together solution that involves within the macro changing the field names to generic names, filtering out non table rows, and then restoring field names for that input file.
It works, but you have to turn AMP off in both the Macro and the worksheet where the Macro will be used.
Is there still a way to search across multiple fields in a row for a value or criteria (without turning AMP off etc.)??
Hello!
I'm not sure if I understood the problem correctly, but maybe transposing data into one column and the do the filtering should allow you to do the operation on multiple column at once.
1. I numbered my records with record ID
2. Transpose all data into one column
3. Do your keep on every column with regular formula tool
4. Group by ID and concatenate the keep flag. This way I make sure that every column of every filed is filteround if any of them contains "yes" im my flag
5. Now I got ID to be filtered out
6. Joining the original data to the ID's that left
Sample workflow in the attachment :)
Kind regards
Mateusz
Hello!
Sorry, I've missed your previous post and misunderstood the problem.
But I think you can use my trick, to remove null rows at one go instead of using data cleansing.
Curious about how to search across fields etc. But, the flipping field names was sufficient to account for schema and field name variability
User | Count |
---|---|
106 | |
85 | |
76 | |
54 | |
40 |