So, I had to build a data set of Congressional marks from some really low quality OCR images.
See chart below.
My question involves the last two rows, which have the value Committee Recommended under WHAT, and specifically the second-to-last row labeled "NEED CALCULATED VERSION". The last row is generated from the OCR scan; however, it is full of errors. So I need to generate the second-to-last row and compare it against the last row to see if there is a difference. If the difference is not 0, then there is an OCR error that needs to be corrected. What is the simplest way to do this?
This is just the walk-down for one line, the real data has hundreds of lines with similar walk downs.
| STATUS | ID | NAME | WHO | WHAT | TXT | AMOUNT |
|---|---|---|---|---|---|---|
| HARD CODED | N0181-07-02-4162-A | SLDTA | PBR | Budget Request | Request | 101595 |
| HARD CODED | N0181-07-02-4162-A | SLDTA | HAC | Adjustment | Increase X | -6385 |
| HARD CODED | N0181-07-02-4162-A | SLDTA | HAC | Adjustment | Decrease Y | -3499 |
| HARD CODED | N0181-07-02-4162-A | SLDTA | HAC | Sum Adjustments | Sum of all... | -9884 |
| NEED CALCULATED VERSION | N0181-07-02-4162-A | SLDTA | HAC | Committee Recommended | HAC Recommended | 91,711 |
| ALREADY HARD CODED | N0181-07-02-4162-A | SLDTA | HAC | Committee Recommended | HAC Recommended | 91,711 |
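For anyone who wants to try this outside of a workflow, here is a minimal pandas sketch of the check, assuming the table above is loaded as a DataFrame (column names and WHAT values follow the table; only the columns needed for the check are shown):

```python
import pandas as pd

# One walk-down from the table above, AMOUNT as plain integers.
df = pd.DataFrame({
    "ID": ["N0181-07-02-4162-A"] * 5,
    "WHAT": ["Budget Request", "Adjustment", "Adjustment",
             "Sum Adjustments", "Committee Recommended"],
    "AMOUNT": [101595, -6385, -3499, -9884, 91711],
})

def check_walkdown(group: pd.DataFrame) -> pd.Series:
    """Recompute Committee Recommended and compare it to the OCR'd value."""
    request = group.loc[group["WHAT"] == "Budget Request", "AMOUNT"].sum()
    adjustments = group.loc[group["WHAT"] == "Adjustment", "AMOUNT"].sum()
    ocr_value = group.loc[group["WHAT"] == "Committee Recommended", "AMOUNT"].sum()
    expected = request + adjustments
    return pd.Series({
        "expected": expected,
        "ocr": ocr_value,
        "diff": ocr_value - expected,
        "ocr_error": (ocr_value - expected) != 0,
    })

# One check per ID, so hundreds of walk-downs work the same way.
result = df.groupby("ID")[["WHAT", "AMOUNT"]].apply(check_walkdown)
print(result)
```

For this line the recomputed value (101595 - 6385 - 3499 = 91711) matches the OCR'd 91,711, so no error is flagged; any nonzero `diff` marks a row to correct.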
I do know I can probably use some combo of a Sum tool...
Hi @hellyars
I was a bit unsure what calculation is used for the "Need Calculated" row, so I have taken it as the sum of the rows above it.
Workflow:
I want to separate the last row and the rows above it into 2 blocks. Since the Sample tool only lets you skip the first rows, I am reversing the order with a Record ID and sorting it in descending order. Then I split, sum up, and compare. If the values match, it gives a "Yes" flag. Assuming you would have to do this for each ID, I am using Group By on ID.
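For readers following along without the workflow attached, the same split-last-row-and-compare steps can be sketched in pandas (the data here is illustrative, not the original table, and the "Yes"/"No" flag matches the description above):

```python
import pandas as pd

# Toy walk-downs: the last row per ID is the total to verify.
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B"],
    "AMOUNT": [100, -30, 70, 50, -10, 39],  # B's last row is off by 1
})

# Split each ID's block into its last row and everything above it.
last = df.groupby("ID").tail(1)
rest = df.drop(last.index)

# Sum the upper rows per ID and compare against the last row's value.
sums = rest.groupby("ID")["AMOUNT"].sum().rename("calculated")
check = last.set_index("ID")["AMOUNT"].rename("ocr").to_frame().join(sums)
check["flag"] = (check["ocr"] == check["calculated"]).map({True: "Yes", False: "No"})
print(check)
```

ID "A" sums to its last row and flags "Yes"; ID "B" does not and flags "No".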
Hope this helps : )
@atcodedog05 Cool. Would it be possible to attach the sample workflow? I would like to see the details behind the tools.
My bad @hellyars . I have updated the post.
@atcodedog05 Cool. Taking a look. My question was prompted by the need to develop a series of checks to identify discrepancies between the actual values (what they should be, or should total to, in the original PDF source document) and the values generated by the OCR scan, which were wildly off across 50% or more of the scanned data.
I want to build a check for HAC Recommended - the 91,711. My data already contains a "hard coded" value. This "hard coded" value is the result of an OCR scan.
I want to create a new row that calculates HAC Recommended either by adding all of the "Adjustment" rows to "Budget Request", or by adding the single "Sum Adjustments" row to the Request, and then compares the result against "HAC Recommended". This will allow me to spot issues where there are OCR errors.
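Both calculation paths should land on the same number, so a quick sketch (using the amounts from the table posted earlier; not a definitive implementation) can compute both and cross-check them against the OCR'd total:

```python
import pandas as pd

# One walk-down's rows, amounts taken from the earlier table.
rows = pd.DataFrame({
    "WHAT": ["Budget Request", "Adjustment", "Adjustment",
             "Sum Adjustments", "Committee Recommended"],
    "AMOUNT": [101595, -6385, -3499, -9884, 91711],
})

request = rows.loc[rows["WHAT"] == "Budget Request", "AMOUNT"].iat[0]
# Path 1: Budget Request plus each individual Adjustment row.
path1 = request + rows.loc[rows["WHAT"] == "Adjustment", "AMOUNT"].sum()
# Path 2: Budget Request plus the pre-summed "Sum Adjustments" row.
path2 = request + rows.loc[rows["WHAT"] == "Sum Adjustments", "AMOUNT"].iat[0]
ocr = rows.loc[rows["WHAT"] == "Committee Recommended", "AMOUNT"].iat[0]

# If path1 != path2, the adjustment rows themselves were mis-scanned;
# if they agree but differ from the OCR total, the total was mis-scanned.
print(path1, path2, ocr, ocr - path1)
```

Comparing the two paths against each other, not just against the OCR total, also tells you whether the error is in the adjustments or in the recommended amount.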
Hi @hellyars
Thanks, this gives me a better idea of the calculation. Here is the workflow for the use case.
Workflow:
Hope this helps : )