Alteryx Designer Desktop Discussions

karliu14 · 08-14-2024

Hi,

I need help cleaning the text of the attached PDF (or the Excel file exported from that PDF). It is a list of New York counties with their cities. How can I clean this up so that I have two fields (columns): one for the County and one for the City within that county? See the output example below.

Thank you!

County	City
Albany	Albany
Albany	Cohoes
Albany	Watervliet
Albany	Berne
Albany	Bethlehem

etc.

Carolyn · 08-14-2024

Give this a try!

Edit: Explanation -

I started with a RecordID Tool because some lines had the county with some cities listed at the bottom of one column and then more cities in that county at the top of the next column. The RecordID Tool allows me to sort later to get everything into the right order.
Transpose to get everything into one column
Filter to exclude all the null and stray junk lines
Sort by Name then Record ID to get everything into its proper order
Text to Columns to split to rows based on each Line Break
RegEx to split out the County and City Name info
Multi-Row Tool to fill in the County Name for all the Cities
Filter/Select to do some clean up
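
For readers who think more easily in code, here is a rough Python sketch of the same sequence of steps. It is not the attached workflow: the sample grid, the column names "A" and "B", the number prefixes, and the function names flatten_grid and pair_county_city are all made up for illustration, since the attachment's actual layout is not shown in this thread.

```python
import re

# Hypothetical sample: one dict per spreadsheet row, each cell may hold several
# lines of text. The real attachment's layout is an assumption here.
rows = [
    {"A": "Albany County\n01 Albany\n02 Cohoes", "B": "04 Berne\nAllegany County\n01 Alfred"},
    {"A": None, "B": None},
    {"A": "03 Watervliet", "B": "02 Almond"},
]

def flatten_grid(rows):
    # RecordID + Transpose: one (column name, record id, cell text) triple per cell
    cells = [(name, rec_id, text)
             for rec_id, row in enumerate(rows, start=1)
             for name, text in row.items()]
    # Filter: drop null or blank cells
    cells = [c for c in cells if c[2] and c[2].strip()]
    # Sort by column name, then record ID
    cells.sort(key=lambda c: (c[0], c[1]))
    # Text to Columns (split to rows) on each line break
    return [line.strip()
            for _, _, text in cells
            for line in text.splitlines()
            if line.strip()]

def pair_county_city(lines):
    pairs = []
    county = None
    for line in lines:
        # RegEx: a line containing "County" starts a new county
        m = re.match(r"^(.*?)\s*County\b", line)
        if m:
            county = m.group(1)
        else:
            # Multi-Row fill-down: every other line is a city in the current county
            city = re.sub(r"^\d+\s+", "", line)
            pairs.append((county, city))
    return pairs

pairs = pair_county_city(flatten_grid(rows))
```

Note that sorting by column name first only gives the right order when all the cells come from a single page.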

Yoshiro_Fujimori · 08-14-2024

Hi @karliu14 ,

My solution is almost the same as that of @Carolyn, except for the following:

The Sort Tool is not used, as I think the column break should come before the page break.
For example, City "Otto" should belong to "Cattaraugus County" instead of "Ontario County".
Some conditions are added to deal with "New York City".
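
To illustrate the ordering point, here is a small Python sketch. The page, column, and row positions and the city "Phelps" are invented for the example and are not taken from the attachment; only the Otto/Cattaraugus/Ontario relationship comes from the note above.

```python
# Hypothetical cells as (page, column, row, text); positions are made up.
cells = [
    (1, "A", 1, "Cattaraugus County"),
    (1, "B", 1, "Otto"),
    (2, "A", 1, "Ontario County"),
    (2, "B", 1, "Phelps"),
]

# Sorting by column first runs column A of every page before any column B,
# so "Otto" ends up after "Ontario County".
by_column_first = [c[3] for c in sorted(cells, key=lambda c: (c[1], c[0], c[2]))]

# Keeping page order first (column break before page break) keeps "Otto"
# directly under "Cattaraugus County".
by_page_first = [c[3] for c in sorted(cells, key=lambda c: (c[0], c[1], c[2]))]

print(by_column_first)
print(by_page_first)
```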

I hope this helps.

Workflow

Multi-Row Formula

[County] =

IF RegEx_Match([Value], "^\d.*") OR [Value] = "All Buroughs" THEN [Row-1:County]
ELSEIF StartsWith([Value], "New York City") THEN "New York City"
ELSE RegEx_Replace([Value], "(.*?)County.*", "$1")
ENDIF
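
As a rough cross-check of the logic, the same fill-down can be sketched in Python. This is only an approximation of the expression above, not Alteryx code: the function name assign_county and the sample values are hypothetical, and the final strip() is my own addition to drop the space left in front of "County".

```python
import re

def assign_county(values):
    result = []
    prev = None  # stands in for [Row-1:County]; there is no previous row at the start
    for value in values:
        if re.match(r"\d", value) or value == "All Buroughs":
            # City rows (starting with a digit) inherit the county from the row above
            county = prev
        elif value.startswith("New York City"):
            county = "New York City"
        else:
            # Keep the text before "County"; strip() is an addition, not in the original formula
            county = re.sub(r"(.*?)County.*", r"\1", value).strip()
        result.append((county, value))
        prev = county
    return result

# Hypothetical sample values; the number prefixes are invented.
sample = ["Albany County", "01 Albany", "02 Cohoes", "New York City", "All Buroughs"]
tagged = assign_county(sample)
```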

Filter

REGEX_Match([Value], "^\d.*") OR [Value] = "All Buroughs"

Formula

[Value] = REGEX_Replace([Value], "\d+\s+", "")
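
The Filter and Formula steps together keep only the city rows and strip the leading number. A minimal Python sketch of that pair of steps, assuming rows arrive as (county, value) tuples (the function name keep_city_rows and the sample rows are hypothetical):

```python
import re

def keep_city_rows(rows):
    cleaned = []
    for county, value in rows:
        # Filter: keep rows that start with a digit, plus the "All Buroughs" row
        if re.match(r"\d", value) or value == "All Buroughs":
            # Formula: remove each run of digits followed by whitespace
            cleaned.append((county, re.sub(r"\d+\s+", "", value)))
    return cleaned

# Hypothetical input; the number prefixes are invented.
rows = [
    ("Albany", "Albany County"),
    ("Albany", "01 Albany"),
    ("New York City", "New York City"),
    ("New York City", "All Buroughs"),
]
result = keep_city_rows(rows)
```

Because the pattern is not anchored to the start of the value, it would also remove digits followed by a space elsewhere in a city name, which may or may not matter for this data.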

karliu14 · 08-14-2024

Amazing! I have a lot to learn... Thank you, Yoshiro and Carolyn!

Carolyn · 08-15-2024

@Yoshiro_Fujimori wrote:

"My solution is almost the same as that of @Carolyn except for;
Sort Tool is not used as I think the column break should come before the page break.
For example, City "Otto" should belong to "Cattaraugus County", instead of "Ontario County"."

Shoot! I thought that I did that right but didn't have a chance to triple check. Good catch :)
