Let’s talk Alteryx Copilot. Join the live AMA event to connect with the Alteryx team, ask questions, and hear how others are exploring what Copilot can do. Have Copilot questions? Ask here!
Start Free Trial

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Need help cleaning text

karliu14
5 - Atom

Hi,

 

I need help cleaning the text of the attached PDF/ excel exported from PDF. It is a list of New York counties with its cities. How can I clean this up so that I can have two fields (columns)- one for the County and one for the City within that county?  See output example below.

 

Thank you!

 

County

City

AlbanyAlbany
AlbanyCohoes
AlbanyWatervliet
AlbanyBerne
AlbanyBethlehem

etc.

4 REPLIES 4
Carolyn
12 - Quasar
12 - Quasar

Give this a try! 

 

Edit: Explanation - 

  1. I started with a RecordID Tool because some lines had the county with some cities listed at the bottom of one column and then more cities in that county at the top of the next column. The RecordID Tool allows me to sort later to get everything into the right order
  2. Transpose to get everything into one column
  3. Filter to exclude all the null rando lines
  4. Sort by Name then Record ID to get everything into its proper order
  5. Text to Columns to split to rows based on each Line Break
  6. RegEx to split out the County and City Name info
  7. Multi-Row Tool to fill in the County Name for all the Cities
  8. Filter/Select to do some clean up

 

2024-08-14_15-01-10.png

Yoshiro_Fujimori
15 - Aurora
15 - Aurora

Hi @karliu14 ,

 

My solution is almost the same as that of @Carolyn except for;

  • Sort Tool is not used as I think the column break should come before the page break.
    For example, City "Otto" should belong to "Cattaraugus County", instead of "Ontario County".
  • Some conditions are added to deal with "New York City".

I hope this helps.

 

Workflow

CityByCounty.png

 

Multi-Row Formula

[County] =

  IF RegEx_Match([Value], "^\d.*") OR [Value] = "All Buroughs" THEN [Row-1:County]
  ELSEIF StartsWith([Value], "New York City") THEN "New York City"
  ELSE RegEx_Replace([Value], "(.*?)County.*", "$1")
  ENDIF

 

Filter

  REGEX_Match([Value], "^\d.*") OR [Value] = "All Buroughs"

 

Formula

  [Value] = REGEX_Replace([Value], "\d+\s+", "")

karliu14
5 - Atom

Amazing! I have a lot to learn... Thank you, Yoshiro and Carolyn!

Carolyn
12 - Quasar
12 - Quasar

@Yoshiro_Fujimori 

 


My solution is almost the same as that of @Carolyn except for;
  • Sort Tool is not used as I think the column break should come before the page break.
    For example, City "Otto" should belong to "Cattaraugus County", instead of "Ontario County".

Shoot! I thought that I did that right but didn't have a chance to triple check. Good catch :) 

Labels
Top Solution Authors