Alteryx Designer Desktop Discussions

SOLVED

Need help with Regex to extract text from string

Pranab_C
8 - Asteroid

Hi Alteryx Champions,

 

I need help extracting the text as shown in Regex-1.png; however, the text is getting split as shown in Regex-2.png.

 

I need the entire row "Tell us about the states, provinces, and territories you were in during the tax year" in one place and "Done" in the second one; however, it is getting split into three parts.

13 REPLIES
AndrewDMerrill
13 - Pulsar

Can you upload some sample data that's representative of your input? It looks like some "Done"s appear on separate rows, which Regex cannot handle. If "Done" is always the last word, you can use the function GetWord([text], CountWords([text])-1) in a Formula tool to grab it without needing Regex. A little more information would be helpful.
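For readers who want to see the logic spelled out: the GetWord suggestion amounts to splitting off the final whitespace-delimited word. Below is a minimal Python sketch of that idea. It is an illustration only, not Alteryx formula syntax, and `split_last_word` is a hypothetical helper name.

```python
def split_last_word(text):
    """Return (everything before the last word, the last word).

    Words are split on whitespace, so runs of spaces collapse to one space.
    """
    words = text.split()
    if not words:
        return "", ""
    return " ".join(words[:-1]), words[-1]


question, status = split_last_word(
    "Tell us about the states, provinces, and territories "
    "you were in during the tax year     Done"
)
```

This only helps when the status word sits on the same row as the question text; a "Done" that lands on its own row would come back with an empty question.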

Pranab_C
8 - Asteroid

Thank you for replying. Please see this.

AndrewDMerrill
13 - Pulsar

Is that the input or the expected output? Thank you for sharing. Would you mind also including what your input looks like once it is in Alteryx (before any manipulation)? If it's giving you weird spacing or line returns for the questions, the Data Cleansing tool is your friend!

Pranab_C
8 - Asteroid

Sharing it again. The first file is the input, and the second one is how it is coming into Alteryx right now. Ideally, the output should be:

 

Column-1: Tell us about the states, provinces, and territories you were in during the tax year

Column-2: Done
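As a rough illustration of the two-column split described above, here is a minimal Python sketch using a regular expression with two capture groups. It assumes the status is literally the word "Done" and that it sits on the same line as the question text. The pattern and the `split_row` name are illustrative assumptions, not taken from the workflow.

```python
import re

# Group 1: the question text (must end in a non-space character).
# Group 2: the literal status word "Done" at the end of the line (an assumption).
ROW_PATTERN = re.compile(r"^(.*\S)\s+(Done)\s*$")


def split_row(line):
    """Return (question, status), or None when the line does not match."""
    match = ROW_PATTERN.match(line)
    if match is None:
        return None
    return match.group(1), match.group(2)


row = split_row(
    "Tell us about the states, provinces, and territories "
    "you were in during the tax year     Done"
)
```

In the Alteryx RegEx tool, the Parse output method writes each capture group to its own column, so a pattern of this shape would be the starting point there. Lines where "Done" has been pushed onto a separate row will not match.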

AndrewDMerrill
13 - Pulsar

Are you able to upload a sample workflow or share a screenshot of your Input tool settings? What file type are you using? The difficulty with the images being shared is that they give little information to go on, and there seems to be quite a bit of variability in even just the two rows of input that I can make out in "Regex-2.png".

 

Fundamentally, you need to identify the underlying structures in your data to accomplish anything useful in Alteryx. That structure is what I am trying to ascertain, but I still need more information to provide the appropriate assistance.

 

At this point I am confident about what your desired output should look like. How to get there still depends on exactly what the input is, and answers to the questions above would help determine that.

Pranab_C
8 - Asteroid

I am using R code to convert a PDF to .txt in Step 1, Text to Columns to split the data into rows in Step 2, and then Regex to extract the data. I need help with the final step.

AndrewDMerrill
13 - Pulsar

Thank you very much! It took some finagling, but I managed to get things into working order. The solution is a little hard-coded, but it shouldn't be too hard to adjust as necessary. My first recommendation is to change your R code to use the function pdf_data(FullPath) instead of pdf_text(FullPath). The difference is that pdf_text() returns the text as one large block, whereas pdf_data() stores the location of each word in a tibble.
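The actual R code is only in the screenshots, so the following is just a sketch of why word positions help. Once each word carries x and y coordinates (as in pdf_data() output), words can be regrouped into rows by y and split into columns by x. The sample coordinates, the `STATUS_X` threshold, and the `rebuild_rows` name are all invented for illustration, and Python is used here only to show the logic.

```python
from collections import defaultdict

# Hypothetical word boxes shaped like pdf_data() output (x, y, text).
# The coordinates are made up for this example.
words = [
    {"x": 40, "y": 100, "text": "Tell"},
    {"x": 62, "y": 100, "text": "us"},
    {"x": 75, "y": 100, "text": "about"},
    {"x": 480, "y": 100, "text": "Done"},
    {"x": 40, "y": 120, "text": "Income"},
    {"x": 480, "y": 120, "text": "Done"},
]

STATUS_X = 400  # assumed left edge of the status column


def rebuild_rows(words, status_x=STATUS_X):
    """Group words by y into lines, then split each line at status_x."""
    by_line = defaultdict(list)
    for word in words:
        by_line[word["y"]].append(word)

    rows = []
    for y in sorted(by_line):
        line = sorted(by_line[y], key=lambda w: w["x"])
        question = " ".join(w["text"] for w in line if w["x"] < status_x)
        status = " ".join(w["text"] for w in line if w["x"] >= status_x)
        rows.append((question, status))
    return rows


rows = rebuild_rows(words)
```

Grouping on exact y values is a simplification; in a real PDF, words on the same visual line can differ by a pixel or two, so some rounding or tolerance would likely be needed.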

Screenshot.png

Screenshot 2.png

Pranab_C
8 - Asteroid

Thank you so much for the solution. However, I am unable to update the file path to make it dynamic so that users can select a different file with more pages. The current workflow handles just one page, and we generally have about 50-60 pages in the PDF. It worked fine with the one-page PDF.

AndrewDMerrill
13 - Pulsar

I see. I modified the code and workflow to run on a PDF with multiple pages:
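The modified code itself is only in the screenshot. As a hedged sketch of the general idea: pdf_data() returns one table of words per page, so one way to handle many pages is to tag every word with its page number and group lines by (page, y) rather than by y alone. The sample data and the `tag_pages` name below are hypothetical, and Python is used only to illustrate the logic.

```python
def tag_pages(pages):
    """Flatten per-page word lists into one list, adding a 1-based page number."""
    tagged = []
    for page_number, page_words in enumerate(pages, start=1):
        for word in page_words:
            tagged.append({"page": page_number, **word})
    return tagged


# Hypothetical two-page input; coordinates are invented for illustration.
pages = [
    [{"x": 40, "y": 100, "text": "Tell"}, {"x": 480, "y": 100, "text": "Done"}],
    [{"x": 40, "y": 100, "text": "Income"}, {"x": 480, "y": 100, "text": "Done"}],
]

all_words = tag_pages(pages)

# Lines are keyed by (page, y) so the same y on different pages stays separate.
line_keys = sorted({(w["page"], w["y"]) for w in all_words})
```

Keying on the page as well as y matters because y coordinates restart on every page; without the page number, row 100 of page 1 and row 100 of page 2 would be merged into one line.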

New Screenshot R Code.png
