Hi Alteryx Champions,
I need help extracting the text as depicted in Regex-1.png; however, the text is getting split as depicted in Regex-2.png.
I need the entire row "Tell us about the states, provinces, and territories you were in during the tax year" in one place and "Done" in the second one; however, it is getting split into three.
Can you upload some sample data that's representative of your input? It looks like some "Done"s appear on separate rows, which Regex cannot handle. If "Done" is always the last word, you can use the function GetWord([text], CountWords([text])-1) in a Formula tool to grab it without needing Regex. A little more information would be helpful.
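To illustrate the logic behind that formula (GetWord counts words from zero, which is why the index is the word count minus 1), here is a rough sketch in Python. It is not Alteryx syntax. `split_trailing_word` is a hypothetical helper name, and the sketch assumes words are separated by whitespace and the status word always comes last.

```python
def split_trailing_word(text):
    """Split a row into (everything before the last word, the last word)."""
    words = text.split()  # splits on any run of whitespace
    if not words:
        return "", ""
    return " ".join(words[:-1]), words[-1]


# Sample row taken from the question above.
row = ("Tell us about the states, provinces, and territories "
       "you were in during the tax year Done")
question, status = split_trailing_word(row)
print(question)
print(status)
```

If the status can span more than one word, or sits on its own row, this approach will not be enough on its own.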
Is that the input or the expected output? Thank you for sharing. Would you mind also including what your input looks like once it is in Alteryx (before any manipulation)? If it's giving you weird spacing or line returns for the questions, the Data Cleansing tool is your friend!
Are you able to upload a sample workflow or share a screenshot of your Input tool settings? What file type are you using? The difficulty with the images being shared is that there is little information to go on, and there seems to be quite a bit of variability even in the two rows of input that I can make out in "Regex-2.png".
Fundamentally, you need to identify the underlying structures in your data to accomplish anything useful in Alteryx. That structure is what I am trying to ascertain, but I still need more information to provide the appropriate assistance.
At this point I feel very confident about what your desired output needs to look like. How to get there still needs further investigation to determine exactly what the input is, and answers to the questions above would help with that.
Thank you very much! It took some finagling, but I managed to get things into working order. It's a little hard-coded, but it shouldn't be too hard to adjust as necessary. My first recommendation is to change your R code to use the function pdf_data(FullPath) instead of pdf_text(FullPath). The difference is that pdf_text() parses all the text into one large block, whereas pdf_data() stores the location of each word in a tibble.
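To show why word locations help, here is a minimal sketch of rebuilding rows from positioned words. It is written in Python for illustration only and is not the R code from the workflow. The field names `x`, `y`, and `text` mirror the columns pdf_data() reports for each word; the coordinates are invented sample values, `words_to_rows` is a hypothetical helper name, and the sketch assumes words on the same printed line share exactly the same `y` value.

```python
from collections import defaultdict


def words_to_rows(words):
    """Rebuild text rows from word records that carry x/y positions."""
    by_y = defaultdict(list)
    for w in words:
        by_y[w["y"]].append(w)  # same y is treated as the same printed line
    rows = []
    for y in sorted(by_y):  # top to bottom
        line = sorted(by_y[y], key=lambda w: w["x"])  # left to right
        rows.append(" ".join(w["text"] for w in line))
    return rows


# Invented sample data: one list of word records per page.
pages = [
    [
        {"x": 300, "y": 100, "text": "Done"},
        {"x": 40, "y": 100, "text": "Tell"},
        {"x": 62, "y": 100, "text": "us"},
        {"x": 40, "y": 120, "text": "Residency"},
        {"x": 300, "y": 120, "text": "Done"},
    ],
]

for page_number, words in enumerate(pages, start=1):
    for row in words_to_rows(pages[page_number - 1]):
        print(page_number, row)
```

Because the words are grouped one page at a time, the same loop would cover a longer document as long as each page arrives as its own table of words. On real PDFs the `y` values of one line may differ slightly, so a small tolerance when grouping may be needed.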
Thank you so much for the solution. However, I am unable to update the file path to make it dynamic so that users can select a different file with more pages. The current workflow handles just one page, but we generally have about 50-60 pages in the PDF. It worked fine with the one-page PDF.