Alteryx Designer Desktop Discussions

nickolasau · ‎03-14-2024

Dear Alteryx community,

I am trying to use the PDF to text tools to extract a table in a pdf file. However the table contains empty cells in several column as below:

The resulting output from the tool are text strings as below:

I was trying to use regular expression to identify those blank cells. However from the regex perspective, it seems that those blank cells are read the same as those spaces between 2 consecutive numbers in a row. For example in the 1st row, the white space between (285,066) and (2,328) is equivalent to the long white space between (2,328) and 175 so it's unable to distinguish between them

Is there any good way to separate these text string to column, taking into consideration the fact that those long white space is a blank cell. Attaching the desired output for easy reference:

Thank you in advance!

Nick

alexnajm · ‎03-14-2024

What's your Regex expression?

nickolasau · ‎03-15-2024

Hello Alex,

To make myself clear, I wasn't tryin to use regex for the whole string as there are various row heading (or maybe empty heading). Instead, I tried to use regex_replace to locate those long white space and insert a separator. The regex I tried include

- \s{1}\s+\s{1} ( I find out this one is identical to a \s+ lol)

-\)*\s+\(*\d* (this one is not able to distinguish between long/short white space)

I started my Alteryx journey two weeks ago and currently with very limited knowledge on writing regular express (even Alteryx itself!) and I'm looking forward to the insight from the community! Thank you.

Nick

alexnajm · ‎03-15-2024

I might look at then placing a zero where you have something like 3+ spaces (hard for me to tell what amount of space constitutes "long" whitespace. Something like: RegEx_Replace([value], "\s{3,}", "\s0\s")

If you could paste example values, that would be useful! Or even the workflow with the sample values stored in a Text Input

nickolasau · ‎03-18-2024

Thanks Alex for your reply.

Yes I think the way you mention is the solution for the moment - specifying an amount of space to define an empty column. My current fix is REGEX_Replace([value], "(\d+)\s{19}", "$1 |0| ") to add a zero and delimiter in the empty column. Thank you and appreciate it!

alexnajm · ‎03-18-2024

Looks like a great solution!

Alteryx Designer Desktop Discussions

PDF to text - extracting table with empty cell value