Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

PDF to text - extracting table with empty cell value

nickolasau
5 - Atom

Dear Alteryx community, 

 

I am trying to use the  PDF to text tools to extract a table in a pdf file. However the table contains empty cells in several column as below:

 

Snipaste_2024-03-14_16-48-00.png

The resulting output from the tool are text strings as below:

 

Snipaste_2024-03-14_16-49-57.png

I was trying to use regular expression to identify those blank cells. However from the regex perspective, it seems that those blank cells are read the same as those spaces between 2 consecutive numbers in a row. For example in the 1st row, the white space between (285,066) and (2,328) is equivalent to the long white space between (2,328) and 175 so it's unable to distinguish between them 

 

Is there any good way to separate these text string to column, taking into consideration the fact that those long white space is a blank cell. Attaching the desired output for easy reference:

Snipaste_2024-03-14_16-59-34.png

Thank you in advance!

Nick

5 REPLIES 5
alexnajm
17 - Castor
17 - Castor

What's your Regex expression?

nickolasau
5 - Atom

Hello Alex,

 

To make myself clear, I wasn't tryin to use regex for the whole string as there are various row heading (or maybe empty heading). Instead, I tried to use regex_replace to locate those long white space and insert a separator. The regex I tried include

 

- \s{1}\s+\s{1}   ( I find out this one is identical to a  \s+ lol)

-\)*\s+\(*\d* (this one is not able to distinguish between long/short white space)

 

I started my Alteryx journey two weeks ago and currently with very limited knowledge on writing regular express (even Alteryx itself!) and I'm looking forward to the insight from the community! Thank you.

 

Nick

alexnajm
17 - Castor
17 - Castor

I might look at then placing a zero where you have something like 3+ spaces (hard for me to tell what amount of space constitutes "long" whitespace. Something like: RegEx_Replace([value], "\s{3,}", "\s0\s")

 

If you could paste example values, that would be useful! Or even the workflow with the sample values stored in a Text Input

nickolasau
5 - Atom

Thanks Alex for your reply.

 

Yes I think the way you mention is the solution for the moment - specifying an amount of space to define an empty column. My current fix is REGEX_Replace([value], "(\d+)\s{19}", "$1 |0| ") to add a zero and delimiter in the empty column. Thank you and appreciate it!

alexnajm
17 - Castor
17 - Castor

Looks like a great solution!

Labels