
SOLVED

Breaking long strings into multiple rows based on character length column by whole words

alpro-23
7 - Meteor

Hi all, I am rather new to Alteryx and would like to seek help on this problem.

 

I have a column of comma-separated strings, some of which are really long. My objective is to pass them through an API that has a character limit. Hence, I need to break each comma-separated string into multiple rows that each stay within the maximum character limit.

 

Moreover, I would like to keep whole words intact. Some of the methods I've tried cut words in the middle in order to fit the character limit.

 

I'm thinking this might require a multi-row formula. However, I'm really not sure how to do it. Would appreciate any help! 🙂

 

(I've also attached a snippet of the actual data I'm working on, in CSV format. Note that it is in Tamil; UTF-8 encoding should work!)

16 REPLIES
atcodedog05
22 - Nova

Hi @alpro-23 

 

One way is to break them into one sentence per line, using the full stop as the delimiter.

 

Workflow:

atcodedog05_0-1629358986215.png

 

Hope this helps : )

 

AngelosPachis
16 - Nebula

Hi @alpro-23 ,

 

You can probably make use of a regex tool, set to tokenize your string to the desired max character limit.

 

As you can see below, I've set the character limit to 50 (the dot stands for any possible character). Of course, you can change that to your API's character limit.

 

AngelosPachis_1-1629359034501.png

 

and in my output, all strings have a length of 50.

 

AngelosPachis_0-1629358995601.png

 

It might be worth concatenating all strings together first and then using the regex tool to tokenize. That way, you reduce the number of records whose length is something other than 50.
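For anyone following along outside Designer, the fixed-width tokenizing described above can be sketched in Python. This is only an illustration: the function name `tokenize_fixed_width` is made up for this example, and the pattern `.{1,50}` is an assumption about what the tool was configured with, since the exact expression is only visible in the screenshot.

```python
import re

def tokenize_fixed_width(text: str, width: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most `width` characters."""
    # ".{1,width}" greedily takes up to `width` characters per match;
    # re.DOTALL lets "." match newline characters as well.
    return re.findall(rf".{{1,{width}}}", text, flags=re.DOTALL)

text = "Lorem Ipsum is simply dummy text of the printing industry"
for chunk in tokenize_fixed_width(text, width=20):
    print(repr(chunk))
```

Note that this approach counts characters only, so a chunk boundary can land in the middle of a word.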

 

Hope that helps,

Angelos

alpro-23
7 - Meteor

Hi @AngelosPachis , thanks for the answer - I've tried this method but I know it will break my words apart since it tokenises character-by-character... 

atcodedog05
22 - Nova

Hi @alpro-23 

 

What is the max limit? Maybe we can break the text into:

 

1. sentences

2. each sentence into 2 to 5 parts while keeping words intact

3. break them into words.

 

Let me know your thoughts

Drussek
9 - Comet

There are many methods to do this.

alpro-23
7 - Meteor

@Drussek thanks for this solution. Are there any ways for us not to break the words up? 

 

I found that this is a common issue I ran into while trying out some of the solutions that you guys offered... (thanks still!)

 

 

For example, in the sample output, the first and last words are broken up due to the character limit.

"ublishing software like Aldus PageMaker including versions of Lorem Ips"

 

I have updated my post as well.

atcodedog05
22 - Nova

Hi @alpro-23 

 

I can see you have unmarked the solution. Here is my take on the use case.

 

Workflow:

atcodedog05_0-1629362063527.png

 

 

I am tokenizing in such a way that each chunk has a maximum length of 50 and ends at a space, so words are kept intact. If your maximum length is different, change 50 to the desired number.
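The same idea, sketched in Python for readers who want to test it outside the workflow. The exact expression used in the workflow is not shown in the thread, so the pattern below is an assumption, and `split_on_word_boundaries` is a hypothetical helper name. A single word longer than the limit is kept whole rather than cut.

```python
import re

def split_on_word_boundaries(text: str, max_len: int = 50) -> list[str]:
    """Split text into chunks of at most max_len characters without cutting words."""
    pattern = re.compile(
        # Longest run of up to max_len characters that starts on a
        # non-space character and stops right before whitespace or the end.
        rf"\S.{{0,{max_len - 1}}}(?=\s|$)"
        # Fallback: a single word longer than max_len is kept whole.
        r"|\S+",
        flags=re.DOTALL,
    )
    # rstrip() drops whitespace that can be left at the end of a chunk.
    return [m.group().rstrip() for m in pattern.finditer(text)]

sample = "publishing software like Aldus PageMaker including versions of Lorem Ipsum"
for chunk in split_on_word_boundaries(sample, max_len=30):
    print(len(chunk), chunk)
```

The lookahead `(?=\s|$)` is what forces every chunk to stop at a word boundary; without it the split falls back to plain fixed-width cutting.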

 

Edit: @alpro-23  updated with minor enhancements

 

Please check and let me know.

 

Hope this helps : )

 

alpro-23
7 - Meteor

Hi @atcodedog05. Appreciate your effort and thanks for following the thread. 

 

I may have simplified my sample data way too much and I realise I have misled my audience. Sincerely sorry about it. 

 

In my real dataset that I'm working on, my string columns are actually tokenised words. They are currently comma-separated, and they are not in full sentences.

 

As I am working on a Tamil dataset, it looks like this:

 

வலுவானம்,நிலைப்படுத்து,உள்,உள்,உள்,நமது,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உள்,பங்கேற்றுள்ளம்,உள்,உள்,உள்,உள்,•,உள்,சொல்,உள்,சென்றடை,உள்,உள்,உள்,ஒருங்கிணை,உள்,நெஞ்சார்,உள்,உள்,உமான,முடி,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உள்,உ

 

What I am trying to do is pass these words to a translator through an API, which means that there's a character limit. I am not a native Tamil speaker, so I need to analyse them in English.

 

Another problem that I faced using @Drussek's substring approach is that my output strings became wrongly encoded (they looked fuzzy!).

 

In this situation, does anyone know what I can do about it? I've attached a sample csv file for clarity.

atcodedog05
22 - Nova

Hi @alpro-23 

 

Would it be OK to split on commas? 🤔 It's a bit confusing since we are dealing with unknown input 😅
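If splitting on commas is acceptable, one possible approach is to split each string into its tokens and then greedily pack the tokens back into comma-separated chunks that stay under the limit. Below is a minimal Python sketch of that idea, not a Designer workflow. The helper name `pack_tokens` and the limit of 20 are made up for illustration, and the sketch assumes the API counts characters the way Python's `len()` does (Unicode code points, not bytes), which may not hold for every API.

```python
def pack_tokens(line: str, max_chars: int, sep: str = ",") -> list[str]:
    """Greedily pack separator-delimited tokens into chunks of at most max_chars."""
    chunks: list[str] = []
    current = ""
    for token in line.split(sep):
        if not token:
            continue  # skip empty tokens produced by consecutive separators
        candidate = token if not current else current + sep + token
        if not current or len(candidate) <= max_chars:
            # Either the chunk is empty (an overlong token is kept whole)
            # or the token still fits within the limit.
            current = candidate
        else:
            chunks.append(current)
            current = token
    if current:
        chunks.append(current)
    return chunks

line = "வலுவானம்,நிலைப்படுத்து,உள்,உள்,உள்,நமது,உள்,பங்கேற்றுள்ளம்,சொல்,சென்றடை"
for chunk in pack_tokens(line, max_chars=20):
    print(len(chunk), chunk)
```

Because the string is only ever cut at a comma, no token is split in the middle, which should also avoid the garbled output that can appear when a substring cut lands inside a multi-part Tamil character.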
