Stripping RTF tags
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I have a dataset with a bunch of RTF tags, which I need to strip and keep the plain text. The output will be a CSV, so I don't need to keep any formatting, only the text.
Here are some examples of the tags and what the expected output after stripping is.
Input with RTF | Expected Output |
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 Microsoft Sans Serif;}} \viewkind4\uc1\pard\f0\fs17 left vm \par } | left vm |
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 Microsoft Sans Serif;}} \viewkind4\uc1\pard\f0\fs17 $1894.04 EFT\par } | $1894.04 EFT |
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 Microsoft Sans Serif;}} {\colortbl ;\red0\green0\blue0;} \viewkind4\uc1\pard\cf1\f0\fs16 N/a, didn't leave a vm, will f/up\cf0\fs17\par } | N/a, didn't leave a vm, will f/up |
{\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fnil\fcharset0 Microsoft Sans Serif;}{\f1\fnil Microsoft Sans Serif;}} \viewkind4\uc1\pard\f0\fs17 12/19/14- lef vm regarding saturday meeting again. will follow up in January 2015 if they do not call back.\f1\par \par \f0 12/15/14- left vm regarding saturding meeting week of the 22nd, will follow up Friday if I do not hear back.\par \par \par 12/4- left vm. MD\f1\par \par \f0 11/26- left vm. MD\f1\par \par \f0 11/4- left vm. MD\f1\par \par \f0 thx for attending seminar. confirming Nov 1st sat, 1 or 3pm??\par } | 12/19/14- lef vm regarding saturday meeting again. will follow up in January 2015 if they do not call back.12/15/14- left vm regarding saturding meeting week of the 22nd, will follow up Friday if I do not hear back.12/4- left vm.11/26- left vm.11/4- left vm.thx for attending seminar. confirming Nov 1st sat, 1 or 3pm?? |
Is this something that can be done in Alteryx? Possibly with a Regex?
Solved! Go to Solution.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Something like this should work:
\\f0\\fs\d\d(.*?)\\
EDITED - fixed
EDITED 2 - fixed better 😉
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
@jamielaird Thank you for the reply. I tried your regex but it didn't seem to work. On the left is the input.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
@RishiK Thank you for the workflow. It's somewhat working (understandably removing special characters in the actual message, I didn't expect otherwise), but is there a way to do it without splitting the text into a bunch of lines?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
If you go down the regex route then more often than not, it will be split into separate lines whilst you're cleansing it. Is there a reason why that would be an issue?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
It was more trying to avoid the hassle of piecing them back together, but ultimately I decided to go a different route. The regex method more or less worked but it didn't preserve the integrity of the data, which ended up being more important than I thought.
In case this helps anyone in the future, there is a RichTextBox in C# which is able to parse RTF tags and convert RTF to plaintext. My solution was to export the data as a CSV, read in the CSV file in C#, strip the tags, and output the clean text as a CSV which is then read in by Alteryx. As far as I know, Alteryx doesn't have C# integration (only Python and R), so this process isn't 100% automated but it's better than nothing.
I'll accept your answer as solution since it was the best answer I received, thank you.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Just wanted to say this worked perfectly for me! Thank you so much. Would it be too much trouble for you to explain how you constructed your RegEx expression?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Ok, this one hasn't a had a lot of activity recently, but just in case others come here like me looking for an answer, I stumbled on something that worked for me which is very easy. At some point, Excel introduced a feature that if you have rich text tags in the cell, and you just go into the cell (using F2 or just clicking into it), then hit enter, it will convert it to text. Not sure this will work for everyone, but you could have a step to write a file, then perform the above step, then another workflow after using the new text field.
If you don't want to hit F2 on each cell, you can use this VBA macro:
Sub EditModeForRange()
'
' EditModeForRange Macro
Dim rng As Range
Dim cel As Range
Set rng = Application.Selection
For Each cel In rng.Cells
With cel
Application.SendKeys "{F2}{ENTER}"
End With
Next cel
End Sub
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I had a similar issue and this worked great! I really need to work on my RegEx!
