We’ve extended Inspire Early Bird Pricing until March 1. Register now and enjoy 20% off conference passes and 10% off training passes. P.S. Don’t forget to bring friends! When you sign up for five or more tickets, you get an extra 20% discount on conference passes. Learn more now.

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Parsing Web Data with Commas Embedded In A Field

ssutton
6 - Meteoroid

I am relatively new to Alteryx and thus far been successful with most data parsing tasks.  I have recently started to experiment with some public data available from https://download.bls.gov/pub/time.series/ap/ and it has posed a challenge for me.

 

The file in question is located here:  https://download.bls.gov/pub/time.series/ap/ap.series

 

This file is a challenge because the "field" called series_title is filled with individual elements that are separated by commas. 

 

It is difficult for me to determine how best to read this file successfully using the "download" tool and properly parsing that particular LONG field.  If it has been "quoted" on both ends, it would be easier.  

 

The data from that file looks like the following:

 

series_id	area_code	item_code	series_title	footnote_codes	begin_year	begin_period	end_year	end_period
APU0000701111    	0000	701111	Flour, white, all purpose, per lb. (453.6 gm) in U.S. city average, average price, not seasonally adjusted	          	1980	M01	2017	M10
APU0000701311    	0000	701311	Rice, white, long grain, precooked (cost per pound/453.6 grams) in U.S. city average, average price, not seasonally adjusted	          	1980	M01	1981	M12
APU0000701312    	0000	701312	Rice, white, long grain, uncooked, per lb. (453.6 gm) in U.S. city average, average price, not seasonally adjusted	          	1980	M01	2017	M10

 

So my question is how would a person successfully read this file structure directly from the web?

 

I've included the experimental workflow I constructed for reading this file.  Clearly things go bad once it gets to the seriestitle field for obvious reasons.  Hopefully this is an interesting puzzle to someone.  I'm curious how to best solve this scenario since I can imagine it will occur from time-to-time.

 

Thanks in advance for comments and suggestions.

 

-Stew

 

 

 

4 REPLIES 4
KaneG
Alteryx Alumni (Retired)

Hey @ssutton,

 

You are actually so very close... You have the first T2C tool to split newlines, remove the Data Cleansing tool and then split in the 2nd T2C on tab (\t). 

 

When I tried to copy your sample above into a Text Input, it all split apart perfectly which indicated that there must be a delimiter. In this case it was a tab separated file.

 

Kane

ssutton
6 - Meteoroid

@KaneG,

 

Thank you but I must be missing something since I don't get the same result. I also determined that tab was the delimiter but for some reason I can't get it to behave correctly.  Here is my "simplified" / "modified" workflow with the suggestions you made, yet it does not behave as I expect it would.  I must be missing something important (perhaps it is staring me in the face!)

 

Output.PNG

 

Notice how the columns for area_code, item_code, etc. are "blank"?  It is not seeming to parse correctly for me.  Not sure why?  It is as if there are extra "tabs" there but I don't believe so.  More of a configuration error on my part I believe.  Here is the workflow that produces that result (uploaded as Sample-Workflow-02 for inspection.

 

Hopefully my error is entertaining.

 

Best regards,

 

-Stew

KaneG
Alteryx Alumni (Retired)

Your 2nd T2C Tool has ' \t' in it rather than just '\t'. And so the T2C will be splitting on spaces and tabs.

 

Kane

ssutton
6 - Meteoroid

Thank you Kane!  I had a suspicion that it was something so easily missed.  Good lesson learned here on my end.  Much appreciated.

 

-Stew

Labels