Get Inspire insights from former attendees in our AMA discussion thread on Inspire Buzz. ACEs and other community members are on call all week to answer!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Subtracting two files with large datasets

vizwiz1
5 - Atom

Hello Community,

 

I am fairly new to Alteryx and am having a hard time figuring out a problem.

I have 2 files (csv):

Total File sample (File A)- This file has all the items in a system (a superset if you will) (100M rows)

Failed items (File B)- This file has all the failed items which did not enter the system (300,000 rows)

 

Now, I want to get the passed items, which are the ones which are not in File B, but in File A (essentially A-B).

Also, I want to add a column which says if the item Passed/Failed to ultimately calculate the pass %.

 

How would I get one file with all the items and a column which mentions if an item passed/failed on a particular date?

 

Any help is much appreciated.

I have attached the 2 sample files 

 

6 REPLIES 6
MarqueeCrew
20 - Arcturus
20 - Arcturus

@vizwiz1,

 

Here's a starter workflow to help you.

 

Cheers,

Mark

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
LordNeilLord
15 - Aurora

Hey @vizwiz1

 

A join tool will do this perfectly....anything from the left will be a "pass" (they only appear in the Total file)...Anything from the join will be fails (they appear in both Total and Fail)...here's an example..

 

PassFail.PNG

 

@LordNeilLord

Part time Tableau, Part Time Alteryx. Full Time Awesome


Data Lover

vizwiz1
5 - Atom

Thanks LordNeilLord !

 

As i mentioned I am joining csv files with a large number of rows (about 100M).

I followed the process mentioned, but I have some questions:

1. Should item number be a string or an int32 (it is a 6 digit item number)? I ask this as i tried doing both and int32 takes significantly longer than string to run and the file size blows up as well.

Should insert date be a date field or it can be a string as well?

2. I am using 3 join conditions (item number, date and source). Does this create a Cartesian join?

3. In my output, the countd(pass+fail) does not match the total countd of item numbers. I suspect this maybe because of some noise in the data. What data cleaning would you recommend? I am currently only clearing whitespaces.

 

Thanks a lot for your help!

LordNeilLord
15 - Aurora

Hey @vizwiz1

 

No problem, glad we could help!

 

I'll try and answer your questions ( @MarqueeCrew may have different opinion)

 

  1. As Item number is an identifier, I think it should be a string...my personal choice
  2. If you're join on multiple conditions, this should help reduce the results and is better...a cartesian join is the opposite where you join everything to everything (an append)
  3. If you count of Pass + Fail doesn't match your original number of items then something is wrong...do you have items that can be a pass and fail?

@LordNeilLord

Part time Tableau, Part Time Alteryx. Full Time Awesome


Data Lover

MarqueeCrew
20 - Arcturus
20 - Arcturus

@vizwiz1,

 

I'm in agreement with @LordNeilLord.  An Item Number is a STRING!

 

A number is something that you can ADD or AVERAGE.  You can't add a phone number or average a ZIP code.  For those reasons, I vote "String".  

 

If you're concerned about your join keys, take a UNIQUE tool and make sure that they are unique.  Using more join keys than is necessary is wasteful.   But, better safe than sorry.

 

It is worth the investigation to find out why the P+F <> Total.  Something is rotten in the state of Denmark.

 

Cheers,

Mark

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
vizwiz1
5 - Atom

Thanks @LordNeilLord and

 

 

 

Labels