Comparing two datasets

Dear community,

A quick question on comparing two datasets.

I have two large datasets (over 100 columns and 150k rows) which are largely identical, but there are some small differences I wanted to identify. I used a method described on the forum, which uses both transpose and crosstab, in order to compare every single cell between the two tables.

The issue is that the original data, which contains a mixture of strings, numbers and dates, are all converted to V_WString with a size 2048 after they've been through the transpose/crosstab combo, which creates some unusual errors.

The best example is table A having a value of say 9.48 and table B has 09.48 in the corresponding cell. That leading zero obviously shouldn't be there (it's interesting why it even appeared to begin with) and it's creating a lot of false positive noise which masks the real differences.

Any ideas on how to deal with this? Thank you!

Data Investigation

Accepted answers

davidskaife

Hi @rtusco

Looking at the help page it has the following info on Fixed Decimal:

The length (precision) of a fixed decimal is equal to the width of the integer (left side of decimal) plus the decimal point plus the width of the scale (right side of decimal). If a number is negative, the negative sign is also included in the length.

So effectively as your values are two digits wide before the decimal point (which also includes the - sign) its keeping the length, and thus your summed value becomes 09.

You can try changing the Data Type for those results to something else such as Double using the Select tool, which gets rid of the leading 0

All comments

ShankerV

Hi @rtusco

Any possibility to share the same data's.

Or upload the workflow with sample data and where the error is prompted.

It will be easy to troubleshoot as the conversion in data types and many other factors will cause the issue.

Many thanks

Shanker V

ShankerV

@rtusco

Only one possibility to have trailing 0 in front of the number is when we have the data's in Excel in General format.

$9.44 in General and modifying to string datatype may lead to adding 0 in front.

When you convert the string in Alteryx to integer, the 0 will disappear as numbers cant start with 0.

Many thanks

Shanker V

ShankerV

Hi @rtusco

Input tested:

Output:

This approach can help to overcome your issue. If anything else please drop a reply.

If it resolves your issue. Feel free to mark helpful answers as a solution, so that future users with the same issue can find them easier!!!!

Many thanks

Shanker V

davidskaife

Hey @rtusco

If you're able to download the CReW Macros (http://www.chaosreignswithin.com/) there is a Expect Equal Tool, which may help and seems to ignore what the field data type is set to?

It requires no config and provides details on what and where data differs.

rtusco

Hi guys,

First of all, thank you so much for the prompt replies. Your suggestions are giving me ideas to try. Unfortunately, my data is rather sensitive to post outright.

I had previously tried the CreW Macro, but at the time it didn't quite provide the result I was expecting so I didn't explore it too much.

I have done some more investigating and I think I isolated the step which produces the error. At a particular step I aggregate the data, by grouping by one column and then selecting a bunch of "First" and a few "Sum" from some of the remaining columns. The column in question is subjected to SUM and its data format immediately preceding the summarise is FixedDecimal 19.2. I have in fact opted to use FixedDecimal 19.2 (and a few 19.0 and 19.5) for all numeric fields throughout the workflow in order to obtain an output that visually looks consistent when exported to excel, i.e. always the same number of digits and decimal points, even for integer numbers. Before the summarise I have two rows: 12.52 and -3.26. After summarise the same column has one row as 09.26. Any ideas?

davidskaife

Hi @rtusco

Looking at the help page it has the following info on Fixed Decimal:

So effectively as your values are two digits wide before the decimal point (which also includes the - sign) its keeping the length, and thus your summed value becomes 09.

You can try changing the Data Type for those results to something else such as Double using the Select tool, which gets rid of the leading 0

JBLove

@rtusco -

It could be happening in the Cross Tab tool rather the Summary Tool. You mentioned earlier that you're using the Cross Tab tool.

One of the configuration steps here requires you to specify an aggregation method. If you selected concatenation and did not provide a separator then "09.26" could actually be the values "0" and "9.26" concatenated.

If that's the case, then you will need to determine why the "0" row exists.

Good luck.

rtusco

Thank you all for your suggestions, you make valid, useful points.

In the end, @davidskaife was right. It appears that every FixedDecimal cell's length was fixed based on the initial data. Given there's a 12.52 (5 characters) in one of the cells and 9.26 (4 character) in the final result, it seems that alteryx "filled in the blanks" with an extra zero at the front. Using Double helped.

Quick Links

This months top contributors

atcodedog05 19458

Qiu 15866

binu_acs 15708

MarqueeCrew 13708

apathetichell 13703