YXDB Files derived by different processes have same content but different size
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hey,
I'm currently restructuring some processes. So far, we have retrieved some data from a Google Big Query instance with help of the Alteryx python tool / Google Big Query tool directly in an Alteryx workflow. The data was written to a YXDB and has a size of roughly ~ 3 GB.
Due to multiple reasons we're now retrieving the data directly through python via Databricks and store it as a CSV. Because some Alteryx Workflows will be using the data, we read the CSV in, adjust the field types and sizes and store it as a YXDB. The resulting YXDB has a size of roughly ~ 6 GB.
I checked the contents of the files (on samples) and the available metadata on column names, types and size and they all seem to be exactly the same. Do you have an idea why the second YXDB is double the size of the first YXDB?
- Labels:
- Datasets
- Output
- Transformation
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @huettth ,
Can you provide your workflows and some sample data so we can look at the processes?
Thanks,
M.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Unfortunately this is not possible due to licence and privacy reasons. The workflows are very simple though.
1. The first consists out of 4 Tools (Python Tool, Select Tool, Formula tool (to append run date) and Output Tool).
2. The second approach basically uses the same python code which is used in the python tool in alteryx, but writes it to a CSV File (UTF-8). The Alteryx workflow then uses an Input Tool, a select tool (to adjust the fieldtypes and sizes), a formula tool (to add the run date) and an output tool.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
You mentioned that you had
checked the contents of the files (on samples)
Suggestions to troubleshoot: Read in both YXDB files, then
1) use the Field Info tool for each file then a Join tool, joining on field name, data type, field size, precision
2) compare total record counts
3) Join each row from File 1 to File 2 using a Unique Key. If a Unique Key doesn't exist, try creating one by concatenating all field values into one field.
If all of the above steps show no discrepancies, I would suggest contacting Alteryx support.
Chris
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hello @ChrisTX and thanks for you suggestions.
1) I did this already: name,type and size of the fields are all the same.
2) The total record count is exactly the same.
3) I used a sample of records because the database is constantly updated and so far we did not manage to pull the data at the exact same time. If comparing the full datasets I get these results:
matching values: 135,263,499
different values: 210,963
To achieve this comparison I sorted the records, gave them a record ID, transposed the dataset and joined on record ID and name to compare the values. A cause ofr the different values could either be an incomplete sorting (e.g. missing some relevant values to sort by, which lead to different order of records) or that values have been updated in between.
Different values are e.g. updated timestamps or codes, so the length of characters etc. usually is the same. So I do not expect that to be the reason for 3 GBs of size difference, especially as this remains like this if the workflows run again.
I do load the CSV from a S3 Bucket with the S3 tool and I set the field length to 10000 to not risk any cells being truncated. Might this be a reason for the increased file size of the YXDB? I would not understand why, since the files seem to be identically regarding characteristics which should be relevant to file size.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
If the name,type,size of the fields are all the same, it shouldn't matter that the data originated from a CSV file.
Suggestion to continue troubleshooting: Try deleting the records where any field value is different, then saving 2 new YXDB files, and see if the file size is the same.
I agree the file size should not be double if there are only "some" / a few records with different values.
But until you can create two YXDB files with exactly the same meta data and data, it may be difficult to approach the support team.
Chris
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I did some more investigations:
1. I transformed the CSV to YXDB with Alteyx Desktop and using regular in- and output tools -> resulting in a filesize of ~3,4 GB
2. I transformed the CSV retrieved via an S3 Download tool from s3 and output it with the regular output tool using Alteryx Desktop -> resulting in a filesize of ~3,4 GB
3. I input the YXDB file retrieved from the Alteryx Server workflow (that uses s3 in- and output tools) using Alteryx Desktop and the regular in- and output tools, only using a select tool in between -> resulting in a filesize of ~3.4 GB
This leads me to the conclusion that either the workflow on the Alteryx Server or the s3 output tool is causing the increase in file size.
Are there any settings which are per default different between the alteryx server and the desktop version and that I should check because they are known to have an influence on file size?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I can't think of any configuration that would cause a difference in output file size, when running on the desktop versus server.
Can you run your workflow on the desktop & the server, then compare file sizes? This could rule out one of your remaining possible causes.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
It took me a while to get back to this topic. I have not been sucessful in running the exact same workflow on the Server and on Desktop, as running the workflow on Desktop results in an error in the S3 upload tool:
"Error from AWS: The request signature we calculated does not match the signature you provided. Check your key and signing method."
If I reduce the input file (only load a small share of records) everything works finde, so I assume this has nothing to do with the provided key.
Using a standard output tool instead of the s3-upload tool and writing the file to my drive, once again, results in a filesize of ~3.4 GB. For me, the pattern only allows the conclusion that running the workflow on the Alteryx Server leads to the increased filesize.
While researching, I found a post about a text input tool causing increased file sizes together with the AMP engine. I'm not using a text input tool, but maybe some default settings on the Alteryx Server differ from settings on Alteryx Desktop and thus are leading to the increased file size?