
Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

File Transfer is extremely slow between Large CSV file to MySql

Jayasuryan
7 - Meteor

Hi Folks,

 

I am looking for help with my assignment. The assignment is to transfer all the rows of a CSV file containing half a million records (500K rows) into a single table in a database. Since I have MySQL installed on my system, I am trying to transfer the records into it, but the transfer is extremely slow. Could you please provide any suggestions for speeding up the transfer, or any other ideas? Thanks in advance.

 

Input: CSV file (86MB, 500K records)

Output: Single table in MySQL

ODBC driver: MySQL ODBC 8.0 Unicode Driver (64-bit)

 

11 REPLIES
danilang
19 - Altair

Hi @Jayasuryan 

 

This could be caused by the method you're using in the Output tool. If you just need to get all the records into an empty database, make sure you use "Create Table". This will add the records to a new table in the database with no validation. If you use "Update: Insert if New" on a table that already contains records, MySQL has to perform the following process for each record:

1. Find the record using the primary key.

2. If it exists, update the record; if not, create a new record.

With 500K records already in your database, this can take quite a while.
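
For a sense of what that per-record round trip involves, each row effectively becomes its own small statement sent over ODBC. A rough SQL equivalent, sketched with placeholder table and column names (the exact statements the driver issues may differ):

-- Illustrative per-row upsert, roughly what "Update: Insert if New" implies.
-- Table and column names here are placeholders, not the real schema.
INSERT INTO products (sku, name, description)
VALUES ('ABC-123', 'Example name', 'Example description')
ON DUPLICATE KEY UPDATE
    name        = VALUES(name),
    description = VALUES(description);

Repeated 500K times, the per-statement overhead dominates the runtime, which is why a straight bulk load into an empty table is so much faster.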

 

Dan

Jayasuryan
7 - Meteor

Hi, danilang, 

 

First of all thanks for your reply.

 

Yes, you are right. I have created an empty table with one of the columns as the primary key, because the assignment mandates this rule (please find the points to achieve below). In the output options of the Output tool, I have chosen 'Delete Data and Append'.

 

 

Points to achieve

  1. Your code should follow the concepts of OOP
  2. Support for regular non-blocking parallel ingestion of the given file into a table. Consider thinking about the scale of what should happen if the file is to be processed in 2 mins.
  3. Support for updating existing products in the table based on `sku` as the primary key.
  4. All 500k rows to be ingested into a single table
  5. An aggregated table on above rows with `name` and `no. of products` as the columns
  6. Ingest the same file at least 2 times, without truncating the products table.
danilang
19 - Altair

Hi @Jayasuryan 

 

Unfortunately, the output method you've selected, "Delete Data and Append", violates point 6, since deleting all the data effectively truncates the table. It also violates point 2: between the time you delete the data and reinsert all the values, the table will be incomplete, and any other parallel process that tries to access it in that window is effectively blocked by the lack of data.

 

I see two potential ways to satisfy all these criteria and still have a workflow that runs acceptably fast.

 

1. Alteryx-only solution. You'll have to work within the limitations imposed by "Update: Insert if New" to avoid blocking other applications from accessing the data. Simply passing all the records to the database is extremely slow, as you mentioned, so use the speed of the Alteryx engine to your advantage. One way to do this is to read all the data from both the CSV and the MySQL table, join on the primary key and compare all the fields to find only the records that have changed, then write only those changed records back to the database.

 

2. Hybrid solution. Create a temp (staging) table in your database with the same structure as your main table, and create an AFTER INSERT trigger that updates the records in the main table based on the values inserted into the temp one. Use "Delete Data and Append" to dump the data into the temp table; the database will then update the main table as quickly as possible. See the sketch below.
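
A minimal MySQL sketch of that staging-table-plus-trigger idea, with assumed table and column names (products, products_staging, sku, name, description are placeholders for illustration, not the actual schema):

-- Staging table with the same structure as the main table.
CREATE TABLE products_staging LIKE products;

-- Each row landing in staging is upserted into the main table,
-- so the main table is never truncated and stays readable throughout.
CREATE TRIGGER products_staging_ai
AFTER INSERT ON products_staging
FOR EACH ROW
    INSERT INTO products (sku, name, description)
    VALUES (NEW.sku, NEW.name, NEW.description)
    ON DUPLICATE KEY UPDATE
        name        = NEW.name,
        description = NEW.description;

The Alteryx workflow then only ever writes to products_staging with "Delete Data and Append", and MySQL moves the rows into the main table on its own.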

 

Dan

 

 

 

Jayasuryan
7 - Meteor

Hi @danilang,

 

I have a doubt about your first, Alteryx-only solution.

 

You asked me to read and compare the CSV and the MySQL table together. But for the MySQL table to contain all of the records in the first place, I need to be able to transfer the records into it from the CSV. This brings me back to the original reason for this post: I am unable to transfer the data from the CSV to the MySQL table because the transfer is very slow.

 

Kindly reply.

danilang
19 - Altair

@Jayasuryan 

 

Unfortunately, I'm not familiar with MySQL, so I can't tell you why the initial data load is so slow.  What are the specs of the machine you're running on?

 

Dan

Jayasuryan
7 - Meteor

Hi @danilang,

 

Processor - Intel i3, 3rd generation, CPU @ 2.4GHz

RAM - 4GB

64-bit OS and processor.

Windows 8.

 

Or should I install a different database that can overcome this issue? Any suggestions? Also, I am not familiar with the Alteryx database. Is it relevant for fulfilling this assignment?

danilang
19 - Altair

@Jayasuryan 

 

The speed issue is probably caused by having only 4GB of RAM on your machine. Both Alteryx and MySQL have a 4GB minimum specification (8GB recommended), and you're trying to run both of them on the same machine. The OS takes up about 2GB of this, leaving only about 2GB to be shared between Alteryx and MySQL. If you have any other memory-intensive programs open, e.g. Chrome or any MS Office application, they will reduce the available memory even more. The OS ends up constantly swapping memory out to disk, which is a very slow operation.

 

The long-term solution is to get more RAM. In the meantime, close all other applications, leaving only Alteryx Designer open. This should maximize available RAM and allow the process to run as quickly as possible.

 

As far as other databases go (MongoDB, the free edition of SQL Server, etc.), they'll all run into the same memory issues. The Alteryx database format (.yxdb) isn't suited to your purposes, since it's really just a single highly compressed table that can only be written and read as a whole, so it would fail your criteria 2 and 6.

 

Dan 

 

 

Jayasuryan
7 - Meteor

Hi @danilang ,

 

I applied the Alteryx-only solution you suggested: comparing the CSV and the MySQL products table, and finally ingesting an aggregated table with 'name' and 'no. of products' as asked. The Output Data tool uses the 'Update: Insert if New' output option for the aggregated table ('name' as the primary key). The first run was fine; I was able to insert the names with their counts.

 

6. Ingest the same file at least 2 times, without truncating the products table.

 

To achieve this step, I needed to run the workflow again. I didn't touch the MySQL products table, and everything else is the same as the first run, but this time it threw the following error:

 

Output Data (10) DataWrap2ODBC::SendBatch: [MySQL][ODBC 8.0(w) Driver][mysqld-8.0.18]Duplicate entry 'Amy Whitaker' for key 'PRIMARY' Insert into `aggregate_table`(`name`,`no_of_products`) Values (?,?)

 

According to 'Update: Insert if New', it should update existing records and only insert ones that are new. This run is identical to the previous one, with no data changes, yet it reports Duplicate entry 'Amy Whitaker' for key 'PRIMARY' on the insert into `aggregate_table`.

 

Also, one of the Output Data messages says: Output Data (10) 66497 records were written to odbc:DSN=MySQL-DSN;UID=root;PWD=__EncPwd1__ (aggregate_table). But when I checked in MySQL, the total number of records (222024) is unchanged from the previous run, so it is a mystery what happened to the 66497 records that were reportedly written.
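
One way to sidestep the duplicate-entry error, if running SQL directly against MySQL is acceptable for the assignment, is to rebuild the aggregate with an explicit upsert instead of plain inserts. A sketch only; `aggregate_table`, `name` and `no_of_products` come from the error message, while the `products` source table name is assumed:

-- Recompute the counts and upsert them, so re-running the load never
-- hits "Duplicate entry ... for key 'PRIMARY'" on aggregate_table.
INSERT INTO aggregate_table (name, no_of_products)
SELECT name, COUNT(*) AS no_of_products
FROM   products                     -- assumed name of the main products table
GROUP  BY name
ON DUPLICATE KEY UPDATE
    no_of_products = VALUES(no_of_products);

A statement like this could also be placed in the Output tool's pre/post SQL options, if that fits the assignment's constraints.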

 

I am attaching the workflow and a screenshot of the messages. It would be really helpful if you could provide your valuable input. Thanks again, Dan.

akhilSrini
5 - Atom

 

I have a similar kind of issue.

 

Can someone suggest how to import data from a CSV into a database in a distributed fashion - threading or anything else? (I know what the data looks like: it has a primary key and two other columns.)

I need to load half a million records in less than 2 minutes.

Any suggestions on how I can do that would be greatly appreciated.
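
For the raw load-speed requirement, one option worth testing is MySQL's native bulk loader, which usually pushes 500K rows in well under 2 minutes even on modest hardware. A sketch only; the file path, table name and column list are placeholders, and local_infile has to be enabled on both server and client:

-- Bulk-load the CSV straight into MySQL (fastest native path).
-- Requires local_infile to be enabled (e.g. SET GLOBAL local_infile = 1;
-- on the server, and a client started with local-infile allowed).
LOAD DATA LOCAL INFILE 'C:/data/products.csv'
INTO TABLE products
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES                      -- skip the header row
(sku, column_b, column_c);          -- placeholder column list

Re-ingesting the same file without truncating could then go through a staging table plus trigger, or an INSERT ... ON DUPLICATE KEY UPDATE, as discussed earlier in the thread.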

 

 

Thanks

 
