Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Alteryx MemoryError in Python tool

Asteroid

I'm trying to run 157 million rows (142 GB of data) through a fairly simple set of if-thens in Python (scoring code), but it runs out of memory. I'm looking for a solution that either:

1. Sets it up so it doesn't run out of memory, or

2. Codes it to process one row at a time (which is what the code does at its core) and feeds the results into the rest of the workflow.

 

Good news: happy to learn that even Alteryx now has a BoatyMcBoatFace joke in it. Sigh.

 

Error message:

Error: unable to read input table "TableMcTableFace" (F:\ProgramData\Alteryx\Engine\1f4cb1ec19fef276ea938bffc4081a5f\2\4460abb7be83bae8f01b9bf1238a923c.sqlite)
ERROR: reading input data "#1"
 
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-2-9910fafd6f32> in <module>
      1 model = Alteryx.importPythonModule('F:\\Dropbox\\Research\\In progress\\CID Step 1 Sharma\\Algorithm\\dr_model.py')
----> 2 data = Alteryx.read("#1")
      3 predictions = model.run_dataframe(data)
      4 
      5 Alteryx.write(pd.DataFrame(predictions),1)

c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\ayx\export.py in read(incoming_connection_name, batch_size, debug, **kwargs)
     32     """
     33     return __CachedData__(debug=debug).read(
---> 34         incoming_connection_name, batch_size=batch_size, **kwargs
     35     )
     36 

c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\ayx\CachedData.py in read(self, incoming_connection_name, batch_size)
    208             try:
    209                 # get the data from the sql db (if only one table exists, no need to specify the table name)
--> 210                 data = db.getData(batch_size=batch_size)
    211                 # print success message
    212                 print("".join(["SUCCESS: ", msg_action]))

c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\ayx\Datafiles.py in getData(self, table, batch_size)
    552             if self.fileformat.filetype == "sqlite":
    553                 query_result = pd.read_sql_query(
--> 554                     "select * from {}".format(table), self.connection
    555                 )
    556             elif self.fileformat.filetype == "yxdb":

c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\pandas\io\sql.py in read_sql_query(sql, con, index_col, coerce_float, params, parse_dates, chunksize)
    312     return pandas_sql.read_query(
    313         sql, index_col=index_col, params=params, coerce_float=coerce_float,
--> 314         parse_dates=parse_dates, chunksize=chunksize)
    315
    316 

c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\pandas\io\sql.py in read_query(self, sql, index_col, coerce_float, params, parse_dates, chunksize)
   1420                                         parse_dates=parse_dates)
   1421         else:
-> 1422             data = self._fetchall_as_list(cursor)
   1423             cursor.close()
   1424 

c:\program files\alteryx\bin\miniconda3\pythontool_venv\lib\site-packages\pandas\io\sql.py in _fetchall_as_list(self, cur)
   1429 
   1430     def _fetchall_as_list(self, cur):
-> 1431         result = cur.fetchall()
   1432         if not isinstance(result, list):
   1433             result = list(result)

MemoryError: 
ACE Emeritus
When reading Input #1, you could try...

Alteryx.read("#1", batch_size=1000)

... although I'm not sure it will help: I looked at the code on GitHub, and it appears that it reads the input in chunks but still does all the reads in one loop, appending to one large data frame (which would probably still run out of memory).
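If you want to stay inside the Python tool, another possible workaround is to bypass Alteryx.read entirely and stream the cached SQLite table yourself with pandas' chunksize option, scoring each chunk as it arrives. This is only a sketch: the cache path and table name are copied from the error message above (Alteryx regenerates them on every run), model.run_dataframe is your own function from the traceback, and reading the engine's cache directly isn't a supported API.

import sqlite3

import pandas as pd
from ayx import Alteryx

# Module path copied from the traceback above.
model = Alteryx.importPythonModule('F:\\Dropbox\\Research\\In progress\\CID Step 1 Sharma\\Algorithm\\dr_model.py')

# Cache path and table name copied from the error message; they change
# on every run, so treat them as placeholders.
db_path = r"F:\ProgramData\Alteryx\Engine\1f4cb1ec19fef276ea938bffc4081a5f\2\4460abb7be83bae8f01b9bf1238a923c.sqlite"

scored_chunks = []
conn = sqlite3.connect(db_path)
# chunksize turns read_sql_query into an iterator of DataFrames, so only
# ~100k rows are materialized at a time instead of all 157 million.
for chunk in pd.read_sql_query("select * from TableMcTableFace", conn, chunksize=100_000):
    scored_chunks.append(pd.DataFrame(model.run_dataframe(chunk)))
conn.close()

# The predictions alone should be far smaller than the 142 GB input,
# so concatenating them is much less likely to exhaust memory.
Alteryx.write(pd.concat(scored_chunks, ignore_index=True), 1)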

So, an alternative would be to put your Python code into an iterative macro that processes some number of rows at a time (e.g., send the first N rows into your Python tool, whose output goes to the iterative macro's final output, and skip the first N rows, sending the remainder directly to the iterative macro's Loop output).
Alteryx Certified Partner
Use the following technique to minimize memory requirements. If the Python code doesn't affect input and output order:

1. Deselect any fields not used in the calculation before the Python tool.

2. Use a Join tool (no key; join by record position) to append the outcome back to the original records.

In short, put less data into Python (see the sketch below).
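The same idea carries over if you read the cached table directly, as sketched earlier: name only the fields the scoring code needs instead of select *. Continuing that sketch, with hypothetical column names:

# Hypothetical column names: pull only the fields the if-then logic uses,
# so each chunk carries a fraction of the 142 GB input.
query = "select record_id, feature_a, feature_b from TableMcTableFace"
for chunk in pd.read_sql_query(query, conn, chunksize=100_000):
    scored_chunks.append(pd.DataFrame(model.run_dataframe(chunk)))  # as before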

Cheers,

Mark
Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and reboot. Order shall return.
Asteroid
Hehe. You just described my workflow perfectly. Already done. :-)
Asteroid
I agree that I'll probably still run out of memory, but I'll try it for fun. The iterative macro is a good idea that I hadn't thought of. It's a bit of a workaround rather than a solution, though, given that I don't feel my dataset is larger than what an Alteryx tool should be able to handle. In fact, the dataset itself was generated in Alteryx, and it represents about half the data I had there.
Meteor

I share your pain. I am working on 500M rows and it consumes all of the memory. @KaiLarsen did you find a solution?

Asteroid
It was just bad code that required that much memory. Nothing Alteryx could do. We refactored the code and it was fine.
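The thread doesn't say what the refactor looked like, but a common fix for exactly this pattern (row-by-row if-then scoring in pandas) is to vectorize the branching with numpy.select, so the whole frame is scored in a handful of array operations instead of 157 million Python-level branch evaluations. A hypothetical sketch, with made-up column names and thresholds:

import numpy as np
import pandas as pd

def run_dataframe(data: pd.DataFrame) -> pd.Series:
    """Score rows with vectorized if-thens instead of a per-row Python loop."""
    # Hypothetical columns and thresholds; the real dr_model.py logic
    # is not shown in the thread.
    conditions = [
        (data["severity"] >= 4) & (data["duration"] > 30),
        data["severity"] >= 2,
    ]
    choices = ["refer", "monitor"]
    # np.select evaluates each condition on whole columns at once.
    return pd.Series(np.select(conditions, choices, default="clear"), index=data.index)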