Hello clever people,
I've built a macro that goes to a 3rd party website and fetches a considerable amount of data (we're talking millions of rows, dozens of columns). It is very stable and has delivered great value to a lot of people. The users input their API key, the model ID, the table ID, and click the run button.
It is, however, slow. It takes about 9-10 minutes to fetch about 1M rows, and it's not uncommon for me to get as much as 10 M rows. The data changes frequently, so I do need to update the whole thing every time.
The 3rd party has a page limit of 10 K rows and 10 MB, and they can't change that. However, they have recently been two new pieces of information that I learnt:
1) the 3rd party developed a new API endpoint
2) the 3rd party can issue multiple API keys for service accounts
So with that new info: can I finally make my macro faster?
Now I will explain how the version of the endpoint works.
This new endpoint returns a unique page key given the target page size, table ID and model ID. So if we have a table with 15000 rows and we set the page size to 5000, we can call the data using these three URLS:
https://api.blablabla/pageSize=5000&key=[###]
https://api.blablabla/pageSize=5000&key=[###1]
https://api.blablabla/pageSize=5000&key=[###C]
My macro is an iterative macro within a standard macro. The iterative macro fetches all the URLS that are needed to get the data from the target dataset. The standard macro takes those URLs and sticks them into a download tool, then parses out the JSON.
I'm looking for creative ideas on how to make this whole thing work faster, considering that I can use several different API keys. I want to get a substantial speed improvement (x2 at least).
What I have tried so far (and nothing produced results):
1) In the standard macro piece, tile the output of the iterative macro and parallel the downloads
2) Increasing the number of connections in the download tool - cannot go over 32
Many thanks in advance for the help to this wonderful community!
Probably depends upon how you'd write it. Like in Alteryx - I'd recommend using multiple rows fed into a single download tool vs an iterative macro or something like that. a batch macro would be useful is core memory is an issue.
A Python script (outside of Alteryx) could be faster depending upon how you code this. You are stlil going to face memory and bandwidth issues. If this is written poorly - it will not be faster. I do not love gsheets native support for Python - and expect this might be an issue for you down the line. Python in Alteryx will not work.
If this was my project - I'd dump the macros and do this entirely in a standard workflow. I do not believe an iterative macro can ever help performance. A batch macro can because of memory allocation (at least on some versions of Alteryx).
You're current performance sounds decent - and my next step would be to do this via python in a containerized cloud environment on schedule (ie ECS/EC2) - and then push the data into my DB/S3.
The bottleneck of my process is getting the page keys, and I really don't see a way to do that using anything besides an iterative macro. The API call returns current key and next key, that's why I'm looping through the list one by one. Perhaps there is a better way? I've never used batch macros before.
So many of these apis use a rowcount/offset system - with call 1 returning the total count. This allows you to use generate rows and then use floor/ceil division to create the offset commands needed to create the syntax needed for successive calls. If you do not have that row count - you will not be able to do that and you would have to use an interative macro. Is the API documentation public - and if so - can you link to it?
The documentation isn't public but - I used this exact approach in the previous version of my macro: getting the total count of rows and splitting it into "batches". Then for every batch (record ID) I would run the same iterative macro (green on this screenshot) getting the data. The batches could run in parallel and then the outputs would get unioned together.
Are you saying that I can use a batch macro instead of an iterative macro here?
I figured it out - many thanks, @apathetichell !!
All this time I did not realize that the download tool can ingest multiple URLs at the same time.
So I completely removed the iterative piece and instead created a formula that assembles urls based on the total page count:
api.blablabla/pageindex=1&pagezie=9000
api.blablabla/pageindex=2&pagezie=9000
api.blablabla/pageindex=3&pagezie=9000
etc
And this whole thing goes in a download tool, no problem.
I was able to decrease my macro run time almost x5 times!!
Great to hear @m_v - can you mark my suggestion as a solution? I'm a big of the multiple call/offset/limit approach.