When training people on the use of action tools, something I always have to emphasize is that when you are telling the tool which piece of the XML you are adjusting, it is difficult to tell what you have selected, and very easy to accidentally select something else.
When you initially select the action to take, it is highlighted in a nice blue color. However, it still doesn't feel like you have actually selected anything or told the Action Tool what to do, since it's so easy to select any other one of these actions.
A slightly different problem is that if you are selecting an action that has been previously configured, it is just this light grey color. So it can be easy to accidentally change your settings because you may not realize it's actually set up.
Here is a recent community post that sort of outlines a few of these problems.
When the Python Tool operates, it seems to always ingest all the data before processing any of it (i.e. no batch processing). Python can handle this type of functionality with generators, so could the tool be updated to do some preprocessing (like imports and data prep), allow a defined generator function to be called repeatedly from a separate input handle, and provide batch data frames on output for more parallel-like processing of data?
The Python Tool could be updated as such:
Multi-Input - Same functionality as now, and also allow this data to be used for preprocessing and setting up the Python functions and a single batch function.
Data Input - Ingests data in batches (as most other tools operate) where each batch passes in a dataframe (in this case, a subset of processed entries) into an existing Python function (with a name that is in globals()), and returns another dataframe with that desired output. This can give the option of adding/removing rows as necessary to a subset of the data.
Data Output - Partial set of data after data processing to allow tools further down the chain to process in parallel.
"On Complete" Multi-Outputs - Same functionality as now, to pass process-complete data to the next tool once all data ingested has been processed. Perhaps give the option to pass the complete set from Data Output.
A simple use-case, if a user wanted to use only the Python Tool (a rough sketch of the batch function follows these steps):
Let's say a user wants to get all URLs from every post in a thread (containing millions of posts) that are in blacklisted domains.
Data prep that sends the list of blacklisted domains into the Python Tool's Multi-Input handle, and that data is transformed and stored in a set within the Python tool once.
A series of posts (strings) are sent in batches (let's say ~10,000) to the Data Input of the Python Tool. The tool calls a defined Python function that extracts all the URLs and filters them down to those in the blacklist.
That data is then transformed into a DataFrame which is then sent to the Data Output of the Python Tool, and only contains results corresponding to the small batch of posts that were ingested. Alteryx can also use this to track progress during execution.
Once all posts have been processed, one of the Python Tool's Multi-Outputs can return a total count of URLs found that were NOT in the blacklist (sure this can be a part of the Data Output, but just for the sake of this example). Could also be used to trigger "on-complete events."
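To make that use-case concrete, here is a minimal sketch of the kind of function the Data Input could call once per batch. The names (process_batch, blacklisted_domains) and the post_id/post_text columns are purely hypothetical, and this is not an existing Alteryx API; it only illustrates the proposed contract of DataFrame in, DataFrame out.

    import re
    from urllib.parse import urlparse

    import pandas as pd

    URL_PATTERN = re.compile(r"https?://\S+")

    # Populated once from the Multi-Input (the list of blacklisted domains).
    blacklisted_domains = set()

    def process_batch(posts):
        # Called once per ~10,000-row batch arriving on the Data Input.
        # Expects 'post_id' and 'post_text' columns; returns one row per blacklisted URL.
        rows = []
        for post_id, text in zip(posts["post_id"], posts["post_text"]):
            for url in URL_PATTERN.findall(text):
                if urlparse(url).netloc.lower() in blacklisted_domains:
                    rows.append({"post_id": post_id, "url": url})
        # Only this batch's results go to the Data Output, so downstream tools
        # can start working while later batches are still being processed.
        return pd.DataFrame(rows, columns=["post_id", "url"])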
I know I used the term "generators" above, and the design could probably be simplified so that an Alteryx-provided Python function yields between batches, awaiting the next batch's input, to use actual Python generators. However, I feel my initial approach is the simpler process, since generators are more of an intermediate-level feature.
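For comparison, the generator form could be as small as the sketch below; the batches iterable stands in for whatever mechanism Alteryx would use to hand over each incoming DataFrame and is hypothetical.

    def process_batches(batches):
        # 'batches' would be whatever iterable of DataFrames the Data Input exposes.
        for batch in batches:
            yield process_batch(batch)  # reuse the per-batch function sketched above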
I hope this makes sense and is elaborate enough to pursue. Thanks for the consideration!
I was very happy to see the Bulk Loader introduced for Snowflake in the last release. This bulk loader is specifically available for Snowflake environments that are hosted on AWS, but it does not provide functionality for environments using Azure. As Snowflake continues to build momentum, I imagine this will be a common request. Is there something in the pipeline to add this functionality?
For an interim solution, we will be working toward developing some generic scripts/snowsql to mimic that bulk load, but ultimately we'd love to have this as part of the tool.
We want to control the order of execution of objects in an Alteryx workflow, but right now we have only Block Until Done, which is not the right choice in many cases.
Could we have a container (say, a Sequence Container), put a piece of logic in each container, and control execution by connecting the containers? That way we could control the execution order. It might look something like the example below.
I've seen this question before and have run into it myself. I'd like to see a new tool that would allow a developer (of a workflow) to choose a path of logic based upon criteria known only during the execution of a module.
If LEFT INPUT count of records < 10,000 THEN Path1 (e.g. use a Calgary join)
We need some way (unless one exists that I am unaware of, beyond disabling all but the container I want to run) to fire off containers in a particular order: run container "Step1", then run container "Step2", and so on.
With the release of 2018.3, caching has become an ad hoc task. With complex workflows and multiple inputs, we need a method to cache and to save the cache selection per tool. Once the workflow runs after opening, the cache would be saved at the latest tool downstream.
This way we don't have to create ad hoc cache steps and run the workflow twice before realizing the time-saving benefits of caching.
This would work similarly to the cache feature in 11.0 but with enhanced functionality: the best of the old cache combined with the intent of the new one.
While In-DB tools are very helpful and cut down the time needed to write complex SQL, there are some steps that are faster to write directly in SQL, like window functions (OVER (PARTITION BY ...)). In Alteryx, we need to create multiple joins and summaries to perform a window function. It would be immensely helpful if there were a SQL editor tool for in-DB workflows where we can edit the SQL code at any point in the workflow, or even better, if an "edit" function were added to every in-DB tool so we could customize the generated SQL code before it is sent to the next tool.
This will cut down the time immensely and streamline the workflow to make Alteryx a true contender for the ETL solution space.
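To illustrate the gap: a single window function keeps every detail row while also exposing a partition-level aggregate, which in tool form takes a Summarize plus a Join back to the detail. The pandas sketch below (with made-up column names) shows both shapes side by side; the exact in-DB SQL would of course depend on the database.

    import pandas as pd

    sales = pd.DataFrame({"region": ["East", "East", "West", "West"],
                          "amount": [100, 250, 80, 120]})

    # What OVER (PARTITION BY region) gives in one step:
    # every detail row also sees its partition's total.
    sales["region_total"] = sales.groupby("region")["amount"].transform("sum")

    # The tool-based equivalent: a Summarize (groupby/agg)
    # followed by a Join back onto the detail rows.
    totals = (sales.groupby("region", as_index=False)["amount"].sum()
                   .rename(columns={"amount": "region_total_joined"}))
    joined = sales.merge(totals, on="region")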
Currently pip is the package manager in place within the Designer. Unfortunately this is something that doesn't fit our requirements as Data Scientists. We prefer using conda due to the following reasons:
conda also manages non-Python library dependencies. This way conda can be used to manage R packages as well, which comes in quite handy (even though not all packages from the CRAN repository are available).
conda provides a very simple way of creating conda envs (similar to virtualenv, but with conda one can also install and manage pip packages; virtualenv cannot install conda packages!) to isolate required packages (with specific versions) used in a workflow (e.g. for a Python Tool in Designer).
So I would like to have conda instead of, or in addition to, pip, and would like to create my conda envs where I install the packages I need for a specific task within my workflow. Moreover, if you are thinking about offering an R Jupyter notebook capability (like the Python Tool), it could be beneficial to change from pip to conda for managing packages in both worlds.
One of the common things that we need to do, is to take a delta-copy of a file or a DB table into the staging area of the analytical database.
This always looks very similar, so it would be useful to make this a wizard-based process so that teams can easily build these very quickly rather than having to hand-craft them:
- Check which primary keys exist - fill the gaps where they don't
- Do any rows update over time, or is this insert-only? If they update over time, which column is the "updated date" column so that we can spot updates? If there is no update date, then we need to do a column-by-column check of some kind, like a hash or a checksum (see the sketch after this list)
- Do you want to sync deletes?
- Do you want to keep updates?
- Target table in staging area which is now updated compared to the source
- Logging done (similar to what Kimball recommends in the ETL Handbook) with the run date/time; summary stats; and any errors
- Errors table for any errors that arose with row numbers
- Tables in target created (with history table if requested)
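As a sketch of the hash/checksum comparison step referenced above (not the wizard itself): assuming the source and target land in pandas DataFrames and that key_cols names the primary-key columns, the classification into inserts, updates, and deletes could look roughly like this.

    import hashlib

    import pandas as pd

    def row_hash(df, key_cols):
        # Checksum of every non-key column, for when no "updated date" column exists.
        non_key = df.drop(columns=key_cols).astype(str)
        return non_key.apply(lambda r: hashlib.md5("|".join(r).encode()).hexdigest(), axis=1)

    def classify_delta(source, target, key_cols):
        # Split the source into inserts / updates / deletes relative to the staging target.
        src = source.assign(_hash=row_hash(source, key_cols))
        tgt = target.assign(_hash=row_hash(target, key_cols))
        merged = src.merge(tgt[key_cols + ["_hash"]], on=key_cols, how="outer",
                           suffixes=("", "_tgt"), indicator=True)
        inserts = merged[merged["_merge"] == "left_only"]
        deletes = merged[merged["_merge"] == "right_only"]   # only applied if "sync deletes" was chosen
        updates = merged[(merged["_merge"] == "both") & (merged["_hash"] != merged["_hash_tgt"])]
        return inserts, updates, deletes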
Essentially, I want to update a DB table with either an update or a deletion of rows. I can't delete all of the data. My workaround will be to create/insert into a table the keys that I want to delete and then use an input/output tool with SQL that performs the delete (a rough sketch of that workaround follows). Any other suggestions are welcome, but a dedicated tool would be best.
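For illustration only, a minimal sketch of that key-driven delete, assuming a pyodbc connection; the DSN, table names, and key column are all hypothetical placeholders.

    import pyodbc

    conn = pyodbc.connect("DSN=MyWarehouse")   # hypothetical DSN
    cur = conn.cursor()

    # The keys to delete have already been written to a staging table
    # (e.g. by an Output Data tool); delete only the matching rows.
    cur.execute("""
        DELETE FROM dbo.fact_orders
        WHERE order_id IN (SELECT order_id FROM dbo.keys_to_delete)
    """)
    conn.commit()
    conn.close()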
As we do more work analyzing the canvasses that our folks are producing, it's becoming more and more necessary to have a well-documented definition and schema for the XML that is used for Alteryx canvasses.
Please could you publish the full XML definition and schema for Alteryx canvasses? This would allow groups to perform deeper analytics on how people are using Alteryx, automate quality checks, look for learning gaps, scan for dependencies, etc.
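Even without a published schema, this kind of analysis can be prototyped by walking the workflow XML directly. The element and attribute names below (Node, GuiSettings, Plugin) are assumptions based on inspecting .yxmd files and may differ between Designer versions; a published schema would make this reliable.

    from collections import Counter
    import xml.etree.ElementTree as ET

    def tool_usage(yxmd_path):
        # Count which tool plugins a workflow uses.
        root = ET.parse(yxmd_path).getroot()
        plugins = [node.find("GuiSettings").get("Plugin", "")
                   for node in root.iter("Node")
                   if node.find("GuiSettings") is not None]
        return Counter(plugins)

    print(tool_usage("MyWorkflow.yxmd"))   # hypothetical file name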
This got me thinking a little more about localized logging options in Alteryx.
At a high level, there are ways to accomplish this in Designer at a User or System level by enabling a Logging directory and then parsing those logs with a separate Alteryx job. However, this would involve logging ALL Designer executions, which seems like it may be overkill for this need. A user can also manually save a log after each execution, although this requires manual intervention.
I think adding an option in the Runtime settings of the Workflow Configuration to Enable Logging and (optionally) specify a logging directory would be a great feature addition for Designer. In my opinion this should not apply once a workflow runs on Server (Server logging should be handled in a fully standardized way), but it should apply to Designer "UI" execution. Having the ability to add a logging naming convention (perhaps including the workflow name and run date in the log name) would be icing on the cake.
This would allow for a piecemeal logging solution to log specific flows or processes that might be high-visibility or high-importance, while avoiding saving hundreds or thousands of logs daily for less important processes and for dev testing. It would also reduce or eliminate the manual process of saving these logs individually.
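Just to sketch the naming-convention idea (this is plain Python logging, not an Alteryx feature, and the directory and workflow name are hypothetical):

    import logging
    from datetime import datetime
    from pathlib import Path

    LOG_DIR = Path(r"C:\AlteryxLogs")        # hypothetical logging directory
    WORKFLOW_NAME = "Customer_Refresh"       # hypothetical workflow name

    LOG_DIR.mkdir(parents=True, exist_ok=True)
    log_file = LOG_DIR / "{}_{:%Y%m%d_%H%M%S}.log".format(WORKFLOW_NAME, datetime.now())

    logging.basicConfig(filename=str(log_file), level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    logging.info("workflow run started")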
A cache tool would allow a user to temporarily store a snapshot of inline data from a previous run of the module.
Imagine a browse tool that was inline as opposed to a terminus tool (input and output). Now allow that browse tool to persist its data after a run of the module. When an option on that tool was activated, it would block all of the dependent tools upstream from it and instead send its cached data downstream.
The reason I think this would be a useful tool is that I often come to the end of creating a module when I'm working on the Reporting tools. I run multiple times to see the changes I've made. When the module has a lot of incoming data and complex data transformations, it can take a long time just to get to the point where the data gets to the reporting tools. This cache tool would eliminate that wait.
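Outside of Alteryx, the underlying pattern is simple to sketch; build_upstream() below is a hypothetical stand-in for the slow inputs and transformations, and the pickle file stands in for the tool's persisted snapshot.

    from pathlib import Path

    import pandas as pd

    CACHE_FILE = Path("upstream_snapshot.pkl")   # hypothetical cache location

    def get_upstream_data(build_upstream, use_cache=True):
        # Return the snapshot from the last run when available;
        # otherwise run the expensive upstream logic and persist it.
        if use_cache and CACHE_FILE.exists():
            return pd.read_pickle(CACHE_FILE)    # skip the upstream tools entirely
        df = build_upstream()                    # the slow inputs / transformations
        df.to_pickle(CACHE_FILE)
        return df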