Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Can you write data to Databricks using the Output tool?

pg_two
7 - Meteor

Hi all,

Can you write data to Databricks using the Output tool? How do I write my result data to Databricks?

I ran into a problem: I tried to use the Output tool to write the resulting data to Databricks, but got a message saying it is not supported. Does anyone have a solution?

Thanks, all.

pg_two_1-1668499613618.png

 

 

9 REPLIES
DavidSkaife
13 - Pulsar

Hi @pg_two 

 

It doesn't look like you can use the Output tool to write to Databricks; you need to use the Data Stream In tool instead. More information can be found here (for Alteryx version 2022.1): https://help.alteryx.com/20221/designer/databricks

pg_two
7 - Meteor

Hi @DavidSkaife 

 

When I use the Data Stream In tool to write to Databricks, it reports a bug in Alteryx itself:

Error: Data Stream In (5): You have found a bug. Replicate, then let us know. We shall fix it soon.

DavidSkaife
13 - Pulsar

Hi @pg_two 

 

I can suggest the following, if you haven't already:

 

Ensure your drivers are fully up to date

Ensure your driver configuration is set up correctly as per the below

 

DavidSkaife_0-1668677232579.png

 

If you created the workflow in Alteryx 2022.1+, try running it with the AMP Engine off, just to see if that makes a difference.

 

DavidSkaife_1-1668677350029.png

If none of these make a difference and you still get the message, I'd raise a support request with Alteryx so they can look into it for you.

apathetichell
18 - Pollux

Hey - not sure if you got a satisfactory solution to this. I have never successfully written to Databricks (AWS) using the Output Data tool. There are multiple write options listed (Databricks CSV, for example), all of which error somewhere along the way.

 

I do, however, successfully write to Databricks using Data Stream In all the time. If you need help setting this up, there's some nuance here, so feel free to drop me a line. Basically, you have to have the SQL endpoint set up in your ODBC, and you need your token/PAT in both the read and write In-DB connections.
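For reference, a minimal sketch of that same SQL endpoint + PAT setup done outside Alteryx with pyodbc (the driver name, host, HTTP path, and token below are placeholders, and the exact keywords can vary by driver version):

import pyodbc

# DSN-less connection string for the Simba Spark ODBC driver that Databricks uses.
# Host, HTTPPath, and PWD are hypothetical values - substitute your SQL endpoint details.
conn_str = (
    "Driver=Simba Spark ODBC Driver;"
    "Host=adb-1234567890123456.7.azuredatabricks.net;"
    "Port=443;"
    "HTTPPath=/sql/1.0/warehouses/abc123;"
    "SSL=1;"
    "ThriftTransport=2;"          # HTTP transport
    "AuthMech=3;"                 # user/password auth: the user is the literal word 'token'
    "UID=token;"
    "PWD=dapiXXXXXXXXXXXXXXXX;"   # the same PAT goes in both the read and write In-DB connections
)

with pyodbc.connect(conn_str, autocommit=True) as conn:
    conn.cursor().execute("SELECT 1")   # quick connectivity check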

 

I hit that "you have found a bug" a bunch of times. I vaguely remember getting that extensively when I used summarize in-db on all of the fields prior to datastream out. On the write-side this may have had to do with improperly configured write connection string. my notes say the write Endpt needs to be https:// and needs to not end in / - it should also be set for Databricks bulk loader CSV

 

And while I'm commenting: Alteryx does NOT support append-to-existing with Databricks. You can set up a Spark job to merge or something, but append to existing is not supported. The product team told me last year they were working on incorporating Merge into it.
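In the meantime, one way to emulate append/upsert is to land the Alteryx output in a staging table and then run the MERGE yourself. A sketch, assuming the target is a Delta table, the staging and target schemas match, and using hypothetical table and key names:

import pyodbc

conn = pyodbc.connect(conn_str, autocommit=True)   # conn_str as in the earlier sketch

# Upsert the freshly loaded staging table into the existing target table.
conn.cursor().execute("""
    MERGE INTO sales.orders AS target
    USING sales.orders_staging AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")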

maamek
7 - Meteor

Thank you so much @apathetichell. Following your advice literally brought three days of work to an end, even with an open case with Alteryx Support, so thank you!

 

jheck
6 - Meteoroid

Have you had any issues writing larger datasets? I have no issues writing smaller tables using what you outlined here with the In-DB Data Stream In tool; however, anything over 2GB seems to error out. Is there a limit on Databricks that is causing the issue?

apathetichell
18 - Pollux

@jheck honestly it could be a bunch of things, but the first places I'd look are your cluster timeout policy and where you are staging your data. If you are using a bulk loader (which you pretty much have to), you are moving the data into a single staging location in your system and then pushing it to Databricks (versus incremental API calls). This can save time, but if your cluster goes to sleep during the inactivity, the load won't work. You can enable driver logging in your ODBC, but I'd look at cluster timeout first. If I were pushing large data, I'd probably push to S3 for staging.
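If you want to rule the timeout out quickly, you can read the cluster's auto-termination setting from the Databricks Clusters REST API. A sketch with placeholder host, token, and cluster ID:

import requests

host = "https://adb-1234567890123456.7.azuredatabricks.net"   # placeholder workspace URL
token = "dapiXXXXXXXXXXXXXXXX"                                 # placeholder PAT
cluster_id = "0123-456789-abcdefgh"                            # placeholder cluster ID

resp = requests.get(
    f"{host}/api/2.0/clusters/get",
    headers={"Authorization": f"Bearer {token}"},
    params={"cluster_id": cluster_id},
)
resp.raise_for_status()
info = resp.json()
print("State:", info.get("state"))
print("Auto-terminates after (minutes):", info.get("autotermination_minutes"))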

jheck
6 - Meteoroid

I'm far from an expert on this, so I'm not entirely sure what you mean by push to S3 for staging. Is this something that can be done in Alteryx? The other limitation I have is that I'm unable to receive a shared key for security reasons.

apathetichell
18 - Pollux

@jheck - sounds like my company. Usually (always) there's a storage system intertwined with your lakehouse (i.e., your lake), which should be S3 or Azure Blob Storage. You can use this to stage (write the files to) if you have keys. If you don't have access, I'd recommend:

 

1) Request a longer timeout on the cluster (say 30 minutes vs 15).

2) Batch your writes (write every x minutes).

3) If you can authenticate to your cloud (AWS/Azure) via the CLI, you can transfer the file via the CLI rather than internally in Alteryx (there's a sketch of this staging-and-load pattern below).
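For option 3, a rough sketch of the staging-and-load pattern (bucket, paths, and table names are hypothetical, and the SQL warehouse/cluster needs read access to the bucket, e.g. via an instance profile):

import boto3
import pyodbc

# Stage the extract in S3 using your existing AWS credentials (the same ones the CLI uses).
s3 = boto3.client("s3")
s3.upload_file("orders_extract.csv", "my-staging-bucket", "alteryx/orders_extract.csv")

# Then have Databricks ingest the staged file with COPY INTO.
conn = pyodbc.connect(conn_str, autocommit=True)   # conn_str as in the earlier sketch
conn.cursor().execute("""
    COPY INTO sales.orders
    FROM 's3://my-staging-bucket/alteryx/'
    FILEFORMAT = CSV
    FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
""")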
