I have a Dataprep flow that had been working, but it has recently started to throw exceptions:
java.io.IOException: Unable to create dataset: temp_dataset_beam_job_*_*
The exception occurs 9 times during the processing of the flow, and the job ultimately fails. It seems a little intermittent, but it is stopping my flow from completing.
The strange thing is that the dataset does exist, and it contains a table with data in it. It is owned by the <id>-compute@developer.gserviceaccount.com account that is running the flow.
Is this a misleading error message, or is there an issue with my permissions somewhere?
A quick update on this: the same flow will sometimes succeed and sometimes fail. It can report one or two of the above errors and retry/continue to success, or it can fail outright. I therefore don't think it is permission related, since it does sometimes work. I'd be grateful for any suggestions; it feels like it may be an issue with the BigQuery API or platform?
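For anyone wanting to double-check the same thing, a minimal sketch with the google-cloud-bigquery Python client (the project and dataset IDs below are placeholders for the redacted ones) to confirm the temp dataset exists and see who can access it:

```python
# Minimal sketch: confirm the temp dataset exists and list its access entries.
# "my-project" and the dataset name are placeholders for the redacted ones.
from google.api_core.exceptions import NotFound
from google.cloud import bigquery

client = bigquery.Client(project="my-project")
try:
    dataset = client.get_dataset("my-project.temp_dataset_beam_job_xxx")
except NotFound:
    print("Dataset really is missing")
else:
    print(f"Dataset {dataset.dataset_id} exists; access entries:")
    for entry in dataset.access_entries:
        print(f"  role={entry.role} entity={entry.entity_id}")
```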
Hi, Angus--
Just guessing: this sounds like a connectivity issue. Have you tried:
* Pinging the server
* Accessing the table through BigQuery from the same computer (see the sketch after this list)
* Trying to run the flow from a different machine
* Does the flow have dependencies on other flows or datasets?
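For the second bullet, something along these lines would do as a quick check; a rough sketch using the google-cloud-bigquery Python client, with a placeholder table name:

```python
# Quick connectivity check: run a trivial query against the source table
# from the same environment the flow runs in. Table name is a placeholder.
from google.cloud import bigquery

client = bigquery.Client()
job = client.query("SELECT COUNT(*) AS n FROM `my-project.my_dataset.my_table`")
for row in job.result():  # blocks until the query job finishes
    print(f"Reachable; table has {row.n} rows")
```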
As a reminder, this is a public forum, so I would recommend caution in posting user IDs and other PII here.
Cheers,
-SteveO
Hi
Thanks for the reply.
I can access the data through BigQuery without any problem, and part of the flow does work sometimes, but with intermittent errors.
The flow is pulling data from two BigQuery views and a CSV file stored in Google Cloud Storage. It doesn't use any reference datasets. Having said that, I re-created the issue with a simpler flow which just pulled data from one of the datasets, changed the type of one field, and then output to BigQuery.
This is all running on Google Cloud Platform, so I'm not sure I can run the flow on a different machine. It appears to be a connectivity issue within GCP itself?
This is affecting my flows too; it seems to have started a few days ago. I'm not sure how to follow the diagnostic steps Steve suggested, as everything is running entirely on GCP.
I can access the table in question when collecting a random sample, but when I run a job on even a simple flow that just queries the table, it fails with this error. Even more curiously, the sample-collection Dataflow job writes one of these errors to the log but still succeeds; the jobs that fail write the message to the log multiple times.
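If it helps anyone compare notes, a sketch of pulling the matching entries out of Cloud Logging with the Python client; the filter is an assumption about how the error surfaces in the Dataflow worker logs:

```python
# Sketch: pull recent Dataflow error entries that mention the failing dataset.
# The filter is an assumption about how this error surfaces in Cloud Logging.
from google.cloud import logging

client = logging.Client()
log_filter = (
    'resource.type="dataflow_step" '
    'AND severity>=ERROR '
    'AND textPayload:"Unable to create dataset"'
)
for entry in client.list_entries(filter_=log_filter, max_results=20):
    print(entry.timestamp, entry.payload)
```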
Hi
A further update on this. The inputs to the flow in question are views that query GA360 data.
By materializing the view into a table in BigQuery and modifying the flow to use that table instead, it appears to process successfully.
Neither view is particularly huge in BigQuery terms: one returns 16,000 rows, the other 1.8 million. In BigQuery, both execute within 30 seconds.
Are there limitations on using views as inputs to flows in Dataprep? The reason for using views is that I want to reduce the data being processed: the view is designed to query only the partitions added since the last time I ran the flow.
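For reference, the workaround amounts to something like the following; a sketch with the BigQuery Python client, where all the names are placeholders and the incremental partition filter lives inside the view itself:

```python
# Sketch of the workaround: materialize the view into a plain table, then
# point the Dataprep flow at the table instead of the view. All names are
# placeholders; the partition filter lives inside the view definition.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.QueryJobConfig(
    destination="my-project.my_dataset.ga360_snapshot",
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query(
    "SELECT * FROM `my-project.my_dataset.ga360_incremental_view`",
    job_config=job_config,
).result()  # wait for the query job to finish
```

Scheduling that query just before the flow runs should give the same incremental behaviour without using the view as a flow input.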
As an aside, assuming this is the cause, should I expect a more meaningful error message?
Thanks
Angus
Hi, guys--
Sorry for the confusion: I was suggesting that you run the flow from a different machine to test whether it was some kind of local or connectivity issue.
However, since Stewart replicated the issue and you verified it with a simpler flow, you should create a Support ticket for this one.
Cheers,
-SteveO
FYI, as of October 2019, this issue has been resolved.
Good