I have found some odd behavior on the encoding of dates in an Avro format I was hoping someone might be able to shed some light on.
I am connecting to SQL Server, say Adventureworks db, and grabbing a table, say DimEmployee, which has dates on it (dates, not datetime). Alteryx grabs the data just fine and it shows up as a date. I can then output this to an Avro file, either locally or on Hadoop, just fine. My issue is in the encoding process Alteryx converts date to a string. You can see this by looking at the schema at the header of the avro file. for example:
...
{
"name": "HireDate",
"type": [
"null",
"string"
]
},
...
Avro does support dates but it has to be encoded differently. Their documentation shows how here:
https://avro.apache.org/docs/current/spec.html#Date
Why does Alteryx force dates to become strings? I can work around it once it's on HDFS but it's kind of a pain to take the extra step.
Solved! Go to Solution.
Follow up - I should have mentioned I also tried to convert the date to an Alteryx Date and then go to Avro. It still turns it in to a string
I think the date in Avro is an int, not the same data format as in Alteryx.
So we need to calculate the difference between 1970-01-01.
DateThe date logical type represents a date within the calendar, with no reference to a particular time zone or time of day.
A date logical type annotates an Avro int, where the int stores the number of days from the unix epoch, 1 January 1970 (ISO calendar).
The following schema represents a date:
{ "type": "int", "logicalType": "date" }
Thank you Qiu for the idea. It makes sense about the 1970 date since Timestamps in Impala uses the "unix epoch" which starts on that date. I tried doing the conversion but it still isn't going to work. It now encodes it like this:
{
"name": "AvroHireDate",
"type": [
"null",
"int"
]
}
When I think it should be encoding it something like this:
{
"name": "AvroHireDate",
"type": {
"type": "int",
"logicalType": "date"
}
}
It's really Alteryx's job to encode Avro correctly.It actually says in the Avro documentation in the introduction "Implementations of Avro must adhere to this document." Apparently Alteryx isn't doing this. Alteryx should know SQL Server is sending a date and that it should encode as an Avro date. Maybe in future release this is something they'll fix.
I already have an easy solution since once on the cluster I convert Avro to Parquet and in that process I change the String to a Timestamp without any issue. Alteryx doesn't create Parquet so this is something I have to do anyway.
Thanks again Qui
Scott
As a small update this is still a problem for me but I noticed it's officially documented this is expected behavior and not a bug:
https://help.alteryx.com/20214/designer/avro-data-types
We use these Avro files with Cloudera Impala which also does not support the date data type and forces a string conversion. The only work around is convert using SQL