When I investigated why one of my Postgres instances crashed, it turned out the disk space had been eaten by a temp file. This specific instance runs on AWS / S3 / EMR.
The temp files are created in an s3a directory and are fairly large.
This seems to be some kind of caching; I noticed they vary in size depending on the job's data size.
Can they be deleted, and can their creation be prevented in the first place?
Also, this is a 5.1 instance we are just testing (we used the AMI version), in case that makes a difference. Thanks in advance.
Hi @Mad Hatter, this may be related to the temporary directory on the Trifacta node that's used for buffering uploads to S3.
Try modifying the location in /opt/trifacta/conf/trifacta-conf.json:
"filewriter": {
  "max": 16,
  "hadoopConfig": {
    "fs.s3a.buffer.dir": "/tmp",
    "fs.s3a.fast.upload": false
  },
  ...
}
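If you prefer to script the change, here's a minimal sketch that sets both keys and writes the file back. It assumes the standard config path mentioned above; the function name is my own, not part of any Trifacta tooling.

```python
import json

# Path from the post above; adjust for your install.
CONF_PATH = "/opt/trifacta/conf/trifacta-conf.json"

def set_s3a_buffering(conf_path, buffer_dir="/tmp", fast_upload=False):
    """Set the S3A buffer dir and fast-upload flag in trifacta-conf.json."""
    with open(conf_path) as f:
        conf = json.load(f)
    # Create the nested sections if they are missing, then set both keys.
    hadoop = conf.setdefault("filewriter", {}).setdefault("hadoopConfig", {})
    hadoop["fs.s3a.buffer.dir"] = buffer_dir
    hadoop["fs.s3a.fast.upload"] = fast_upload
    with open(conf_path, "w") as f:
        json.dump(conf, f, indent=2)

# Usage (run on the Trifacta node, then restart the service):
# set_s3a_buffering(CONF_PATH)
```

Remember to restart Trifacta afterwards so the new hadoopConfig values take effect.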
For your reference, this is outlined in https://docs.trifacta.com/display/r051/Enable+S3+Access?os_username=tr82r051usr-&os_password=%5E88f2Sl2mG0A2l1239F%5E
In the above, you can disable buffering by setting fs.s3a.fast.upload to false; that should take care of your temp file problem.
Regarding the side effects of setting "fs.s3a.fast.upload": "false": other than the decreased performance, files that Trifacta works on in S3 will be downloaded fully to the edge node.
For example, if jobs write big files, those files get downloaded to the edge node at the publishing step. This in turn can cause the edge node to run out of disk space.
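To guard against that edge-node disk-space risk, a quick free-space check before large jobs can help. This is just a sketch; the helper name and the 10 GB threshold are my own assumptions, not anything Trifacta ships.

```python
import shutil

def check_buffer_space(path="/tmp", min_free_gb=10.0):
    """Warn if the filesystem holding `path` is low on free space.

    Returns the free space in GB so callers can act on it.
    """
    usage = shutil.disk_usage(path)
    free_gb = usage.free / 1e9
    if free_gb < min_free_gb:
        print(f"WARNING: only {free_gb:.1f} GB free under {path}")
    return free_gb

# Usage: run on the edge node before kicking off big publishing jobs.
# check_buffer_space("/tmp")
```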
Thanks @Sebastian Cyris, will go ahead and try it out.
Thanks @Sebastian Cyris, this did the trick.
Didn't really see any performance impact, since pretty much all of the jobs are scheduled.