Hi,
I'd like to use the Python tool, but import a yxdb file from a file path, NOT by connecting an input to the tool.
The reason for this is that when you use the usual import, first the file is loaded into Alteryx memory, then again loaded into Python memory upon Alteryx.read(), so you essentially double the memory usage for a dataset. But if you only need the data for the Python process this is unnecessary and actually makes some tasks impossible.
Is there a way to do this?
Thanks!
Hi @vcarey
A really smart Alteryx developer @tlarsen7572 did exactly that by creating a Python library: https://github.com/tlarsendataguy-yxdb/yxdb-py
Important to note that this is not officially Alteryx developed, but impressive nonetheless!
Hi @vcarey
AFAIK there is no other tool that can read .yxdb files. It's a proprietary format for which Alteryx has never published the specifications. Edit: this statement is no longer true after seeing Brandon's post. You can probably ignore the rest of this post as well.
Conversely, since it is proprietary, the only tool that can produce .yxdb files is Alteryx Designer. That means the .yxdb files you're trying to read are created by a workflow that you probably have access to. Can you modify the output of that workflow to produce .csv files? I assume that if memory is a concern, there is too much data to push to Excel.
Dan
Thank you both for your help! I assume since the Python library is not official it won't be supported? I would really like to see this as a possibility because of memory issues. I do have a workflow and can output CSV etc but that's creating an extra file. Since virtual memory uses disk, too, this can be problematic.
I did see this topic:
I was going to see if I could try that somehow
@vcarey I saw the reference to "\Alteryx\bin\PyYXDBReader.pyd" as well which may be helpful for you. I think pyd files are kind of like .dll files. Not sure if it is as simple as doing import PyYXDBReader but possibly. You probably also need the search path for the pyd file in PYTHONPATH if you go down this route.
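A minimal sketch of the search-path idea above, assuming the .pyd sits in the Alteryx bin directory mentioned in this thread and was built for the same Python version as the interpreter loading it. The helper name try_import_from_dir is hypothetical. It appends the directory to sys.path at runtime (similar in effect to listing it in PYTHONPATH) and returns None instead of raising if the import fails.

```python
import importlib
import os
import sys


def try_import_from_dir(module_name, directory):
    """Try to import module_name after adding directory to the search path.

    Returns the module on success, or None if the directory does not exist
    or the import fails (hypothetical helper, not part of any Alteryx API).
    """
    directory = os.path.abspath(directory)
    if not os.path.isdir(directory):
        return None
    if directory not in sys.path:
        sys.path.append(directory)
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


# Assumed install location, taken from the path quoted in this thread.
reader_module = try_import_from_dir("PyYXDBReader", r"C:\Program Files\Alteryx\bin")
if reader_module is None:
    print("PyYXDBReader could not be imported from that directory")
```

Whether a plain import works this way for this particular .pyd is untested here; if it does not, loading the file by its full path is the fallback.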
So I think I have this working. Here is part of the code in the Python tool (may be incomplete as I'm pasting from a larger program). Sorry about the formatting.
from ayx import Package, Alteryx
import importlib.util
import os
import sys
import pandas as pd
import numpy as np

# See https://stackoverflow.com/questions/74841508/cant-pickle-class-import-of-module-failed
def import_module(name, path):
    # Load a module directly from a file path instead of the search path.
    spec = importlib.util.spec_from_file_location(name, os.path.abspath(path))
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module  # Register the module under its name (see the link above)
    spec.loader.exec_module(module)
    return module

PyYXDBReader = import_module('PyYXDBReader', "C:/Program Files/Alteryx/bin/PyYXDBReader.pyd")
reader = PyYXDBReader.AlteryxYXDB()
reader.open("Some/file/path/here.yxdb")
meta = reader.get_record_meta()
names = [m['name'] for m in meta]
# Read every record in one call and build the DataFrame.
data = pd.DataFrame(reader.read_records(reader.get_num_records()), columns=names)
Also, what is the status of this package (PyYXDBReader.pyd)? As far as I can tell, it is nearly undocumented: its functions and classes have no docstrings. Is this a stable interface we should use?
It would be nice to have this as part of the Alteryx package, e.g. Alteryx.read_yxdb
I do like that you can read a certain number of records at a time, and repeat to read more
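The read-some-then-read-more pattern can be sketched as a generator. This assumes, based only on the calls used in the snippet above, that read_records(n) returns the next n records as a list of rows, that successive calls continue where the previous one stopped, and that get_num_records() reports the total row count. Since the module is undocumented, that behaviour should be checked against a small file first. The names iter_record_chunks and count_rows, and the default chunk size, are hypothetical.

```python
def iter_record_chunks(reader, chunk_size=100_000):
    """Yield lists of records, at most chunk_size rows each.

    `reader` is assumed to be an opened PyYXDBReader.AlteryxYXDB object, or
    anything else exposing get_num_records() and read_records(n).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    remaining = reader.get_num_records()
    while remaining > 0:
        chunk = reader.read_records(min(chunk_size, remaining))
        if not chunk:
            break  # stop rather than loop forever if the reader returns nothing
        remaining -= len(chunk)
        yield chunk


def count_rows(reader, chunk_size=100_000):
    """Example consumer: count rows while holding only one chunk at a time."""
    total = 0
    for chunk in iter_record_chunks(reader, chunk_size):
        total += len(chunk)
    return total
```

Each chunk could be turned into a DataFrame and reduced (filtered, aggregated) before the next one is read, so peak memory depends on the chunk size rather than the file size.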
@vcarey wrote: Thank you both for your help! I assume since the Python library is not official it won't be supported? I would really like to see this as a possibility because of memory issues. I do have a workflow and can output CSV etc but that's creating an extra file. Since virtual memory uses disk, too, this can be problematic.
Supported by Alteryx? No. I'm the creator of that library and would be responsible for supporting it.
There is 1 major limitation you should be aware of: My library cannot read YXDB files generated by the AMP engine. Alteryx changed the format of AMP YXDBs and have not released the file spec. I will continue to ask them to. You can get around this by setting the 18.1 compatibility checkbox in the Output tool.
Regarding memory issues, you should not have any with my library. My library reads one record at a time, and therefore only consumes enough memory to read one record at a time. This gives you 2 options for working with large files:
I am happy to answer any questions or concerns if you want to try out my library. I've not used Alteryx's .pyd file and cannot give guidance on that.
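As an illustration of the record-at-a-time approach described in this post, here is a hedged sketch that aggregates one field without ever holding the whole file in memory. It is written against any reader object that exposes next() and read_name(field); those method names, and the import path in the trailing comments, are recalled from the yxdb-py README and should be verified against the repository linked earlier in the thread. The function sum_field and the field name "Amount" are hypothetical.

```python
def sum_field(reader, field_name):
    """Sum a numeric field one record at a time.

    `reader` is assumed to advance with next() (returning False at the end)
    and to return the current record's value from read_name(field_name).
    Null values (None) are skipped. Returns (total, non_null_count).
    """
    total = 0.0
    count = 0
    while reader.next():
        value = reader.read_name(field_name)
        if value is not None:
            total += value
            count += 1
    return total, count


# Assumed usage with the yxdb package (unverified here, check the README):
# from yxdb.yxdb_reader import YxdbReader
# reader = YxdbReader(path="Some/file/path/here.yxdb")
# total, count = sum_field(reader, "Amount")
```

Only the running total and the current record are kept, which is what keeps memory use flat regardless of file size.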
@tlarsen7572 Thank you for the information about the library and for your work putting it together! It's good to know it can help with memory issues.