Python tool: read yxdb from a file (not from input connection)
Hi,
I'd like to use the Python tool, but import a yxdb file from a file path, NOT by connecting an input to the tool.
The reason is that with the usual input connection, the data is first loaded into Alteryx memory and then loaded again into Python memory on Alteryx.read(), which essentially doubles the memory usage for a dataset. If you only need the data in the Python process, this is unnecessary and can make some tasks impossible.
Is there a way to do this?
Thanks!
Hi @vcarey
A really smart Alteryx developer @tlarsen7572 did exactly that by creating a Python library: https://github.com/tlarsendataguy-yxdb/yxdb-py
Important to note that this is not officially Alteryx developed, but impressive nonetheless!
Hi @vcarey
AFAIK there is no other tool that can read .yxdb files. It's a proprietary format that Alteryx has never published the specifications for. This statement is no longer true after seeing Brandon's post. You can probably ignore the rest of this post as well.
Conversely, since it is proprietary, the only tool that can produce .yxdb files is Alteryx Designer. That means the .yxdb files you're trying to read are created by a workflow that you probably have access to. Can you modify the output of that workflow to produce .csv files? I assume that if memory is a concern, there is too much data to push to Excel.
Dan
Thank you both for your help! I assume since the Python library is not official it won't be supported? I would really like to see this as a possibility because of memory issues. I do have a workflow and can output CSV etc but that's creating an extra file. Since virtual memory uses disk, too, this can be problematic.
I did see this topic:
I was going to see if I could try that somehow
@vcarey I saw the reference to "\Alteryx\bin\PyYXDBReader.pyd" as well which may be helpful for you. I think pyd files are kind of like .dll files. Not sure if it is as simple as doing import PyYXDBReader but possibly. You probably also need the search path for the pyd file in PYTHONPATH if you go down this route.
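For anyone trying the search-path route, here is a minimal sketch of the idea. It is untested against Alteryx itself: the helper name `import_from_dir` and the install directory are assumptions for illustration, and a .pyd built for one Python version will generally only load in a matching interpreter (such as the one the Python tool runs).

```python
import importlib
import sys

# Assumed install location; adjust to wherever Alteryx is installed.
ALTERYX_BIN = r"C:\Program Files\Alteryx\bin"


def import_from_dir(module_name, directory):
    """Add `directory` to the module search path, then import `module_name`.

    Hypothetical helper: appending to sys.path at runtime has the same
    effect as listing the directory in PYTHONPATH before Python starts.
    """
    if directory not in sys.path:
        sys.path.append(directory)
    return importlib.import_module(module_name)


# Inside the Python tool this would be (not run here):
# PyYXDBReader = import_from_dir("PyYXDBReader", ALTERYX_BIN)
```

The directory is appended rather than inserted at the front, so modules already on the path keep priority over anything else that happens to live in the Alteryx bin folder.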
So I think I have this working. Here is part of the code in the Python tool (may be incomplete as I'm pasting from a larger program). Sorry about the formatting.
from ayx import Package, Alteryx
import importlib.util
import os
import sys
import pandas as pd
import numpy as np

# See https://stackoverflow.com/questions/74841508/cant-pickle-class-import-of-module-failed
def import_module(name, path):
    # Load a module (here a compiled .pyd) directly from a file path
    spec = importlib.util.spec_from_file_location(name, os.path.abspath(path))
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module  # Register the module so later lookups (e.g. pickling) can find it
    spec.loader.exec_module(module)
    return module

PyYXDBReader = import_module('PyYXDBReader', "C:/Program Files/Alteryx/bin/PyYXDBReader.pyd")

reader = PyYXDBReader.AlteryxYXDB()
reader.open("Some/file/path/here.yxdb")
# Field metadata: one dict per column, including its name
meta = reader.get_record_meta()
names = [m['name'] for m in meta]
# Read every record in one call and build the DataFrame
data = pd.DataFrame(reader.read_records(reader.get_num_records()), columns=names)
What is the status of this module (PyYXDBReader.pyd)? As far as I can tell, it is nearly undocumented: the functions and classes have no docstrings. Is this a stable interface that we should use?
It would be nice to have this as part of the Alteryx package, e.g. Alteryx.read_yxdb.
I do like that you can read a certain number of records at a time and repeat to read more.
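A rough sketch of that chunked pattern follows. It assumes, as described above but not confirmed by any documentation, that successive `read_records(n)` calls continue where the previous call stopped and that `get_num_records()` returns the total record count. `iter_chunks` and `FakeReader` are hypothetical names; `FakeReader` only stands in for `PyYXDBReader.AlteryxYXDB` so the example runs without Alteryx.

```python
def iter_chunks(reader, chunk_size):
    """Yield lists of records, at most `chunk_size` records per list."""
    remaining = reader.get_num_records()
    while remaining > 0:
        rows = reader.read_records(min(chunk_size, remaining))
        if not rows:
            break  # Defensive stop if the reader returns nothing
        remaining -= len(rows)
        yield rows


class FakeReader:
    """Stand-in for the real reader (assumption: same two method names)."""

    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0

    def get_num_records(self):
        return len(self._rows)

    def read_records(self, n):
        chunk = self._rows[self._pos:self._pos + n]
        self._pos += len(chunk)
        return chunk


demo = FakeReader([[i, i * i] for i in range(5)])
chunk_sizes = [len(chunk) for chunk in iter_chunks(demo, 2)]
```

With the real reader, each chunk could be turned into a DataFrame, processed, and discarded before the next one is read, so only one chunk is held in Python memory at a time.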
@vcarey wrote: Thank you both for your help! I assume since the Python library is not official it won't be supported? I would really like to see this as a possibility because of memory issues. I do have a workflow and can output CSV etc but that's creating an extra file. Since virtual memory uses disk, too, this can be problematic.
Supported by Alteryx? No. I'm the creator of that library and would be responsible for supporting it.
There is one major limitation you should be aware of: my library cannot read YXDB files generated by the AMP engine. Alteryx changed the format of AMP YXDBs and has not released the file spec. I will continue to ask them to. You can get around this by setting the 18.1 compatibility checkbox in the Output tool.
Regarding memory issues, you should not have any with my library. It reads one record at a time, so it only consumes enough memory to hold a single record. This gives you 2 options for working with large files:
- Read the records one at a time and perform running calculations on them. Memory usage will be very low.
- Batch records into a pre-sized data structure (such as a dataframe), perform your work, then read the next batch, and so on.
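The two options can be sketched generically as below. This is not the library's API, which is not shown in this thread: `record_stream` is a hypothetical stand-in for whatever loop yields one record at a time from the reader, and `running_mean` and `batched` are illustrative helper names.

```python
from itertools import islice


def running_mean(values):
    """Option 1: consume values one at a time, keeping only a count and a sum."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
    return total / count if count else None


def batched(records, batch_size):
    """Option 2: group a record stream into lists of at most `batch_size`."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def record_stream(n):
    """Hypothetical stand-in for a record-at-a-time YXDB reader."""
    for i in range(n):
        yield {"id": i, "amount": float(i)}


mean_amount = running_mean(r["amount"] for r in record_stream(10))
batch_lengths = [len(batch) for batch in batched(record_stream(10), 4)]
```

In both cases memory use is bounded by one record or one batch rather than by the size of the file, which is the point of the two options above.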
I am happy to answer any questions or concerns if you want to try out my library. I've not used Alteryx's .pyd file and cannot give guidance on that.
@tlarsen7572 Thank you for the information about the library and for your work putting it together! It's good to know it can help with memory issues.
