Python and Alteryx - Tabula
Hi
I'm 100% new to Python, but I'm trying to use the tabula library.
Basically, I want to load a PDF file and turn it into a dataframe.
It loads the file correctly, but I get an odd error, and I have no idea what it means. 😕
"Warning: Python (1): Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray"
Hi @Hamder83,
This is a cool library! The reason you're getting this error is that you're trying to combine dataframes of differing schemas. I used the following code to look at the rows, columns, and datatypes of the output of tabula. Tabula already outputs dataframes, so you probably don't need to re-convert them later.
# "a" is the list of dataframes that tabula returned
for iteration in a:
    print("rows:", len(iteration), "; columns:", len(iteration.columns), type(iteration))
For the pdf I input to tabula, my dataframes were all different sizes. Naturally, pandas didn't know how I wanted to combine them. Note that if you look at the variable "df", it still worked! Pandas just smushed all the dataframes into a single column. The red background text it gave you is the equivalent of an Alteryx warning, not an error.
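To see what "ragged nested sequences" means without needing a PDF at all, here is a minimal sketch using plain lists instead of tabula output (the list contents are made up for illustration). Older NumPy versions print the deprecation warning from the question when `dtype` is left out, and more recent releases raise an error instead, so the sketch passes `dtype=object` explicitly, which is what the warning text asks for.

```python
import numpy as np

# Two rows of different lengths: a "ragged" nested sequence (made-up data).
ragged = [[1, 2, 3], [4, 5]]

# dtype=object tells NumPy to store each inner list as a single Python object
# instead of trying to build a rectangular 2-D array.
arr = np.array(ragged, dtype=object)

print(arr.shape)  # one-dimensional: one element per inner list
print(arr.dtype)
```

The result is a one-dimensional array holding two list objects, which is the same kind of "everything in a single column" outcome described above.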
Going forward (and depending on your use case), I'd recommend you reformat your columns so they align and then combine the dataframes output by tabula, using pd.concat() to stack them on top of each other or pd.merge() to join them on a shared key column. You can use the help() function on a Python function to ask Jupyter what parameters it wants. For example:
help(tabula.read_pdf)
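One way the "align the columns, then combine" step could look is sketched below. It is only an illustration: the `tables` list stands in for whatever tabula returned, and the column names, the `align` helper, and the `wanted` list are all hypothetical. It stacks the tables with `pd.concat()`; `pd.merge()` would be the choice if you wanted to join tables side by side on a key column instead.

```python
import pandas as pd

# Stand-ins for the list of dataframes returned by tabula (hypothetical data).
tables = [
    pd.DataFrame({"item": ["a", "b"], "qty": [1, 2]}),
    pd.DataFrame({"Item": ["c"], "Qty": [3], "note": ["x"]}),
]

def align(df, columns):
    """Lower-case the column names and keep only the wanted columns."""
    out = df.rename(columns=str.lower)
    return out.reindex(columns=columns)

# Hypothetical target schema shared by every table.
wanted = ["item", "qty"]

# Stack the aligned tables into one dataframe with a fresh 0..n-1 index.
combined = pd.concat([align(t, wanted) for t in tables], ignore_index=True)
print(combined)
```

Running the shape check from earlier in this post on each table first is a quick way to decide which columns belong in `wanted`.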
I also find that Google Colab is far more intuitive for learning Python than Jupyter notebooks. I practice most of my syntax in Colab before transferring it into Alteryx Jupyter.
If this helps, please consider marking it as a solution so others may find it. Thanks!
Not a direct answer, but more of an aside - there is also a Tabula-specific R package. It may be easier to use than the Python implementation.
Thank you for a super fine explanation, I will try and dive further into it 🙂
This is definitely a good help!
Sounds interesting, I'll have a look at that, thanks 🙂
