I have multiple datasets with column headers in Japanese. I ran the headers through Google Translate and put the original and translated header values in a spreadsheet so that when we receive updated datasets in the future, I can use this as a template to easily find and replace headers with the translation. I created what I thought was a pretty straightforward Python script to load each dataset into a dataframe, union all of the dataframes based on the common set of fields between them, and then rename the headers with the translations.
from ayx import Alteryx
import pandas as pd
# Import data & translation template
df_2401_2t = pd.read_excel('**filepath + sheet name**')
df_2401_4t = pd.read_excel('**filepath + sheet name**')
df_2601 = pd.read_excel('**filepath + sheet name**')
df_trans = pd.read_excel('**filepath + sheet name**')
frames = [df_2401_2t, df_2401_4t, df_2601]
# Union df on common field names
df = pd.concat(frames, join = 'inner', ignore_index = True)
# Find & replace headers with translations
for col in df.columns:
    for i in df_trans.index:
        if col == df_trans['Original'][i]:
            df.rename({col: df_trans['Translation'][i]}, axis=1, inplace=True)
# Output df to workflow
Alteryx.write(df, 1)
However I'm getting the following error when I try to write df back out to Alteryx:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-775ca6c91598> in <module>
----> 1 Alteryx.write(df, 1)

e:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\export.py in write(pandas_df, outgoing_connection_number, columns, debug, **kwargs)
     86     """
     87     return __CachedData__(debug=debug).write(
---> 88         pandas_df, outgoing_connection_number, columns=columns, **kwargs
     89     )
     90

e:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\CachedData.py in write(self, pandas_df, outgoing_connection_number, columns, output_filepath)
    430             # (check only first non-null value in column -- tradeoff for efficiency)
    431             col_contains_bytearrays = coltype == "object" and isinstance(
--> 432                 firstValidValue(pandas_df[colname]), (bytearray, bytes)
    433             )
    434             try:

e:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\DataUtils.py in firstValidValue(pd_series)
     46 def firstValidValue(pd_series):
     47     if not isinstance(pd_series, pd.core.series.Series):
---> 48         raise TypeError(f"input must be a pandas series, not a {type(pd_series)}")
     49     if hasattr(pd_series, "first_valid_index"):
     50         first_valid_index = pd_series.first_valid_index()

TypeError: input must be a pandas series, not a <class 'pandas.core.frame.DataFrame'>
I'm not really following here: it's telling me the input must be a pandas series, but I thought Alteryx.write() required a pandas dataframe, which is exactly what df is. If anyone can point me in the right direction it would be much appreciated.
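(As an aside, the nested find-and-replace loop in the script above can be collapsed into a single rename call built from the template columns. A minimal sketch, using made-up stand-in data but the same Original/Translation column names as the template:)

```python
import pandas as pd

# Hypothetical stand-ins for the translation template and a dataset
df_trans = pd.DataFrame({
    "Original": ["名前", "住所"],
    "Translation": ["Name", "Address"],
})
df = pd.DataFrame({"名前": ["Taro"], "住所": ["Tokyo"]})

# Build an Original -> Translation mapping and rename in one call;
# rename() silently ignores mapping keys that aren't in df.columns
mapping = dict(zip(df_trans["Original"], df_trans["Translation"]))
df = df.rename(columns=mapping)
print(list(df.columns))  # -> ['Name', 'Address']
```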
Does the same error occur if you replace
Alteryx.write(df, 1)
with
Alteryx.write(df_2401_2t, 1)
Could you share the 2401_2t xlsx file? My feeling is there's a value in this that's confusing Alteryx.
You could also try doing df = df.reset_index() just before writing the output. Though it's probably not that.
Thanks Philip. I tried writing df_2401_2t as you suggested and that works fine:
SUCCESS: writing outgoing connection data 1
I also tried df.reset_index(), but I still get the same error. Unfortunately I can't share the .xlsx files because they contain proprietary data.
We encountered this issue in datasets with duplicate column names.
When a dataframe has duplicate column names, selecting one of them with pandas_df[colname] (as Alteryx.write() does internally) returns a DataFrame rather than a Series, which is exactly what the firstValidValue() check in your traceback is complaining about.
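You can see the mechanism in plain pandas (a minimal sketch, not the Alteryx internals themselves): with a duplicated header, single-column selection no longer returns a Series.

```python
import pandas as pd

# Two columns that share the same name, e.g. after a raw header import
df = pd.DataFrame([[1, 2], [3, 4]], columns=["col", "col"])

# With unique names this would be a Series; with duplicates it's a DataFrame
selected = df["col"]
print(type(selected))  # <class 'pandas.core.frame.DataFrame'>

# This mirrors the isinstance check in the traceback that raises the TypeError
print(isinstance(selected, pd.Series))  # False
```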
Here's the Python function we used to "uniquefy" the column names, typically called just prior to writing the output.
Good luck 🙂
def make_columns_unique(df):
    """Checks column names in df argument and renames incrementally.

    Duplicate columns will be renamed with an incrementing integer.
    For instance col | col | col will be renamed col | col.1 | col.2

    -- df: Pandas dataframe to have its columns "uniquefied".

    Returns the dataframe with newly renamed columns.
    """
    # Get the full column list
    columns_all = df.columns
    # Use a set to get the unique column names
    columns_unique = set(columns_all)
    # If the two match in length, there are no duplicates, return the df
    if len(columns_all) == len(columns_unique):
        return df
    # If the lengths differ, there is at least one duplicate.
    # Create a dictionary of the unique names to keep track of their counts
    columns_unique_counter = {}
    for column in columns_unique:
        columns_unique_counter[column] = 0
    # Loop through the columns in the df and
    # add the column names (renamed or not) to a new list
    columns_new = []
    for name in columns_all:
        name_counter = columns_unique_counter[name]
        if name_counter != 0:
            # This column name has been seen before; rename it in the list
            columns_new.append(f"{name}.{name_counter}")
        else:
            columns_new.append(name)
        # Whether the column was renamed or not, increment the counter
        columns_unique_counter[name] += 1
    # Set the new column names
    df.columns = columns_new
    return df
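(For reference, the same dedup can also be expressed with pandas built-ins rather than a hand-rolled counter; a sketch of the idea, grouping the column names by themselves and using the cumulative count as the suffix:)

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3]], columns=["col", "col", "col"])

# cumcount() numbers repeats within each group of identical names: 0, 1, 2, ...
names = pd.Series(df.columns)
counts = names.groupby(names).cumcount()
df.columns = [n if c == 0 else f"{n}.{c}" for n, c in zip(names, counts)]
print(list(df.columns))  # -> ['col', 'col.1', 'col.2']
```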
This makes sense, I'll give it a shot. Thanks!
This was precisely my problem: duplicate headers. I just printed df.columns, pasted the list into Excel, and ran a "find duplicates" conditional format. BOOM. Popped right out.
Alteryx needs to come up with a better answer or make this a prominent find in their help. I had to use Google to get to this solution!