Alteryx.write error - "input must be a pandas series"

I have multiple datasets with column headers in Japanese. I ran the headers through Google Translate and put the original and translated header values in a spreadsheet so that when we receive updated datasets in the future, I can use this as a template to easily find and replace headers with the translation. I created what I thought was a pretty straightforward Python script to load each dataset into a dataframe, union all of the dataframes based on the common set of fields between them, and then rename the headers with the translations.

from ayx import Alteryx
import pandas as pd

# Import data & translation template
df_2401_2t = pd.read_excel('**filepath + sheet name**')
df_2401_4t = pd.read_excel('**filepath + sheet name**')
df_2601 = pd.read_excel('**filepath + sheet name**')
df_trans = pd.read_excel('**filepath + sheet name**')

frames = [df_2401_2t, df_2401_4t, df_2601]

# Union df on common field names
df = pd.concat(frames, join = 'inner', ignore_index = True)

# Find & replace headers with translations
for col in df.columns:
    for i in df_trans.index:
        if col == df_trans['Original'][i]:
            df.rename({col: df_trans['Translation'][i]}, axis = 1, inplace = True)

# Output df to workflow
Alteryx.write(df, 1)

However I'm getting the following error when I try to write df back out to Alteryx:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-19-775ca6c91598> in <module>
----> 1 Alteryx.write(df, 1)

e:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\export.py in write(pandas_df, outgoing_connection_number, columns, debug, **kwargs)
     86     """
     87     return __CachedData__(debug=debug).write(
---> 88         pandas_df, outgoing_connection_number, columns=columns, **kwargs
     89     )
     90 

e:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\CachedData.py in write(self, pandas_df, outgoing_connection_number, columns, output_filepath)
    430             # (check only first non-null value in column -- tradeoff for efficiency)
    431             col_contains_bytearrays = coltype == "object" and isinstance(
--> 432                 firstValidValue(pandas_df[colname]), (bytearray, bytes)
    433             )
    434             try:

e:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\DataUtils.py in firstValidValue(pd_series)
     46 def firstValidValue(pd_series):
     47     if not isinstance(pd_series, pd.core.series.Series):
---> 48         raise TypeError(f"input must be a pandas series, not a {type(pd_series)}")
     49     if hasattr(pd_series, "first_valid_index"):
     50         first_valid_index = pd_series.first_valid_index()

TypeError: input must be a pandas series, not a <class 'pandas.core.frame.DataFrame'>

Not really following here, it's telling me input must be a pandas series but I thought Alteryx.write() required a pandas dataframe, which is exaclty what df is? If anyone can point me in the right direction it would be much appreciated.

Python

Accepted answers

Ross_K

We encountered this issue in datasets with duplicate column names.

Something about Alteryx.write() does not appreciate the duplicates and cannot deal with them.

Here's the Python function we used to "uniquefy" the column names, typically called just prior to writing the output.

Good luck 🙂

def make_columns_unique(df):
    """Checks column names in df argument and renames incrementally.

    Duplicate columns will be renamed with an incrementing integer.
    For instance col | col | col will be renamed col | col.1 | col.2

    -- df: Pandas dataframe to have it's columns "uniquefied".

    Returns the dataframe with newly renamed columns.
    """
    # Get the full column list
    columns_all = df.columns
    # Use a set to get the unique column names
    columns_unique = set(columns_all)
    # If the two match in length, there are no duplicates, return the df
    if len(columns_all) == len(columns_unique):
        return df
    # If the lengths differ, there is at least one duplicate
    # Create a dictionary of those unique names to keep track of their counts
    columns_unique_counter = {}
    for column in columns_unique:
        columns_unique_counter[column] = 0
    # Loop through the columns in the df
    # Add the column names (renamed or not) to a new list
    columns_new = []
    for name in columns_all:
        name_counter = columns_unique_counter[name]
        if name_counter != 0:
            # This column name has been seen before
            # Rename the column in the list
            columns_new.append(f"{name}.{name_counter}")
        else:
            columns_new.append(name)
        # Whether the column was renamed or not, increment the counter
        columns_unique_counter[name] = columns_unique_counter[name] + 1
    # Set the new column names
    df.columns = columns_new
    return df

All comments

philipmannering

Does the same error occur if you replace,

Alteryx.write(df, 1)

with,

Alteryx.write(df_2401_2t, 1)

Could you share the 2401_2t xlsx file? My feeling is there's a value in this that's confusing Alteryx.

You could also try doing df = df.reset_index() just before writing the output. Though it's probably not that.

SWood

Thanks Philip. I tried writing df_2401_2t as you suggested and that works fine:

SUCCESS: writing outgoing connection data 1

I also tried df.reset_index(), but I still get the same error. Unfortunately I can't share the .xlsx files because they contain proprietary data.

Ross_K

We encountered this issue in datasets with duplicate column names.

Something about Alteryx.write() does not appreciate the duplicates and cannot deal with them.

Here's the Python function we used to "uniquefy" the column names, typically called just prior to writing the output.

Good luck 🙂

def make_columns_unique(df):
    """Checks column names in df argument and renames incrementally.

    Duplicate columns will be renamed with an incrementing integer.
    For instance col | col | col will be renamed col | col.1 | col.2

    -- df: Pandas dataframe to have it's columns "uniquefied".

    Returns the dataframe with newly renamed columns.
    """
    # Get the full column list
    columns_all = df.columns
    # Use a set to get the unique column names
    columns_unique = set(columns_all)
    # If the two match in length, there are no duplicates, return the df
    if len(columns_all) == len(columns_unique):
        return df
    # If the lengths differ, there is at least one duplicate
    # Create a dictionary of those unique names to keep track of their counts
    columns_unique_counter = {}
    for column in columns_unique:
        columns_unique_counter[column] = 0
    # Loop through the columns in the df
    # Add the column names (renamed or not) to a new list
    columns_new = []
    for name in columns_all:
        name_counter = columns_unique_counter[name]
        if name_counter != 0:
            # This column name has been seen before
            # Rename the column in the list
            columns_new.append(f"{name}.{name_counter}")
        else:
            columns_new.append(name)
        # Whether the column was renamed or not, increment the counter
        columns_unique_counter[name] = columns_unique_counter[name] + 1
    # Set the new column names
    df.columns = columns_new
    return df

SWood

This makes sense, I'll give it a shot. Thanks!

BretCarr

This was precisely my problem--duplicate headers. I just did a df.columns look, pasted it into Excel, cut it up, did a find duplicates conditional format. BOOM. Popped right out.

Alteryx needs to come up with a better answer or make this a promenent find in their help. I had to use Google to get to this solution!

Quick Links

This months top contributors

atcodedog05 19598

Qiu 15879

binu_acs 15708

MarqueeCrew 13708

apathetichell 13703