Improve the Help documentation or in-tool options for handling null values in statistical tools like Weighted Average or Linear Regression. For instance, add a checkbox to remove records with null values, or at least warn users when nulls are present.
In the process of learning to perform linear regression in RStudio and Alteryx, I came across differing outputs depending on how null values were handled. Take the Weighted Average tool, for example.
In R, the weighted.mean function can treat null (NA) values in the variable of interest as if they were not there, but only when the user requests it with na.rm = TRUE. If null values exist and the user does not specify this, the result is NA. If any null values exist in the weight field, the result is NA.
Since I am more familiar with Alteryx, I originally did the data preparation—including calculating the weighted means—in Alteryx. When comparing these weighted means with those generated in R, I found that Alteryx treats the null values as zeros (i.e. includes them in the calculation). The user would have to know this is incorrect and first filter out the null values. See screenshot examples.
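To make the difference concrete, here is a minimal Python sketch (not Alteryx or R code) that computes a weighted mean both ways. The function names and the sample data are made up for illustration, and the "nulls as zeros" version only mimics the behavior I observed; it is an assumption about what the tool does internally, not a description of its implementation.

```python
from typing import Optional, Sequence


def weighted_mean_nulls_as_zero(
    values: Sequence[Optional[float]], weights: Sequence[float]
) -> float:
    """Count missing values as 0 and keep their weights in the denominator."""
    total = sum((0.0 if v is None else v) * w for v, w in zip(values, weights))
    return total / sum(weights)


def weighted_mean_drop_nulls(
    values: Sequence[Optional[float]], weights: Sequence[float]
) -> float:
    """Drop records whose value is missing, together with their weights."""
    pairs = [(v, w) for v, w in zip(values, weights) if v is not None]
    return sum(v * w for v, w in pairs) / sum(w for _, w in pairs)


# Hypothetical data: the second record has a null value but a large weight.
values = [10.0, None, 30.0]
weights = [1.0, 2.0, 1.0]

as_zero = weighted_mean_nulls_as_zero(values, weights)  # (10 + 0 + 30) / 4
dropped = weighted_mean_drop_nulls(values, weights)     # (10 + 30) / 2

print(as_zero, dropped)
```

With this data the two approaches give 10.0 and 20.0, so a single heavily weighted null record is enough to halve the result.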
This is also the case within the Linear Regression tool. If null values are not omitted prior to regression, the results are wildly different. Perhaps this is known by more experienced users/statisticians, but this incorrect usage would have gone on unbeknownst to me had I not cross-checked with RStudio.
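The same effect can be sketched for a simple one-variable regression. The Python below is only an illustration: the ols_fit helper and the data are hypothetical, and it assumes that nulls in the target are filled with zeros, which is my reading of the behavior rather than something confirmed in the documentation.

```python
from typing import List, Tuple


def ols_fit(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical data following y = 2x, with one missing target value.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, None, 8.0]

# Option 1: omit the record with the null before fitting.
kept = [(x, y) for x, y in zip(xs, ys) if y is not None]
slope_dropped, _ = ols_fit([x for x, _ in kept], [y for _, y in kept])

# Option 2 (assumed behavior): treat the null as zero and fit on all records.
slope_zero, _ = ols_fit(xs, [0.0 if y is None else y for y in ys])

print(slope_dropped, slope_zero)
```

Dropping the null record recovers a slope of about 2, while zero-filling pulls it down to about 1.4, even though only one of four records was affected.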