Help build workflow for feature engineering property data

Question

* I'm running 2020.2.3 so do not have the automated feature engineering tool available yet.

I’m interested in learning how others would tackle a workflow I need to build. I'm working on an ML model for real estate properties and have a couple of ideas for features that I think will be important to test out.

1) The uniqueness of the property’s characteristics to the area or neighborhood.

2) The availability of data, especially similar property data in the area. In other words, how many similar homes have recently sold, are pending sale or actively listed.

To tackle this, I have raw listing service data available to me. It’s many millions of records, so the workflow needs to be as optimized as possible, and it will run on Alteryx Server.

Part 1: I think the following are of high importance, specifically how far each differs from the area mean: bath count, lot size, year built, and livable square feet. I’d be comparing a specific property to the area. Those fields are:

- FA_BATHSTOTAL

- LOTSIZEAREASQFEET

- FA_YEARBUILT

- FA_SQUAREFEET
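
As a rough sketch of the Part 1 comparison, here is how the distance of each property from its area mean could be expressed in pandas as a z-score per ZIP code. The function name `add_area_zscores`, the `_Z` column suffix, and the sample rows are made up for illustration; the same group-by logic should map onto Summarize, Join, and Formula tools in a workflow.

```python
import pandas as pd

FEATURES = ["FA_BATHSTOTAL", "LOTSIZEAREASQFEET", "FA_YEARBUILT", "FA_SQUAREFEET"]

def add_area_zscores(df, area_col="ZIP2", features=FEATURES):
    """Append <feature>_Z columns: distance from the area mean in standard deviations."""
    out = df.copy()
    grouped = df.groupby(area_col)
    for col in features:
        mean = grouped[col].transform("mean")
        std = grouped[col].transform("std")  # sample standard deviation (ddof=1)
        # A zero or missing std (constant or single-record areas) gives NaN, not inf
        out[col + "_Z"] = (df[col] - mean) / std.where(std > 0)
    return out

# Made-up rows, only to show the shape of the result
sample = pd.DataFrame({
    "ZIP2": ["90210"] * 3 + ["10001"] * 3,
    "FA_BATHSTOTAL": [2, 2, 5, 1, 1, 1],
    "LOTSIZEAREASQFEET": [5000, 6000, 7000, 2000, 2500, 3000],
    "FA_YEARBUILT": [1980, 1990, 2000, 1950, 1960, 1970],
    "FA_SQUAREFEET": [1500, 2000, 2500, 800, 900, 1000],
})
scored = add_area_zscores(sample)
print(scored[["ZIP2", "FA_BATHSTOTAL_Z", "LOTSIZEAREASQFEET_Z"]])
```

A large absolute z-score on any of the four fields flags a property as unusual for its area. Swapping `area_col` to `CENSUSTRACT` would give the tract-level version where that field is populated.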

Part 2: I need to look at the availability of data in the area, specifically the comparable data. How many similar homes have recently closed (last six months)? How many similar active listings are there? How tight are the sale prices and active listing prices? To do that, I can use the following fields:

- LISTPRICE

- STATE2

- ZIP2

- CENSUSTRACT (if available)

- PROPERTY TYPE

- FA_BATHSTOTAL

- LOTSIZEAREASQFEET

- FA_YEARBUILT

- FA_SQUAREFEET

- FA_LISTDATE

- FA_CLOSEDATE

- FA_CONTRACTDATE

To make sense of this, I need to determine how many sales, pending sales (contract date within 6 months of today but no close date), and active listings (listing date within 12 months and no contract or close date) are available in the zip code OR census tract (if census data is widely available for the area). Then I need to determine whether the specific property is within one standard deviation of the area mean for age, lot size, square feet, and bath count. This should give some basic idea of the uniqueness of the specific property compared to homes around it.
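
The date rules above could be sketched roughly like this in pandas. The `STATUS` column, the `classify_status` helper, the fixed "today" date, and the sample rows are all assumptions for illustration, and "sold" is taken to mean a close date within the last six months.

```python
import numpy as np
import pandas as pd

def classify_status(df, today):
    """Label each record sold, pending, active, or other from its three dates."""
    today = pd.Timestamp(today)
    list_dt = pd.to_datetime(df["FA_LISTDATE"])
    contract_dt = pd.to_datetime(df["FA_CONTRACTDATE"])
    close_dt = pd.to_datetime(df["FA_CLOSEDATE"])

    six_months_ago = today - pd.DateOffset(months=6)
    twelve_months_ago = today - pd.DateOffset(months=12)

    # Comparisons against a missing date (NaT) evaluate to False
    sold = close_dt >= six_months_ago
    pending = close_dt.isna() & (contract_dt >= six_months_ago)
    active = close_dt.isna() & contract_dt.isna() & (list_dt >= twelve_months_ago)

    labels = np.select([sold, pending, active], ["sold", "pending", "active"], default="other")
    return pd.Series(labels, index=df.index, name="STATUS")

# Made-up rows, only to exercise each branch
listings = pd.DataFrame({
    "ZIP2": ["90210", "90210", "90210", "90210", "10001"],
    "FA_LISTDATE": ["2020-01-10", "2020-05-01", "2020-06-15", "2018-02-01", "2020-08-01"],
    "FA_CONTRACTDATE": ["2020-03-01", "2020-07-20", None, "2018-04-01", None],
    "FA_CLOSEDATE": ["2020-04-15", None, None, "2018-05-10", None],
})
listings["STATUS"] = classify_status(listings, today="2020-09-01")

# One row per ZIP, one column per status
counts = listings.groupby(["ZIP2", "STATUS"]).size().unstack(fill_value=0)
print(counts)
```

Grouping on `CENSUSTRACT` instead of `ZIP2` gives the tract-level counts where that field is available.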

The goal is to get the total sales, pending sales, and active listings in the zip code, plus the share of those that are similar to the specific property I’m modeling. These should be good indicators of available data and of the uniqueness of the specific property.
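
One way to picture the "share of similar records" idea for a single subject property is below. What counts as "similar" is not defined in this post, so the tolerances (baths within 1, year built within 10 years, square feet within 20%, lot size within 25%) are purely hypothetical placeholders, as are the helper names and sample rows. The price coefficient of variation is one possible reading of price "tightness".

```python
import pandas as pd

def similar_mask(pool, subject, bath_gap=1, year_gap=10, sqft_pct=0.20, lot_pct=0.25):
    """True for rows in the subject's ZIP that fall within the assumed tolerances."""
    sqft, lot = subject["FA_SQUAREFEET"], subject["LOTSIZEAREASQFEET"]
    return (
        (pool["ZIP2"] == subject["ZIP2"])
        & (pool["PROPERTY TYPE"] == subject["PROPERTY TYPE"])
        & ((pool["FA_BATHSTOTAL"] - subject["FA_BATHSTOTAL"]).abs() <= bath_gap)
        & ((pool["FA_YEARBUILT"] - subject["FA_YEARBUILT"]).abs() <= year_gap)
        & pool["FA_SQUAREFEET"].between(sqft * (1 - sqft_pct), sqft * (1 + sqft_pct))
        & pool["LOTSIZEAREASQFEET"].between(lot * (1 - lot_pct), lot * (1 + lot_pct))
    )

def comp_summary(pool, subject):
    """Total records in the subject's ZIP, how many are similar, and price tightness."""
    in_zip = pool["ZIP2"] == subject["ZIP2"]
    similar = similar_mask(pool, subject)
    total, n_similar = int(in_zip.sum()), int(similar.sum())
    prices = pool.loc[similar, "LISTPRICE"]
    return {
        "total_in_zip": total,
        "similar": n_similar,
        "similar_rate": n_similar / total if total else float("nan"),
        # Coefficient of variation: lower means tighter pricing among similar records
        "price_cv": prices.std() / prices.mean() if n_similar > 1 else float("nan"),
    }

# Made-up rows and subject property
pool = pd.DataFrame({
    "ZIP2": ["90210", "90210", "90210", "90210", "10001"],
    "PROPERTY TYPE": ["SFR", "SFR", "SFR", "Condo", "SFR"],
    "FA_BATHSTOTAL": [2, 3, 5, 2, 2],
    "FA_YEARBUILT": [1988, 1995, 2015, 1990, 1990],
    "FA_SQUAREFEET": [1900, 2300, 4500, 2000, 2000],
    "LOTSIZEAREASQFEET": [6500, 5000, 20000, 6000, 6000],
    "LISTPRICE": [500000, 520000, 1500000, 400000, 450000],
})
subject = {"ZIP2": "90210", "PROPERTY TYPE": "SFR", "FA_BATHSTOTAL": 2,
           "FA_YEARBUILT": 1990, "FA_SQUAREFEET": 2000, "LOTSIZEAREASQFEET": 6000}
summary = comp_summary(pool, subject)
print(summary)
```

This scores one subject at a time. Across millions of records a per-property loop would likely be too slow, so bucketing the fields and joining on ZIP plus bucket is probably a better fit for a server workflow.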

Attached is sample data, not real-world data.

Sample MLS workflow.yxzp

BigDataGeek · Answer

Hey Dan,

Thanks for the idea of looking at the predictive grouping tools. I'll start researching that. Ideas on ways to approach this that I haven't considered are exactly what I'm after. I updated the packaged workbook to include the link. What I'm really after is how hard it is to value a particular property. I believe the uniqueness of the property and the availability of similar property data are likely to contribute heavily to that difficulty.

danilang · Answer

Hi @BigDataGeek

It looks like you're trying to predict future opportunities based on past data. There is no one-size-fits-all solution to this problem. You need to analyze your existing data and apply a predictive method based on its characteristics. Good places to start investigating this are the Predictive Grouping and Predictive Modeling sections of the interactive training videos. You think that the 4 fields mentioned in Part 1 are significant, but proper analysis of the historical data will tell you whether they are and which fields may be more significant, and it will also catch correlations between the fields, e.g. lot size and house size are probably strongly correlated, so you might exclude one of them.
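
A quick way to check for that kind of correlation, sketched in pandas with made-up numbers: compute the pairwise Pearson correlations of the four Part 1 fields and list the pairs above a cutoff. The 0.8 threshold and the `highly_correlated_pairs` helper are assumptions for illustration, not a rule.

```python
import pandas as pd

FEATURES = ["FA_BATHSTOTAL", "LOTSIZEAREASQFEET", "FA_YEARBUILT", "FA_SQUAREFEET"]

def highly_correlated_pairs(df, features=FEATURES, threshold=0.8):
    """Return (field_a, field_b, r) for pairs whose absolute Pearson r meets the threshold."""
    corr = df[features].corr()  # Pearson correlation by default
    pairs = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:
            r = float(corr.loc[a, b])
            if abs(r) >= threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Made-up rows where lot size is exactly three times the house size
homes = pd.DataFrame({
    "FA_BATHSTOTAL": [2, 1, 3, 1, 2, 2],
    "LOTSIZEAREASQFEET": [3000, 4500, 6000, 7500, 9000, 10500],
    "FA_YEARBUILT": [1990, 1950, 1960, 2005, 1975, 1980],
    "FA_SQUAREFEET": [1000, 1500, 2000, 2500, 3000, 3500],
})
pairs = highly_correlated_pairs(homes)
print(pairs)
```

Any pair that shows up here is a candidate for dropping one field, or for combining the two into a ratio such as square feet per lot square foot.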

BTW, your attached sample is empty because you didn't include the .xlsx file as an asset in the package.

Dan