* I'm running 2020.2.3 so do not have the automated feature engineering tool available yet.
I’m interested in learning how others would tackle a workflow I need to build. I'm working on an ML model for real estate properties and have a couple of ideas that I think will be important features to test out.
1) The uniqueness of the property’s characteristics to the area or neighborhood.
2) The availability of data, especially similar property data in the area. In other words, how many similar homes have recently sold, are pending sale or actively listed.
To tackle this, I have raw level listing service data available to me. It’s many millions of records, so the workflow needs to be as optimized as possible and will run on Alteryx Server.
Part 1: I think the following characteristics matter most, measured by how far the property differs from the area mean: bath count, lot size, year built and livable square feet. I’d be comparing a specific property to its area. Those fields are:
- FA_BATHSTOTAL
- LOTSIZEAREASQFEET
- FA_YEARBUILT
- FA_SQUAREFEET
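To make Part 1 concrete, here is a rough sketch in plain Python (not Alteryx tools) of the comparison I have in mind: a z-score for each field, i.e. how many standard deviations the subject property sits from the area mean. The function name `area_z_scores`, the dict-per-record layout and the choice of population standard deviation are my own assumptions for illustration.

```python
from statistics import mean, pstdev

# Field names from the listing data; everything else here is illustrative.
FEATURES = ["FA_BATHSTOTAL", "LOTSIZEAREASQFEET", "FA_YEARBUILT", "FA_SQUAREFEET"]

def area_z_scores(subject, area_records, features=FEATURES):
    """Return {field: z-score} for the subject property versus its area.

    A field maps to None when the subject value is missing or fewer than
    two area records carry that field.
    """
    scores = {}
    for field in features:
        values = [r[field] for r in area_records if r.get(field) is not None]
        if subject.get(field) is None or len(values) < 2:
            scores[field] = None
            continue
        sd = pstdev(values)  # assumption: population standard deviation
        # If every home in the area is identical, treat the subject as "at the mean".
        scores[field] = (subject[field] - mean(values)) / sd if sd > 0 else 0.0
    return scores
```

A property would then count as "within one standard deviation" on a field when the absolute z-score is at most 1.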
Part 2: I need to look at the availability of data in the area and specifically the comparable available data. How many similar homes have recently closed (last six months). How many similar active listings are there? What is the tightness of the sale prices and active listing prices? To do that, I can use the following fields:
- LISTPRICE
- STATE2
- ZIP2
- CENSUSTRACT (if available)
- PROPERTY TYPE
- FA_BATHSTOTAL
- LOTSIZEAREASQFEET
- FA_YEARBUILT
- FA_SQUAREFEET
- FA_LISTDATE
- FA_CLOSEDATE
- FA_CONTRACTDATE
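For the "tightness" of prices, one option I'm considering is the coefficient of variation (standard deviation divided by mean), where a smaller number means prices cluster more tightly. This is only a sketch: `price_tightness` is a made-up name, and I'm assuming it would be fed LISTPRICE values for one group of comparable homes at a time.

```python
from statistics import mean, pstdev

def price_tightness(prices):
    """Coefficient of variation of a list of prices (smaller = tighter).

    Missing or non-positive prices are ignored; returns None when fewer
    than two usable prices remain.
    """
    usable = [p for p in prices if p is not None and p > 0]
    if len(usable) < 2:
        return None
    return pstdev(usable) / mean(usable)
```

Because the result is a ratio, it can be compared between an expensive zip code and a cheap one without rescaling.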
To make sense of this, I need to determine how many sales, pending sales (contract date within 6 months of today but no close date) and active listings (listing date within 12 months and no contract or close date) are available in the zip code OR census tract (if census data is widely available for the area). Then I need to determine whether the specific property is within one standard deviation of the area mean for age, lot size, square feet and bath count. This should give a basic idea of how unique the specific property is compared to the homes around it.
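The status rules above could be sketched like this in plain Python. The names `listing_status` and `status_counts_by_area` are hypothetical, and I'm assuming the date fields arrive as `datetime.date` values (or None when empty), that six months is roughly 183 days, and that a sale counts only if it closed in the last six months.

```python
from collections import Counter
from datetime import date, timedelta

def listing_status(rec, today):
    """Classify one record as 'sold', 'pending', 'active', or None (too old / no dates)."""
    six_months_ago = today - timedelta(days=183)      # assumption: ~6 months
    twelve_months_ago = today - timedelta(days=365)   # assumption: ~12 months
    close = rec.get("FA_CLOSEDATE")
    contract = rec.get("FA_CONTRACTDATE")
    listed = rec.get("FA_LISTDATE")
    if close is not None:
        return "sold" if close >= six_months_ago else None
    if contract is not None:
        return "pending" if contract >= six_months_ago else None
    if listed is not None and listed >= twelve_months_ago:
        return "active"
    return None

def status_counts_by_area(records, today, area_key="ZIP2"):
    """Return {area: Counter of statuses}, skipping records with no current status."""
    counts = {}
    for rec in records:
        status = listing_status(rec, today)
        if status is not None:
            counts.setdefault(rec[area_key], Counter())[status] += 1
    return counts
```

Passing `area_key="CENSUSTRACT"` would group by census tract instead, where that field is populated.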
The goal is to get the total sales, pending and active listings in the zip code and the rate of those that are like the specific property I’m modeling for. These should be good indicators of available data and uniqueness of the specific property.
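As a sketch of the "rate of those that are like the specific property", something along these lines is what I picture. The tolerance values are placeholders I made up purely for illustration (not tested thresholds), and `is_similar` / `similar_rate` are hypothetical names.

```python
# Placeholder tolerances: "abs" is an absolute difference, "pct" is a
# fraction of the subject property's own value. All values are assumptions.
TOLERANCES = {
    "FA_BATHSTOTAL": ("abs", 1),
    "FA_YEARBUILT": ("abs", 10),
    "FA_SQUAREFEET": ("pct", 0.20),
    "LOTSIZEAREASQFEET": ("pct", 0.25),
}

def is_similar(subject, comp, tolerances=TOLERANCES):
    """True when comp is within every tolerance of the subject property."""
    for field, (kind, amount) in tolerances.items():
        s, c = subject.get(field), comp.get(field)
        if s is None or c is None:
            return False  # a missing value cannot be confirmed as similar
        limit = amount if kind == "abs" else abs(s) * amount
        if abs(s - c) > limit:
            return False
    return True

def similar_rate(subject, comps, tolerances=TOLERANCES):
    """Return (similar_count, similar_count / total); rate is None if comps is empty."""
    if not comps:
        return 0, None
    n = sum(is_similar(subject, c, tolerances) for c in comps)
    return n, n / len(comps)
```

Running this separately on the sold, pending and active groups for an area would give both the raw counts and the share of each group that resembles the subject property.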
Attached is sample data, not real-world data.