In this piece, I detail the process and results of modeling the Zillow Home Value Index (ZHVI) by ZIP Code looking forward three, six, and twelve months. The models presented here may serve as a useful tool for people considering buying or selling a home in the next twelve months within a ZIP Code covered by the models.
First, it’s important to understand what is being modeled and why. A foundational concept is Zillow’s Zestimate, which is “the estimated market value for an individual home calculated for about 100 million homes nationwide. The Zestimate is automatically computed daily based on millions of public and user-submitted data points.” In each geographic region (ZIP Code in this case), Zillow has a Zestimate of the value of each home. Using these estimated values, the Zillow Home Value Index (ZHVI) is generated by finding “the median Zestimate for a given geographic area on a given day.”
A successful predictive model for the Zillow Home Value Index for a region would have desirable properties. As Zillow explains, “How does the Zillow Home Value Index affect my home? In many ways! If the Zillow Home Value Index for your county is $215,000 today and was $210,000 yesterday, this means that a typical home in your area is worth more today than yesterday. So, if you’re thinking of selling, you can evaluate your own home relative to the surrounding market, or if you’re buying, you can learn what’s happening in other markets.” The ZHVI provides context-specific information for homeowners about what is happening in their geographic region. As Zillow further explains, “Is the Zillow Home Value Index the best indicator of tracking real estate markets? We feel it is, because with the Zestimate, we have an estimate of the current value of every home in the area and, thus, can estimate what the median sale price of the whole area would be if every home were sold on the same day: It would approximately equal the median Zestimate, or Zillow Home Value Index for that area.” Predicting changes in the ZHVI would therefore carry predictive weight for the change in value of an individual home within a region.
Before detailing the data used, it’s worth clarifying that this analysis is focused on the “micro” drivers of changes in housing prices rather than the “macro” — that is to say, it is concerned with modeling the regional dynamics of housing values (the supply and demand for homes in the region, the pricing of alternatives like renting instead of buying, etc.) over the short-run. Therefore, it ignores features like interest rates, long-run population changes, and building plans.
This analysis uses the following features, pulled from Zillow by ZIP Code at a monthly interval from January 2013 through April 2019:
- Median listing price
- Percent of listings with price reductions
- Median percent of price reduction
- Monthly for-sale listings
- Median daily for-sale inventory
- Price-to-rent ratio (“This ratio is first calculated at the individual home level, where the estimated home value is divided by 12 times its estimated monthly rent price. The median of all home-level price-to-rent ratios for a given region is then calculated.”)
- Buyer Seller Index Cross-Time (“This index combines the median sale-to-list price ratio, the share of listings with a price cut, and the median time on market for homes in a given region and month. The cross-time version of the index compares this value with the region’s own historical values; the cross-region version compares this value with other regions’ current values. Higher values indicate a relative sellers market in both cases.”)
Initially, I also included the share of all sales in which the home was previously foreclosed upon as a feature, but the data was only current through January 2019 for almost all geographies, meaning that I couldn’t use it to make predictions from the current time looking forward. There were also additional data points that were interesting, like age of inventory and sale-to-list ratio, but they weren’t available at the granularity of ZIP Code.
After gathering the data, I applied a series of scripts in Python to process and clean the data, then load it into Pandas DataFrames. Next, I joined the data by ZIP Code and year+month to create a table for each ZIP Code with each row representing a month in the time period and the columns representing the value of each feature. Finally, I generated my target values by creating a lead/lag column on the ZHVI for three months, six months, and twelve months ahead of the current month.
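The join and lead-column steps can be sketched roughly as follows; the column names (`zip`, `month`, `zhvi`) and the shape of the input frames are illustrative assumptions, not the actual schema used:

```python
import pandas as pd

def build_zip_table(frames, horizons=(3, 6, 12)):
    """Join per-feature DataFrames on ZIP Code and month, then add
    lead columns of the ZHVI to serve as prediction targets."""
    # frames: dict of feature name -> DataFrame with columns
    # ["zip", "month", <value>]; "zhvi" must be one of the keys.
    merged = None
    for name, df in frames.items():
        df = df.rename(columns={df.columns[-1]: name})
        merged = df if merged is None else merged.merge(
            df, on=["zip", "month"], how="inner")
    merged = merged.sort_values(["zip", "month"])
    # A lead of k months: shift the ZHVI up by k rows within each ZIP,
    # so each row also holds the ZHVI k months in the future.
    for k in horizons:
        merged[f"zhvi_lead_{k}"] = merged.groupby("zip")["zhvi"].shift(-k)
    return merged
```

The final rows of each ZIP Code's table naturally end up with missing targets (there is no ZHVI twelve months past the end of the data), which is expected for the lead/lag construction.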
The graphic below illustrates what is being modeled. Feature data from January 2013 is used to forecast what the ZHVI value will be in April 2013, July 2013, and January 2014.
Not all of the ZIP Codes in the data set could be modeled. Some of the ZIP Codes had long stretches of missing data for one or more features, which I excluded with a script that iterated over all ZIP Codes and removed ones with these data gaps. Additionally, in many of the ZIP Codes, the ZHVI was monotonically increasing (or very nearly so) over the period studied, meaning the future predictive value of a model trained on the historical data would be limited. After eliminating candidates in this two-step process, I focused on the ten largest remaining ZIP Codes (as ranked by Zillow) to assess the predictive value that modeling could provide: 37042 (Clarksville, TN), 77479 (Sugar Land, TX), 87114 (Albuquerque, NM), 06010 (Bristol, CT), 33908 (Fort Myers, FL), 97229 (Portland, OR), 29732 (Rock Hill, SC), 28314 (Fayetteville, NC), 40475 (Richmond, KY), 19111 (Philadelphia, PA).
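The two-part eligibility filter can be sketched like this; the 0.95 threshold for "nearly monotonic" is an illustrative assumption, not the cutoff actually used:

```python
import pandas as pd

def eligible_zips(table, tol=0.95):
    """Return ZIP Codes with no missing feature values and a
    non-monotonic ZHVI. `table` is the joined monthly table with
    one row per (zip, month)."""
    keep = []
    for zip_code, grp in table.groupby("zip"):
        grp = grp.sort_values("month")
        # Exclude ZIPs with gaps in any feature column.
        if grp.drop(columns=["zip", "month"]).isna().any().any():
            continue
        # Exclude ZIPs whose ZHVI is (nearly) monotonically increasing:
        # if almost every month-over-month change is non-negative, a
        # model learns little beyond "it goes up".
        frac_up = (grp["zhvi"].diff().dropna() >= 0).mean()
        if frac_up >= tol:
            continue
        keep.append(zip_code)
    return keep
```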
The chart below shows the ZHVI over time from January 2013 to April 2019 for the city associated with each ZIP Code. For each series of values, I indexed the start date to a value of 100 to make the change in ZHVI between the ZIP Codes comparable (ex: an ending ZHVI of 130 would represent a 30% increase while an ending ZHVI of 80 would represent a 20% decrease).
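Rebasing each series to 100 at the start date amounts to a one-line transformation:

```python
import pandas as pd

def index_to_100(zhvi):
    """Rebase a ZHVI series so its first value equals 100, making
    percentage changes comparable across ZIP Codes."""
    return 100 * zhvi / zhvi.iloc[0]
```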
I standardized every feature other than the Buyer Seller Index relative to the same ZIP Code’s own values across time. The Zillow Buyer Seller Index is already scaled across time for a specific geography and does not need to be further modified.
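As a sketch, the per-ZIP scaling with scikit-learn's `StandardScaler`, applied to one ZIP Code's table at a time, looks like this (the column names are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def scale_features(zip_table, feature_cols):
    """Standardize each feature relative to this ZIP Code's own
    history (zero mean, unit variance across time). The Buyer Seller
    Index is excluded from feature_cols because Zillow already
    scales it across time."""
    scaled = zip_table.copy()
    scaled[feature_cols] = StandardScaler().fit_transform(
        zip_table[feature_cols])
    return scaled
```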
At the outset, I imagined that my models would support an interface that would tell users how likely different scenarios would be going forward (ex: how likely is it that the value of my home increases by more than 2% over the next twelve months?), so I turned the problem of forecasting the ZHVI into a series of binary classification problems for each ZIP Code, each of which could yield a probability:
- How likely is it that the ZHVI will be higher in three months?
- How likely is it that the ZHVI will be higher in six months?
- How likely is it that the ZHVI will be higher in twelve months?
- How likely is it that the ZHVI will increase >2% in twelve months?
- How likely is it that the ZHVI will increase >5% in twelve months?
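All five labels can be derived from the lead columns in one pass; the column names here follow the lead/lag convention sketched earlier and are assumptions for illustration:

```python
import pandas as pd

def make_targets(table):
    """Turn the lead ZHVI columns into binary labels, one per
    classification scenario."""
    t = table.copy()
    t["up_3m"] = (t["zhvi_lead_3"] > t["zhvi"]).astype(int)
    t["up_6m"] = (t["zhvi_lead_6"] > t["zhvi"]).astype(int)
    t["up_12m"] = (t["zhvi_lead_12"] > t["zhvi"]).astype(int)
    # ">2%" and ">5%" thresholds are relative to the current ZHVI.
    t["up_12m_2pct"] = (t["zhvi_lead_12"] > 1.02 * t["zhvi"]).astype(int)
    t["up_12m_5pct"] = (t["zhvi_lead_12"] > 1.05 * t["zhvi"]).astype(int)
    return t
```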
I also wanted my models to be geography specific, so I trained an instance for each geography. To select and tune the models, I created a meta function that would iterate over the eligible ZIP Codes and fit four different types of models: logistic regression, support vector machine, random forest, and XGBoost. My meta function also incorporated hyperparameter tuning using GridSearchCV. Given that the classes were often imbalanced, I elected to optimize my models for the area under the receiver operating characteristics curve.
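A condensed sketch of such a meta function is below. XGBoost is omitted to keep the example's dependencies to scikit-learn alone, and the parameter grids are placeholders, not the grids actually searched:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical model families and search grids, for illustration only.
CANDIDATES = {
    "logreg": (LogisticRegression(max_iter=1000), {"C": [0.1, 1, 10]}),
    "svm": (SVC(probability=True), {"C": [0.1, 1, 10]}),
    "rf": (RandomForestClassifier(random_state=0),
           {"n_estimators": [50, 100]}),
}

def fit_best_model(X, y, cv=3):
    """Grid-search each candidate model family, scoring on ROC AUC
    (chosen because the classes are often imbalanced), and return
    the best (estimator, score, name) found across families."""
    best = None
    for name, (model, grid) in CANDIDATES.items():
        search = GridSearchCV(model, grid, scoring="roc_auc", cv=cv)
        search.fit(X, y)
        if best is None or search.best_score_ > best[1]:
            best = (search.best_estimator_, search.best_score_, name)
    return best
```

In practice one such search would run per ZIP Code and per classification scenario, since each geography gets its own model instance.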
The results presented show the ten largest ZIP Codes that meet the requirements of having complete data (including all data from the most recent month, April 2019) and having at least one instance of each class for all five scenarios modeled. The table below shows the performance of the models for each ZIP Code (two-thirds of the data was randomly sampled for training, and testing was done against the remaining third). The AUC corresponding to the given binary classification problem is shown along with the number of instances in each class, shown as larger/smaller (ex: 14/8) below.
In almost all cases, the models perform significantly better than chance. The models perform particularly well in scenarios involving roughly balanced classes.
Caveats + Future Work
Any model that deals with time series financial data must be accompanied by the caveat that past performance is not indicative of future results. A point in favor of the models presented here, however, is the reasonable economic relationship between the input features and the output target. Within quantitative finance, there is a strong preference for models that can be plausibly explained by a real-world phenomenon that the model is capturing. In this case, it is reasonable to say that the features approximate aspects of supply and demand within a geographic region, which should have an impact on future prices.
To extend my analysis, I am experimenting with augmenting the Zillow data with additional external data. Additionally, I am investigating the transferability of models across geographic regions to make better predictions for markets that have only experienced increases over the period studied.
I am in the process of developing a web interface for my models and will update this article with the link when complete.