United States house price analysis and investment consulting with Zillow data

In this article, we act as consultants for real estate investors. We build models for house price prediction, investment risk control, and time series analysis, and we plot graphs along the way to support the recommendations.

Yefeng Liang

Code for this project: https://github.com/AlbertNightwind/Zillow_House_Price_Analysis_Project

Dataset:

https://www.zillow.com/research/data/

https://www.census.gov/data/tables/time-series/demo/popest/2010s-national-total.html

Below are screenshots of the pages where we obtained the two datasets used in this project.

Zillow house price dataset source (https://www.zillow.com/research/data/)

From the website above you can download the newest dataset, which Zillow provides in the same layout. In this article, however, we use data covering January 1996 to July 2020.

Population change dataset from the U.S. Census Bureau (https://www.census.gov/data/tables/time-series/demo/popest/2010s-national-total.html)

Section0 What the data look like

In this project, we consult for house investors and advise on the American regions and major cities they should invest in. We first isolate a group of the most promising regions, then run a forecasting analysis to pinpoint the regions with the best potential. The attributes we use to judge that potential are described later in this article.

Zillow dataset

Because much of the data was collected many years ago, many regions inevitably lack some values. In this section we focus on the time features, which run from column 7 to the last column. Before going further, we plot a graph that summarizes, for each date column, how many regions have no house price data.

Number of regions lack house price data

We drop the date columns that have missing values, which leaves a complete dataset with a price for every region on every remaining date.
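The missing-data check and the column drop described above can be sketched as follows. This is a minimal illustration on a toy table, not the project's actual code: the column names, the values, and the number of identifier columns (`n_id_cols`) are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Zillow table: identifier columns, then monthly prices.
df = pd.DataFrame({
    "RegionID": [1, 2, 3],
    "RegionName": ["A", "B", "C"],
    "State": ["NY", "TX", "CA"],
    "1996-01-31": [np.nan, 90_000.0, np.nan],
    "1996-02-29": [150_000.0, 91_000.0, np.nan],
    "1996-03-31": [151_000.0, 92_000.0, 120_000.0],
})

n_id_cols = 3  # hypothetical; the real table has more identifier columns
date_cols = df.columns[n_id_cols:]

# Number of regions with no price in each month (the quantity in the bar chart).
missing_per_date = df[date_cols].isna().sum()

# Drop every month in which at least one region has no price.
incomplete_cols = missing_per_date[missing_per_date > 0].index
clean = df.drop(columns=incomplete_cols)
```

Dropping columns rather than rows keeps every region in the analysis at the cost of a shorter history, which matches the choice described in the text.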

Processed Zillow dataset

For this DataFrame, we use pd.to_datetime to convert the dates to the format ‘%Y-%m’. To make the visual analysis easier, we import geopandas to read the file ‘states.shp’, which provides the map outline of the United States and can be downloaded from

https://github.com/calbal91/project-ARIMA-modelling/tree/master/Data
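As a rough sketch of the date conversion, the snippet below parses date-like column labels and reduces them to the ‘%Y-%m’ form. The toy prices, the region labels, and the assumption that the raw labels look like "2012-04-30" are illustrative only.

```python
import pandas as pd

# Toy wide table: one row per region, one column per month-end date.
wide = pd.DataFrame(
    {"2012-04-30": [200_000.0, 310_000.0], "2012-05-31": [201_500.0, 312_000.0]},
    index=pd.Index(["A", "B"], name="RegionName"),
)

# Parse the column labels as dates, then keep only year and month.
wide.columns = pd.to_datetime(wide.columns, format="%Y-%m-%d").strftime("%Y-%m")

# Transpose so that each row is one month, and give the rows a datetime index.
by_month = wide.T
by_month.index = pd.to_datetime(by_month.index, format="%Y-%m")
```

With a datetime index in place, each region becomes a column that can be plotted or passed to a time series model directly.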

Section01 State-level Visualization

For each date, we summarize house prices across regions by mean, maximum, and percentile, such as:

House price data by percentage, max, and mean
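A per-date summary of this kind could be computed as in the sketch below. The percentile levels and the toy prices are assumptions; the article does not list the exact levels it used.

```python
import pandas as pd

# Toy table: rows are dates, columns are regions.
prices = pd.DataFrame(
    {
        "A": [100.0, 110.0],
        "B": [200.0, 190.0],
        "C": [300.0, 330.0],
        "D": [400.0, 420.0],
        "E": [500.0, 550.0],
    },
    index=pd.to_datetime(["2012-04-01", "2012-05-01"]),
)

levels = [5, 25, 50, 75, 95]  # assumed percentile levels

# quantile(..., axis=1) works across regions for each date; transpose so
# that dates are rows and percentile levels are columns.
summary = prices.quantile([q / 100 for q in levels], axis=1).T
summary.columns = [f"p{q}" for q in levels]
summary["mean"] = prices.mean(axis=1)
summary["max"] = prices.max(axis=1)
```

Each column of `summary` is then one trend line, and neighbouring percentile columns bound the bands that get filled with color.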

We then plot the house price trends and fill the area between the trend lines with color.

US House percentage price trends

We can draw two conclusions here. First, 2006–04–01 and 2012–04–01 are two turning points: each one starts a long-running house price trend. Because the subprime mortgage crisis began in 2007–04, we use 2007–04–01 rather than 2006–04–01 as the first turning point. Second, prices in the high percentile band (75%-95%) fluctuate more strongly. These two observations are the starting assumptions for the rest of our house price analysis.

With the timeline prepared, we use the U.S. map read from ‘states.shp’ with geopandas to show how each state performed. We define a function that plots the house price growth rate of each state on the map for each period of the timeline, which makes the comparison between states easier to see.

96–07 mean house price growth rate of each state
07–12 mean house price growth rate of each state
12–18 mean house price growth rate of each state
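The growth rates behind these three maps can be sketched as below. The state prices and the exact start and end dates of each period are assumptions chosen for the example; only the three period labels come from the article.

```python
import pandas as pd

# Toy mean house price per state at the period boundaries.
state_mean = pd.DataFrame(
    {
        "TX": [100_000.0, 130_000.0, 125_000.0, 180_000.0],
        "NV": [120_000.0, 300_000.0, 150_000.0, 270_000.0],
    },
    index=pd.to_datetime(["1996-04-01", "2007-04-01", "2012-04-01", "2018-04-01"]),
)

def growth_rate(frame, start, end):
    """Relative price change of each state between two dates."""
    first = frame.loc[pd.Timestamp(start)]
    last = frame.loc[pd.Timestamp(end)]
    return (last - first) / first

# Hypothetical boundary dates for the three periods.
periods = {
    "96-07": ("1996-04-01", "2007-04-01"),
    "07-12": ("2007-04-01", "2012-04-01"),
    "12-18": ("2012-04-01", "2018-04-01"),
}

# One row per state, one column per period.
growth = pd.DataFrame(
    {name: growth_rate(state_mean, start, end) for name, (start, end) in periods.items()}
)
```

Joining a table like `growth` to the GeoDataFrame read from ‘states.shp’ and calling its `plot` method with `column=` set to one of the periods is one way to draw the map.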

Florida, California, Nevada, and Arizona show the largest price swings across these three time periods. Combined with the U.S. house price percentile trends, we can tentatively assume that many of the houses in the top price percentiles are located in these states. They may therefore carry both more risk and more potential gain from price fluctuations.

Section02 State-level population Visualization

Housing demand is an important factor in house prices, and population change is a major driver of that demand. Many other factors also play a role in house price fluctuation, but in this project we assume that a region’s population growth is the main factor influencing house prices in that region.

Here are some graphs about the population.

The first column is the population growth of the whole country, followed by the four main regions. As we can see, the West and the South contribute most of the country’s population growth.

Next, we plot the population growth of each state and Puerto Rico.

The population growth rate in each state from 2010 to 2019

We then plot population growth on the U.S. map. Compared with the map of house price growth from 2012 to 2018, Arizona, Utah, Nevada, Colorado, and Florida show both significant population growth and significant house price growth over roughly the last decade. Texas, however, has high population growth (its population is the second largest in the U.S.) but only mild house price growth. This looks like a positive signal for long-term investing or buying houses in Texas.

Section03 Region-level Visualization

With the 2010–2019 population change and the house price increase of each state over a similar period in hand, this section focuses on house price changes at the region level. First, we define a column called “growth region”. A region gets the value 1 in this column if its house price increase is above the average in both 2007–2012 and 2012–2018, and 0 otherwise. We then plot these 1s and 0s.
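The “growth region” rule can be written in a few lines. The sketch below uses made-up growth figures and hypothetical column names (`growth_07_12`, `growth_12_18`) to show the logic.

```python
import pandas as pd

# Toy growth rates per region for the two periods.
regions = pd.DataFrame(
    {
        "growth_07_12": [0.10, -0.20, 0.05, -0.30],
        "growth_12_18": [0.60, 0.50, 0.20, 0.10],
    },
    index=pd.Index(["A", "B", "C", "D"], name="RegionName"),
)

# A region must beat the average of all regions in BOTH periods.
above_07_12 = regions["growth_07_12"] > regions["growth_07_12"].mean()
above_12_18 = regions["growth_12_18"] > regions["growth_12_18"].mean()

# 1 marks a growth region, 0 marks everything else.
regions["growth_region"] = (above_07_12 & above_12_18).astype(int)
```

In this toy data only region A is above average in both periods, so it is the only one flagged with 1.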

As the graphs below show, the largest regions make up the biggest share of “growth regions”, while among the remaining regions the different size classes appear in roughly equal shares. Many of the largest regions belong to the top 100 metropolitan areas in the U.S.

Looking at the distribution of regions by state, Texas, Colorado, Tennessee, and California have the largest numbers of “growth regions”, while Utah, Colorado, North Dakota, Texas, and Tennessee have the highest proportions of their regions classed as “growth regions”. All of these states had significant population increases from 2010 to 2019. Texas and Colorado stand out, ranking near the top nationally in both the number and the rate of “growth regions”. In other words, many regions in Texas and Colorado saw above-average house price increases from 2007 to 2018, which can be considered a steady increase over more than 10 years.

Growth regions in each state
Growth regions’ rates in each state
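The two state-level graphs correspond to a count and a rate of flagged regions per state. A minimal sketch, with toy data and hypothetical column names, could look like this:

```python
import pandas as pd

# Toy region table with the 0/1 growth-region flag already computed.
regions = pd.DataFrame({
    "State": ["TX", "TX", "TX", "CO", "CO", "FL"],
    "growth_region": [1, 1, 0, 1, 0, 0],
})

# Sum of the flag = number of growth regions; mean of the flag = their rate.
by_state = (
    regions.groupby("State")["growth_region"]
    .agg(n_growth="sum", rate="mean")
    .sort_values("n_growth", ascending=False)
)
```

Looking at both columns matters: a large state can have many growth regions yet a low rate, while a small state can have a high rate from only a handful of regions.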

A large population increase together with a steady house price increase seems reasonable. For investors, however, this result is still not targeted enough: knowing which states have many growing regions is not sufficient, because they need to know which specific regions in those states are worth investing in. In addition, states with a modest overall performance may still contain valuable regions whose house prices have risen well, and two long time periods are too coarse for analyzing house price increases.

Section04 Statistic model analysis visualization

For every growth region’s ID, we use an ARIMA model to predict house prices. We define a function that returns an ARIMA DataFrame containing the predicted prices together with the upper and lower bounds of the prediction.

The DataFrame we got from ARIMA model

We then add columns for some attributes of the ARIMA model, along with the absolute and percentage increases over 5 and 10 years.

The completed DataFrame we got from ARIMA model
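The 5-year and 10-year increase columns can be derived from a monthly forecast as sketched below. The helper `horizon_gain`, the starting price, and the linear toy forecast are hypothetical and stand in for real ARIMA output.

```python
import pandas as pd

def horizon_gain(current_price, forecast, months):
    """Absolute and percentage increase after `months` forecast steps."""
    future_price = forecast.iloc[months - 1]
    absolute = future_price - current_price
    return absolute, absolute / current_price

current_price = 200_000.0
# Toy monthly forecast for 120 months, rising by 1,000 per month.
forecast = pd.Series([current_price + 1_000.0 * (i + 1) for i in range(120)])

abs_5y, pct_5y = horizon_gain(current_price, forecast, 60)
abs_10y, pct_10y = horizon_gain(current_price, forecast, 120)
```

Keeping both the absolute and the percentage figure is useful because expensive regions can show a large absolute gain with only a modest percentage gain.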

The latest general upward trend in house prices started at the beginning of 2012 and continues to the present. We therefore use the data from April 30, 2012 onward as the training and testing dataset (split 0.8 and 0.2) for the ARIMA model.

We first use ARIMA from the “statsmodels.tsa.arima_model” package to train the model on the training dataset, then use the trained model to predict house prices on the testing dataset. The data as a whole starts from 2012–04–30, and the testing dataset is the final 20% of it. We use RMSE as our accuracy measure: if the RMSE is smaller than the threshold we defined for this project (10000000), we treat the model as our ‘fitted model’ (best model) and store it in a column of the ARIMA DataFrame.
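The evaluation loop around the model (chronological 80/20 split, RMSE, threshold check) can be sketched independently of the ARIMA fit itself. In the sketch below a naive "repeat the last training value" forecast is used purely as a stand-in for the ARIMA prediction, and the price series is synthetic; only the 0.8/0.2 split and the threshold value come from the article.

```python
import numpy as np
import pandas as pd

RMSE_LIMIT = 10_000_000  # threshold quoted in the article

def chronological_split(series, train_fraction=0.8):
    """Split a time series in order: first part for training, rest for testing."""
    cut = int(len(series) * train_fraction)
    return series.iloc[:cut], series.iloc[cut:]

def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    errors = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))

# Synthetic monthly prices rising by 1,000 per month.
prices = pd.Series(
    np.linspace(200_000.0, 299_000.0, 100),
    index=pd.date_range("2012-04-01", periods=100, freq="MS"),
)

train, test = chronological_split(prices)

# Stand-in forecast (NOT ARIMA): repeat the last training price.
predicted = np.full(len(test), train.iloc[-1])

score = rmse(test, predicted)
is_fitted = score < RMSE_LIMIT
```

The split must stay chronological, because shuffling a time series would let the model see future prices during training. Note also that a threshold of 10000000 is far larger than typical house price errors, so in practice it accepts almost any model.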

The model’s forecast results are list-like and contain several forecasted variables, including the lower and upper bounds of the predicted values. We plot these two bounds and fill the area between them with color. The larger the filled area, the larger the gap between the upper and lower predicted values, and the more predicted “risk” the price trend carries, because the predicted prices spread more widely over time.
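The "risk" read off the filled area can also be expressed as a number. The sketch below computes the gap between the bounds at each step and averages it; using the mean gap as the summary is an assumption, and the bound values are invented.

```python
import pandas as pd

# Toy forecast with lower and upper bounds for three future months.
forecast = pd.DataFrame({
    "lower": [190_000.0, 185_000.0, 180_000.0],
    "predicted": [200_000.0, 205_000.0, 210_000.0],
    "upper": [210_000.0, 225_000.0, 240_000.0],
})

# Gap between the bounds at each step; it widens as the horizon grows.
forecast["width"] = forecast["upper"] - forecast["lower"]

# One number per region (assumed summary): the average gap over the horizon.
mean_width = forecast["width"].mean()
```

For the plot itself, matplotlib's `fill_between` called with the dates and the two bound columns shades the same band.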

Section05 Final consulting results

First, we sort the top 10 regions with the best 10-year growth.

Regions with the highest 10-year increase

Next, we identify the “most valuable” regions for our investors: regions with a better 10-year increase and lower predicted “risk”. Two measures are used here: 10-year growth, and forecast width, which is the gap between the upper and lower predicted values. If a region has 10-year growth above the average and a forecast width below the average, we treat it as one of the “most valuable” regions to invest in. We then sort these regions by 10-year growth to get the top 10 “most valuable” regions by 10-year growth.

We also sort these “most valuable” regions by forecast width to get the top 10 “most valuable” regions with the lowest risk.

Most valuable regions to invest in (The most 10-year growth)
Most valuable regions to invest in (The lowest risks)
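The selection behind these two rankings can be sketched as follows. The sketch assumes the filter keeps regions whose 10-year growth is above the average and whose forecast width is below the average, which matches the stated goal of better growth with lower risk. The figures and the column names `growth_10y` and `width` are invented.

```python
import pandas as pd

# Toy candidates: 10-year growth and forecast width per region.
candidates = pd.DataFrame(
    {
        "growth_10y": [0.90, 0.70, 0.30, 0.20, 0.65],
        "width": [50_000.0, 20_000.0, 10_000.0, 60_000.0, 15_000.0],
    },
    index=pd.Index(["A", "B", "C", "D", "E"], name="RegionName"),
)

# Assumed rule: growth above the average AND width below the average.
high_growth = candidates["growth_10y"] > candidates["growth_10y"].mean()
low_risk = candidates["width"] < candidates["width"].mean()
valuable = candidates[high_growth & low_risk]

# Two rankings of the same shortlist.
top_by_growth = valuable.sort_values("growth_10y", ascending=False).head(10)
top_by_low_risk = valuable.sort_values("width", ascending=True).head(10)
```

Region A has the highest growth in this toy data but is excluded because its forecast band is wider than average, which shows how the filter trades some return for lower risk.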

Here are more details about the top 10 most valuable investment regions, indexed by their Zip codes.

Details about most valuable regions by 10-year growth
Details about most valuable regions by low risks

Section 06: Conclusions

The LA metropolitan area accounts for 9 of the top 10 most valuable regions by 10-year growth. We recommend that investors who want higher returns from real estate look closely at the LA metropolitan area, even though these regions carry some risk. Considering both price increase and risk, house prices in the LA area are very attractive. For investors who want to look beyond California and Los Angeles, the Dallas metropolitan area is another good choice.

If you prefer a more stable investment plan, we recommend places in states such as Texas, Colorado, and Utah, for example Austin and Houston. This result is consistent with the results from Section 03. Texas, Utah, and Colorado are states that people have been moving to over the last decade. These states are therefore the recommendations of this project, based on their lower risk and much higher potential for price increases over the next few years.

I am currently a graduate student majoring in Computer Science. I am interested in Data Science, and I plan to pursue a PhD in the future.