Inside Airbnb Boston 2021: Building a Machine Learning Price Predictor model for Boston Listings
What is the article about?
In this post, I run a data mining project on the Airbnb Boston Listings 2021 dataset, following the CRISP-DM standard. I build a listing price estimator with Machine Learning and answer some business questions related to the dataset.
Why should the reader learn this topic?
- For people interested in how to build a price estimator or predictor model.
- For Airbnb users (hosts, guests, travelers) who want to better understand how to price a room, compare and select rooms within Boston, or even negotiate with hosts.
- For agencies that want to start a Boston-based Airbnb business.
- For real estate professionals who want to discover further patterns in the Boston property market.
What are the prerequisites?
This post is intended as a non-technical overview.
To keep it simple, I have abstracted away most of the technical parts.
The project code is hosted on GitHub, where you can check out the full Jupyter notebook:
GitHub - gilzero/airbnb_boston_2021
Data can be found here:
Inside Airbnb. Adding data to the debate.
Inside Airbnb is a mission driven activist project with the objective to: Provide data that quantifies the impact of…
Let’s get started.
To better organize the structure of this post, we follow this cycle.
- Business understanding
- Data Understanding
- Data Preparation
Let’s define the business questions we want to answer:
1) What are the most popular property types in Boston listings 2021?
2) What are the most popular / in-demand neighborhoods in Boston listings 2021?
3) What are the standard amenities hosts are offering in Boston listings 2021?
4) How are prices distributed in Boston listings 2021?
5) Boston Pricing Model. What are some top factors that contribute to price?
For example, suppose you were a property investor. By understanding these questions, you might be able to develop a strategy: offer rooms in high-demand neighborhoods with popular property types, make sure the essential amenities are provided, and set a reasonable price checked against the model, all with the aim of a better Airbnb rental income.
COVID-19 has changed the tourism and rental industries dramatically. Things are now much more dynamic, so I use the latest data available. At the time of writing, that is the September 2021 scrape published by Inside Airbnb.
I chose not to combine it with historical Boston data, especially pre-COVID-19 data.
The dataset has 3,123 listings and 74 variables.
There are some missing values in variables like
‘neighbourhood_group_cleansed’, ‘bathrooms’, ‘calendar_updated’, ‘license’, ‘neighborhood_overview’, ‘neighbourhood’, ‘host_about’, ‘review_scores_value’, ‘beds’, ‘bedrooms’…
Note that Airbnb does not encode the number of bathrooms numerically, so the ‘bathrooms’ column is entirely empty. Instead, the information is stored as text in another column, ‘bathrooms_text’, which acts as a categorical variable with values such as: ‘1 bath’, ‘1.5 shared bath’, ‘half private bath’
I drop columns that contain a significant amount of missing/empty values (e.g. > 75% missing).
For categorical columns, I fill missing values with the most common value in that column.
For most numeric columns, I fill missing values with the median rather than the mean. (For example, we can’t have 1.37 beds, so the median, which is more likely to be a whole number, is the better choice.) For a missing review rating score, it is more meaningful to fill with 0 as an initial value; assigning the ‘4.xx’ average rating to a listing that has no rating would probably not be fair.
So filling missing/empty data is always a case-by-case decision that depends on what the variable in that column represents.
Developing a separate fillna strategy for different variables helps improve the model we build later.
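The rules above could be sketched roughly as follows with pandas. This is a minimal illustration, not the notebook's actual code: the helper name `fill_missing`, the toy values, and the choice of `room_type` as the categorical example are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for the listings data (illustrative values only).
df = pd.DataFrame({
    "license": [np.nan, np.nan, np.nan, np.nan, np.nan, "STR-1"],  # mostly empty
    "room_type": ["Entire home/apt", None, "Private room",
                  "Entire home/apt", "Entire home/apt", "Private room"],
    "beds": [1.0, 2.0, np.nan, 4.0, 1.0, 2.0],
    "review_scores_rating": [4.8, np.nan, 4.5, np.nan, 5.0, 4.9],
})


def fill_missing(frame, drop_threshold=0.75):
    """Hypothetical helper applying the case-by-case rules described above."""
    out = frame.copy()

    # 1) Drop columns where more than `drop_threshold` of the values are missing.
    missing_share = out.isna().mean()
    out = out.drop(columns=missing_share[missing_share > drop_threshold].index)

    # 2) A missing review score becomes 0 instead of an average rating.
    if "review_scores_rating" in out.columns:
        out["review_scores_rating"] = out["review_scores_rating"].fillna(0)

    # 3) Remaining numeric columns get the column median.
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())

    # 4) Categorical columns get their most common value.
    for col in out.select_dtypes(exclude="number").columns:
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    return out


clean = fill_missing(df)
```

The review score is handled before the generic median rule on purpose; otherwise the median step would fill it first and the 0 rule would never apply.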
The dependent variable ‘price’ is stored in U.S. locale money format, e.g. ‘$1,234.00’. Before any computation, we need to convert it to a pure numeric format: ‘$1,234.00’ => 1234.00
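One way to do this conversion with pandas is shown below. It is a small sketch; the function name `parse_price` is made up for the example, and the notebook may do it differently.

```python
import pandas as pd


def parse_price(series):
    """Convert strings such as '$1,234.00' to floats such as 1234.0."""
    # Strip the dollar sign and the thousands separator, then parse as a number.
    cleaned = series.str.replace(r"[$,]", "", regex=True)
    # errors="coerce" turns anything unparseable into NaN instead of raising.
    return pd.to_numeric(cleaned, errors="coerce")


prices = pd.Series(["$1,234.00", "$85.00", "$250.00"])
numeric_prices = parse_price(prices)
print(numeric_prices.tolist())  # [1234.0, 85.0, 250.0]
```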
There are quite a few categorical variables. One strategy is One-Hot Encoding: we generate a new dummy column for each value of a variable and encode it with 1 or 0 to mark whether that value applies.
- Subset the numeric columns.
- Subset the categorical columns and flatten them into a one-hot set.
- Recombine the numeric set and the encoded categorical set.
(Image made with OmniGraffle)
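The three steps can be sketched with `pandas.get_dummies`. The toy rows below are invented for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Made-up rows; the real dataset has far more columns and values.
df = pd.DataFrame({
    "accommodates": [2, 4, 1],
    "beds": [1, 2, 1],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room"],
    "bathrooms_text": ["1 bath", "2 baths", "1 shared bath"],
})

# Step 1: subset the numeric columns.
numeric_part = df.select_dtypes(include="number")

# Step 2: subset the categorical columns and flatten them into 0/1 dummy columns.
categorical_part = df.select_dtypes(exclude="number")
encoded_part = pd.get_dummies(categorical_part, dtype=int)

# Step 3: recombine both parts column-wise.
features = pd.concat([numeric_part, encoded_part], axis=1)
print(features.columns.tolist())
```

Each distinct text value becomes its own column (for example `room_type_Private room`), which is why the column count grows quickly on the real data.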
Now we have a ready-to-roll cleaned dataset to feed to Machine Learning.
Now we come to the exciting part: Modeling, Mining.
Before we start building the model, let me quickly answer the questions above.
Popular neighborhoods for Boston Listings 2021
- Dorchester 418
- Downtown 293
- Roxbury 263
- South End 235
- Brighton 232
- Jamaica Plain 215
- Back Bay 195
- East Boston 180
- Allston 158
- Beacon Hill 137
- South Boston 135
- Fenway 110
- North End 80
- Charlestown 70
- Roslindale 56
- Mission Hill 55
- Bay Village 50
- South Boston Waterfront 47
- Chinatown 43
- West End 38
(“Location, Location, Location”)
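A ranking like the one above is a simple frequency count. Here is a minimal sketch, assuming the neighborhood is read from the ‘neighbourhood_cleansed’ column; the rows are made up.

```python
import pandas as pd

# Made-up rows standing in for the listings table.
listings = pd.DataFrame({
    "neighbourhood_cleansed": [
        "Dorchester", "Downtown", "Dorchester",
        "Roxbury", "Dorchester", "Downtown",
    ]
})

# Count listings per neighborhood, most frequent first, and keep the top 20.
top_neighbourhoods = listings["neighbourhood_cleansed"].value_counts().head(20)
print(top_neighbourhoods)
```

The same call on the ‘property_type’ column gives a property type ranking.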
Popular (high demand) property types for Boston Listings 2021
- Entire rental unit 1283
- Private room in rental unit 526
- Private room in residential home 378
- Entire condominium (condo) 264
- Entire serviced apartment 184
- Entire residential home 126
- Private room in condominium (condo) 71
- Entire guest suite 47
- Private room in townhouse 38
- Room in boutique hotel 36
(Entire units and private rooms of all kinds are in top demand. Entire units seem to be especially popular in Boston, while other cities favor rooms more.)
Common Essential Amenities
- Smoke alarm 3017
- Wifi 2996
- Long term stays allowed 2960
- Carbon monoxide alarm 2813
- Heating 2806
- Essentials 2804
- Kitchen 2795
- Hangers 2728
- Hair dryer 2581
- Air conditioning 2553
- Iron 2505
- Shampoo 2387
- Hot water 2291
- Microwave 2172
- Refrigerator 2151
- Washer 2097
- Dedicated workspace 2096
- Dryer 2048
- Coffee maker 1991
- Dishes and silverware 1916
(The safety-related ones are at the top: Smoke alarm and Carbon monoxide alarm. Of course, safety first, as always. Wifi, Long-Term Stays, and Heating follow, then Air Conditioning, etc. I visited Boston two years ago, and I remember how freezing it was in the winter! Anyway, if you plan to be an Airbnb host in Boston, it might be worth double-checking your standard room equipment against this list.)
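Counting amenities takes one extra step because each listing holds a whole list of them. The sketch below assumes the list is stored in an ‘amenities’ column as a JSON-style string such as `["Wifi", "Kitchen"]`; both the column name and the format are assumptions here, so check them against the actual file.

```python
import json
from collections import Counter

import pandas as pd

# Made-up rows; the column name and JSON-list format are assumptions.
listings = pd.DataFrame({
    "amenities": [
        '["Wifi", "Smoke alarm", "Heating"]',
        '["Wifi", "Kitchen"]',
        '["Smoke alarm", "Wifi"]',
    ]
})


def count_amenities(series):
    """Count how many listings mention each amenity."""
    counts = Counter()
    for raw in series.dropna():
        # Each cell is parsed into a Python list of amenity names.
        counts.update(json.loads(raw))
    return counts


amenity_counts = count_amenities(listings["amenities"])
print(amenity_counts.most_common(2))  # [('Wifi', 3), ('Smoke alarm', 2)]
```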
Distribution of Boston Airbnb Listing Price 2021
- The distribution is right-skewed.
- $60-$80 is the most common price range.
- Overall, $80-$160 is popular.
- $180-$220 is also quite common, since Boston is a relatively expensive city.
- The number of listings drops significantly above $220.
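Observations like these can be read off a binned count of prices. Below is a minimal sketch with made-up prices and an assumed band width of $20; the original chart may have used different bins.

```python
import pandas as pd

# Made-up prices, already converted to numbers.
prices = pd.Series([65.0, 72.0, 79.0, 95.0, 150.0, 199.0, 450.0])

# Band edges 0, 20, 40, ..., 800; right=False makes each band [low, high).
edges = list(range(0, 820, 20))
price_bands = pd.cut(prices, bins=edges, right=False)

# Listings per band, in price order, and the band holding the most listings.
band_counts = price_bands.value_counts().sort_index()
most_common_band = band_counts.idxmax()

# A positive skewness value indicates a right-skewed distribution.
skewness = prices.skew()
print(most_common_band, skewness > 0)
```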
Now, we enter the dragon: Machine Learning.
I drop the listings priced above $800, since they are rare among Boston listings, and treat them as outliers and exceptional cases. Otherwise, these extreme values would distort the fit of the model.
“Special cases aren’t special enough to break the rules.” (Zen of Python)
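In pandas, this cut is a one-line filter. The sketch uses made-up prices and a constant named `PRICE_CAP`, which is just a name chosen for the example.

```python
import pandas as pd

PRICE_CAP = 800  # listings priced above this are treated as outliers

# Made-up prices, already numeric.
listings = pd.DataFrame({"price": [85.0, 250.0, 1500.0, 800.0, 9999.0]})

# Keep only the rows at or below the cap and renumber the index.
trimmed = listings[listings["price"] <= PRICE_CAP].reset_index(drop=True)
print(trimmed["price"].tolist())  # [85.0, 250.0, 800.0]
```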
We then need to split data into
- Independent Variables (All the variables that would impact Dependent Variable)
- Dependent Variable (In this case, the ‘price’)
For independent variables, we merge the numeric subset and categorical subset (One-Hot encoded).
For feature selection, I use only part of the numeric subset, e.g.:
‘accommodates’, ‘bedrooms’, ‘beds’, ‘number_of_reviews’, ‘review_scores_rating’. Several variables relate to reviews, for example, so I picked a single overall one for the review rating.
For the categorical variables, I selected ‘host_is_superhost’, ‘host_identity_verified’, ‘neighbourhood_cleansed’, ‘property_type’, ‘room_type’, ‘bathrooms_text’ (encoded versions).
(The column count grows considerably once all categorical variables are flattened. If we used all of them for training, the model might overfit. So to keep it in check, I only selected the encoded categorical variables above.)
I chose a Linear Regression algorithm for this case. The actual fitting is done with LinearRegression from sklearn.linear_model.
(For details of project code and libraries used, you can check my code of this project in GitHub)
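For readers who want to see the shape of this step, here is a minimal sketch of a train/test split and fit with scikit-learn. The data is synthetic and generated on the spot, and the 30% test share and the random seed are assumptions, so the scores it prints say nothing about the real Boston model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: three features and a price built from them plus noise.
rng = np.random.default_rng(42)
n = 200
X = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n),
    "bedrooms": rng.integers(1, 5, size=n),
    "room_type_Private room": rng.integers(0, 2, size=n),
})
y = (
    40
    + 25 * X["accommodates"]
    + 30 * X["bedrooms"]
    - 35 * X["room_type_Private room"]
    + rng.normal(0, 20, size=n)
)

# Hold out part of the data so the model is scored on rows it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# R-squared on the training rows and on the held-out rows.
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
print(round(train_r2, 2), round(test_r2, 2))
```

Comparing the two scores is a quick overfitting check: a train score far above the test score would suggest the model memorized the training rows.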
After fitting, we have a trained model.
- R-squared for train data score: 0.62
- R-squared for test data score: 0.60
The model explains ~60% of the variance in price. It could serve as a bare-bones working model in real life.
However, for more reliable predictions (aiming for R-squared > 0.7), I might need to train on more Boston data and experiment with different feature selection combinations. (The model used about 3,000 instances/rows after cleaning and one-hot encoding the categorical variables.) I might aim for 6,000 working (cleaned) samples to train on.
Below are the leading coefficients from the model.
This table shows some factors to consider when estimating Boston Airbnb prices. Most of the top entries are bathroom counts, neighborhoods, and property types. Interestingly enough, bathroom counts appear in the leading positions more often than the others. This might suggest that people place a high value on bathroom quality and having their own bathroom when considering an Airbnb option in Boston.
Anyway, the coefficients serve only as a reference. Remember: ‘Correlation does not imply causation’.
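A ranking of leading coefficients can be produced by sorting the fitted model's `coef_` values by absolute size. The sketch below uses a tiny made-up dataset with generic feature names, and `rank_coefficients` is a hypothetical helper, not a function from the notebook.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression


def rank_coefficients(model, feature_names):
    """Pair each feature with its coefficient, largest absolute value first."""
    table = pd.DataFrame({"feature": feature_names, "coefficient": model.coef_})
    table["abs_coefficient"] = table["coefficient"].abs()
    return table.sort_values("abs_coefficient", ascending=False).reset_index(drop=True)


# Made-up data where the target is an exact linear function of two features.
X = pd.DataFrame({
    "feature_a": [0, 1, 2, 3, 4, 5],
    "feature_b": [1, 0, 1, 0, 1, 1],
})
y = 10 * X["feature_a"] - 30 * X["feature_b"] + 5

model = LinearRegression().fit(X, y)
ranking = rank_coefficients(model, X.columns)
print(ranking)
```

Sorting by absolute value keeps strongly negative factors near the top too. Keep in mind that coefficients are only comparable across features measured on similar scales, such as 0/1 dummy columns.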
Boom! We have the first version of the Boston Airbnb Price Predictor model ready.
We can continuously improve the model by feeding more accurate data and experimenting with more feature selection and cleaning strategies, cycling the loop of CRISP-DM to enhance your business.
You can apply the same approach to other cities you favor,
e.g., Airbnb in New York, Tokyo, or Shanghai, and compare the nuances.
I haven’t deployed the model to the cloud yet. However, the model was tested and run locally. All the technical parts and the complete code are available on my GitHub.
For a deployment like this one, I would recommend putting it on a serverless FaaS for computing. FaaS is my recent favorite for deployment. AWS Lambda could be a choice.
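To give an idea of what such a function could look like, here is a rough sketch of a Python Lambda handler. Nothing here has been deployed: the intercept and coefficients are hypothetical placeholders rather than the fitted model's values, and the event shape (a JSON string under "body", as an HTTP front end would typically pass it) is an assumption.

```python
import json

# Hypothetical placeholder values, NOT the coefficients of the fitted model.
INTERCEPT = 50.0
COEFFICIENTS = {"accommodates": 20.0, "bedrooms": 15.0}


def predict_price(features):
    """Apply the linear formula: intercept + sum(coefficient * feature value)."""
    return INTERCEPT + sum(
        weight * float(features.get(name, 0))
        for name, weight in COEFFICIENTS.items()
    )


def lambda_handler(event, context):
    """Entry point: read features from the request body, return a price."""
    features = json.loads(event.get("body") or "{}")
    return {
        "statusCode": 200,
        "body": json.dumps({"predicted_price": predict_price(features)}),
    }


# Local call with a fake event, no cloud account needed.
response = lambda_handler(
    {"body": json.dumps({"accommodates": 2, "bedrooms": 1})}, None
)
print(response["body"])  # {"predicted_price": 105.0}
```

Since a linear model is just an intercept plus weighted features, the function can carry the numbers directly and stay free of heavy dependencies, which keeps the deployment package small.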
Thanks for reading.
We have walked through a CRISP-DM cycle from understanding a business problem to running a real-life model by mining a real dataset.
I am certainly not satisfied with 0.6 R-squared. It would be an excellent opportunity to improve this model when I get more data and try more experiments.
If you have any questions, feel free to contact me.
Alan Weiming Chen / Oct 16th 2021