Inside Airbnb Boston 2021: Building a Machine Learning Price Predictor model for Boston Listings

8 min read · Oct 16, 2021

What is the article about?

In this post, I carry out a data mining project on the Airbnb Boston Listings 2021 dataset, following the CRISP-DM process. I build a listing price estimator with Machine Learning and answer some business questions about the dataset.

Why should the reader learn this topic?

For people who are interested in how to build a price estimator or predictor model.
For Airbnb users (hosts, guests, travelers) who want to better understand how to price a room, compare and select rooms within Boston, or even negotiate with hosts.
For agencies that want to start a Boston-based Airbnb business.
For real estate professionals who want to discover further patterns in the Boston property market.

What are the prerequisites?

This post is intended as a non-technical overview.

To keep it simple, I abstract away the technical parts from this post.

The project code is hosted on GitHub, where you can check out the full Jupyter notebook:

GitHub - gilzero/airbnb_boston_2021

github.com

Data can be found here:

Inside Airbnb. Adding data to the debate.

Inside Airbnb is a mission driven activist project with the objective to: Provide data that quantifies the impact of…

insideairbnb.com

Let’s get started.

CRISP-DM

To better organize the structure of this post, we follow this cycle.

Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment

Business Understanding

Let’s define the questions we want to answer:

1) What are the most popular property types in Boston listings 2021?

2) What are the most popular / in-demand neighborhoods in Boston listings 2021?

3) What are the standard amenities hosts offer in Boston listings 2021?

4) How are prices distributed in Boston listings 2021?

5) Boston Pricing Model. What are some top factors that contribute to price?

For example, if you were a property investor, answering these questions could help you develop a strategy: offer rooms in high-demand neighborhoods with popular property types, make sure the essential amenities are provided, and set a reasonable price checked against the model, all with the aim of a better Airbnb rental income.

Data Understanding

Covid-19 has changed the tourism and rental industries dramatically. Things have changed a lot and are more dynamic, so I tried to use the latest data available. At the time of writing, that is the September 2021 scrape published by Inside Airbnb.

I chose not to combine it with earlier Boston data, especially pre-Covid-19 data.

Data Preparation

The dataset has 3123 listings (rows) and 74 variables (columns).

There are some missing values in variables like

‘neighbourhood_group_cleansed’, ‘bathrooms’, ‘calendar_updated’, ‘license’, ‘neighborhood_overview’, ‘neighbourhood’, ‘host_about’, ‘review_scores_value’, ‘beds’, ‘bedrooms’…

Note that the dataset does not store the number of bathrooms numerically: the ‘bathrooms’ column is entirely empty. Instead, the information is stored as text in another column, ‘bathrooms_text’, which I treat as a categorical variable, with values such as ‘1 bath’, ‘1.5 shared bath’, ‘half private bath’.

I drop columns that contain a significant amount of missing/empty values (e.g. more than 75% missing).
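As a rough illustration of that dropping rule, here is a minimal pandas sketch. The tiny table and its values (including the license string) are made up for the example, and the variable names are my own, not necessarily those used in the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table; all values are invented for illustration.
listings = pd.DataFrame({
    "bathrooms": [np.nan, np.nan, np.nan, np.nan],   # 100% missing
    "license": [np.nan, np.nan, np.nan, "STR-1"],    # exactly 75% missing
    "beds": [1.0, 2.0, np.nan, 1.0],                 # 25% missing
    "price": ["$80.00", "$120.00", "$95.00", "$1,234.00"],
})

MISSING_THRESHOLD = 0.75  # drop columns with MORE than 75% missing values

# Fraction of missing values per column.
missing_share = listings.isna().mean()

# Columns above the threshold are removed; everything else is kept.
columns_to_drop = missing_share[missing_share > MISSING_THRESHOLD].index
cleaned = listings.drop(columns=columns_to_drop)

print(list(cleaned.columns))
```

With a strict "greater than" comparison, a column that is exactly 75% empty is kept; whether the boundary should be inclusive is a judgment call.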

For categorical columns, I fill missing values with the most common value in that column.

For some numeric columns, I fill missing values with the median rather than the mean. (For example, a listing can’t have 1.37 beds, so the median, which is typically a whole number here, is the better choice.) For a missing review rating score, it is more meaningful to fill with 0 as an initial value; it would probably not be fair to assign an average rating of ‘4.xx’ to a listing that has no rating.

So filling missing/empty data is always case by case: it depends on the attributes of the column and on what the variable represents.

Developing a fillna strategy for each variable helps to improve the model we build later.
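The three fill rules described above could look roughly like this in pandas. This is only a sketch on invented values; the column names come from the dataset, but which rule applies to which column in the actual notebook may differ.

```python
import numpy as np
import pandas as pd

# Invented sample rows, just to show the three fill rules side by side.
df = pd.DataFrame({
    "host_is_superhost": ["t", "f", None, "f"],
    "beds": [1.0, 2.0, np.nan, 3.0],
    "review_scores_rating": [4.8, np.nan, 4.5, np.nan],
})

# Categorical column: fill with the most common value (the mode).
most_common = df["host_is_superhost"].mode()[0]
df["host_is_superhost"] = df["host_is_superhost"].fillna(most_common)

# Count-like numeric column: fill with the median rather than the mean.
df["beds"] = df["beds"].fillna(df["beds"].median())

# Review score: a listing without a rating gets 0, not an average score.
df["review_scores_rating"] = df["review_scores_rating"].fillna(0)

print(df)
```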

The dependent variable ‘price’ is stored as text in U.S. currency format, e.g. ‘$1,234.00’. Before any computation, we need to convert it to a plain number: ‘$1,234.00’ => 1234.00
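One way to do this conversion in pandas is to strip the dollar sign and thousands separators and then cast to float. The sample values below are invented; this is a sketch, not necessarily the exact code in the notebook.

```python
import pandas as pd

# Invented sample of the 'price' column as it appears in the raw data.
raw_prices = pd.Series(["$1,234.00", "$80.00", "$150.50"])

# Remove '$' and ',' characters, then convert the remaining text to float.
numeric_prices = raw_prices.str.replace(r"[$,]", "", regex=True).astype(float)

print(numeric_prices.tolist())
```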

There are quite a few categorical variables. One strategy is One-Hot Encoding: for each value of a categorical variable we generate a new dummy column, encoded as 1 or 0 to mark whether the listing has that value.

Take the numeric subset of the dataset.
Take the categorical subset and flatten it into a one-hot encoded set.
Re-combine the numeric set and the encoded categorical set.
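The three steps above can be sketched with pandas as follows. The rows are invented, and using `pd.get_dummies` for the encoding is my assumption about a reasonable way to do it.

```python
import pandas as pd

# Invented sample rows with two numeric and two categorical columns.
df = pd.DataFrame({
    "accommodates": [2, 4, 1],
    "beds": [1.0, 2.0, 1.0],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room"],
    "neighbourhood_cleansed": ["Dorchester", "Downtown", "Roxbury"],
})

# Step 1: numeric subset.
numeric_part = df.select_dtypes(include="number")

# Step 2: categorical subset, flattened into 0/1 dummy columns.
categorical_part = df.select_dtypes(exclude="number")
encoded_part = pd.get_dummies(categorical_part, dtype=int)

# Step 3: re-combine both parts column-wise.
features = pd.concat([numeric_part, encoded_part], axis=1)

print(features.columns.tolist())
```

Each categorical column becomes one new column per distinct value, which is why the number of columns grows quickly on the full dataset.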

Flow:

(Image made with OmniGraffle)

Now we have a ready-to-roll cleaned dataset to feed to Machine Learning.

Modelling

Now we come to the exciting part: Modeling, Mining.

Before we start building the model, let me quickly answer the questions above.

Popular neighborhoods for Boston Listings 2021

Dorchester 418
Downtown 293
Roxbury 263
South End 235
Brighton 232
Jamaica Plain 215
Back Bay 195
East Boston 180
Allston 158
Beacon Hill 137
South Boston 135
Fenway 110
North End 80
Charlestown 70
Roslindale 56
Mission Hill 55
Bay Village 50
South Boston Waterfront 47
Chinatown 43
West End 38

(“Location, Location, Location”)
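A ranking like the one above is essentially a count of listings per neighborhood. A minimal sketch, assuming the neighborhood is taken from the ‘neighbourhood_cleansed’ column (the rows here are invented):

```python
import pandas as pd

# Invented sample; the real table has one row per listing.
listings = pd.DataFrame({
    "neighbourhood_cleansed": [
        "Dorchester", "Downtown", "Dorchester",
        "Roxbury", "Dorchester", "Downtown",
    ]
})

# Count listings per neighborhood, most frequent first, and keep the top 2.
top_neighbourhoods = listings["neighbourhood_cleansed"].value_counts().head(2)

print(top_neighbourhoods)
```

The same call on ‘property_type’ would give a property type ranking.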

---

Popular (high demand) property types for Boston Listings 2021

Entire rental unit 1283
Private room in rental unit 526
Private room in residential home 378
Entire condominium (condo) 264
Entire serviced apartment 184
Entire residential home 126
Private room in condominium (condo) 71
Entire guest suite 47
Private room in townhouse 38
Room in boutique hotel 36

(Entire units and private rooms of all kinds are in top demand. It seems that entire units are pretty popular in Boston, whereas some other cities may favor rooms more.)

---

Common Essential Amenities

Smoke alarm 3017
Wifi 2996
Long term stays allowed 2960
Carbon monoxide alarm 2813
Heating 2806
Essentials 2804
Kitchen 2795
Hangers 2728
Hair dryer 2581
Air conditioning 2553
Iron 2505
Shampoo 2387
Hot water 2291
Microwave 2172
Refrigerator 2151
Washer 2097
Dedicated workspace 2096
Dryer 2048
Coffee maker 1991
Dishes and silverware 1916

(The safety-related ones are at the top: Smoke alarm and Carbon monoxide alarm. Of course, safety first, as always. Wifi, Long term stays allowed and Heating follow, then Air conditioning, etc. I visited Boston 2 years ago, and I remember how freezing it was in the winter! Anyway, if you are going to be an Airbnb host in Boston, it might be beneficial to double-check your standard room items against this list.)
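Counting amenities takes one more step than counting neighborhoods, because each listing carries a whole list of them. The sketch below assumes the ‘amenities’ column stores a JSON-style list as text for each listing; that format and the sample rows are assumptions for illustration.

```python
import json
from collections import Counter

# Invented sample of the 'amenities' column: one JSON-style list per listing.
amenities_column = [
    '["Wifi", "Smoke alarm", "Kitchen"]',
    '["Smoke alarm", "Heating"]',
    '["Wifi", "Smoke alarm"]',
]

amenity_counts = Counter()
for raw in amenities_column:
    # Parse the text into a list; set() guards against duplicates in one listing.
    amenity_counts.update(set(json.loads(raw)))

# The two most frequently offered amenities.
top_amenities = amenity_counts.most_common(2)

print(top_amenities)
```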

---

Distribution of Boston Airbnb Listing Price 2021

The distribution is right-skewed.
$60-$80 is the most common price range.
Overall, $80-$160 is popular.
$180-$220 is also quite common, since Boston is a city with a high standard of living.
The number of listings drops significantly above $220.
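To see how such a reading comes out of the data, prices can be grouped into $20-wide bins and counted. The prices below are invented, and the bin width is my choice for the example. A mean well above the median is one simple sign of a right-skewed distribution.

```python
import pandas as pd

# Invented nightly prices: many cheap listings and a long tail of expensive ones.
prices = pd.Series([65, 70, 75, 95, 120, 150, 199, 210, 450])

# $20-wide bins from $0 to $500; each interval is open on the left, closed on the right.
bin_edges = list(range(0, 501, 20))
listings_per_bin = pd.cut(prices, bins=bin_edges).value_counts().sort_index()

# The bin with the most listings.
busiest_bin = listings_per_bin.idxmax()

# In a right-skewed distribution the mean is pulled above the median.
mean_price = prices.mean()
median_price = prices.median()

print(busiest_bin, round(mean_price, 1), median_price)
```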

---

Now, we enter the dragon: Machine Learning.

Lobster in a Boston restaurant (Union Oyster House). Taken @ 2019.12.12

I drop the listings priced above $800, since they are rare among Boston listings, and treat them as outliers, i.e. exceptional cases. Otherwise, these extreme values would distort the fitted model.

“Special cases aren’t special enough to break the rules.” (Zen of Python)

We then need to split data into

Independent Variables (All the variables that would impact Dependent Variable)
Dependent Variable (In this case, the ‘price’)

For independent variables, we merge the numeric subset and categorical subset (One-Hot encoded).

For feature selection, I use only part of the variables in the numeric subset, e.g.:

‘accommodates’, ‘bedrooms’, ‘beds’, ‘number_of_reviews’, ‘review_scores_rating’. For example, several variables are related to reviews, so I picked a single overall one, the review rating.

For feature selection of categorical variables, I selected ‘host_is_superhost’, ‘host_identity_verified’, ‘neighbourhood_cleansed’, ‘property_type’, ‘room_type’, ‘bathrooms_text’ (encoded versions)

(There are a large number of columns once all categorical variables are flattened. If we used all of them for training, the model might overfit. So to keep the model simpler, I selected only the encoded categorical variables above.)

I chose a Linear Regression algorithm for this case. The actual fitting is done with LinearRegression from sklearn.linear_model.

(For details of project code and libraries used, you can check my code of this project in GitHub)
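For readers who want a feel for the fitting step without opening the notebook, here is a minimal end-to-end sketch on synthetic data. The prices are generated from a made-up formula, and the 70/30 split and random seed are my assumptions, not values taken from the project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic listings; the price formula below is invented for the example.
rng = np.random.default_rng(42)
n_rows = 200
data = pd.DataFrame({
    "accommodates": rng.integers(1, 9, size=n_rows),
    "bedrooms": rng.integers(1, 5, size=n_rows),
    "room_type": rng.choice(["Entire home/apt", "Private room"], size=n_rows),
})
data["price"] = (
    40
    + 25 * data["accommodates"]
    + 30 * data["bedrooms"]
    + 50 * (data["room_type"] == "Entire home/apt")
    + rng.normal(0, 20, size=n_rows)
)

# Drop outliers above $800, as described earlier.
data = data[data["price"] <= 800]

# Independent variables (one-hot encoded) and the dependent variable.
X = pd.get_dummies(data.drop(columns="price"), dtype=int)
y = data["price"]

# Hold out 30% of the rows for testing (an assumed ratio).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# score() returns R-squared for a regression model.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(round(train_r2, 2), round(test_r2, 2))
```

On this clean synthetic data the scores come out much higher than on the real listings, which are far noisier.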

Evaluation

After fitting, we have a trained model.

R-squared for train data score: 0.62
R-squared for test data score: 0.60

The model explains about 60% of the variance in price. It could serve as a bare-bones working model in real life.
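To make "explains about 60% of the variance" concrete: R-squared is 1 minus the ratio of the squared prediction errors to the squared deviations of the actual prices from their mean. A small hand computation on invented numbers:

```python
def r_squared(actual, predicted):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Invented prices and predictions, each off by $10.
actual_prices = [100.0, 150.0, 200.0, 250.0]
predicted_prices = [110.0, 140.0, 210.0, 240.0]

score = r_squared(actual_prices, predicted_prices)
print(round(score, 3))  # 0.968
```

Here the residual sum of squares is 400 and the total sum of squares is 12500, so R-squared is 1 - 400/12500 = 0.968. A score of 0.60 means the residual sum of squares is still 40% of the total.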

However, for better predictive power (aiming for R-squared > 0.7), I might need to train on more past Boston data and experiment with different feature selection combinations. (The model used about 3000 instances/rows after cleaning and one-hot encoding the categorical variables.) I might aim for 6000 working (cleaned) samples to train on.

Below are the leading coefficients output by the model.

This table shows some factors to consider when estimating Boston Airbnb prices. Most of the top ones are bathroom counts, neighborhoods, and property types. Interestingly enough, bathroom-related features appear more often than others in the leading positions. This might suggest that people place a high value on bathroom quality and having their own bathroom when considering an Airbnb option in Boston.

Anyway, the coefficients serve as a reference only; remember that ‘Correlation does not imply causation’.
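A coefficient ranking of this kind can be produced by pairing each fitted coefficient with its column name and sorting by absolute size. The sketch below uses a tiny invented dataset whose prices follow price = 80 + 25 * accommodates - 50 * is_private_room exactly; ‘is_private_room’ is a hypothetical dummy column named for the example.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Invented data generated from: price = 80 + 25*accommodates - 50*is_private_room
X = pd.DataFrame({
    "accommodates": [1, 2, 3, 4, 2, 3],
    "is_private_room": [1, 1, 0, 0, 0, 1],  # hypothetical one-hot column
})
y = [55, 80, 155, 180, 130, 105]

model = LinearRegression().fit(X, y)

# Pair coefficients with feature names, then order by absolute magnitude.
coefficients = pd.Series(model.coef_, index=X.columns)
order = coefficients.abs().sort_values(ascending=False).index
ranked = coefficients.reindex(order)

print(ranked)
```

Keep in mind that a coefficient's size depends on the scale of its feature, so a 0/1 dummy column and a count such as ‘accommodates’ are not directly comparable.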

Deployment

Boom! We have the first version of the Boston Airbnb Price Predictor model ready.

We can continuously improve the model by feeding it more accurate data and experimenting with more feature selection and cleaning strategies, cycling through the CRISP-DM loop to enhance your business.

You can apply the same approach to other cities you favor, e.g. Airbnb listings in New York, Tokyo, or Shanghai, and compare the nuances.

We haven’t deployed the model to the cloud yet; so far it has been tested and run locally. All the technical parts and the complete code are available on my GitHub.

For a deployment like this one, I would recommend running it on a serverless FaaS (Function as a Service) platform. FaaS is my recent favorite for deployment. AWS Lambda could be a choice.
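Since the model has not actually been deployed, the following is only a sketch of what a Lambda-style handler for a linear model might look like. The intercept, coefficients, and feature names are hypothetical placeholders, and the event shape assumes a request arriving with a JSON text body.

```python
import json

# Hypothetical model parameters; a real deployment would load the trained values.
INTERCEPT = 80.0
COEFFICIENTS = {"accommodates": 25.0, "is_private_room": -50.0}


def lambda_handler(event, context):
    """Return a price estimate for the listing features in the request body."""
    features = json.loads(event["body"])

    # Linear model: intercept plus the weighted sum of the features.
    price = INTERCEPT + sum(
        weight * float(features.get(name, 0))
        for name, weight in COEFFICIENTS.items()
    )

    return {
        "statusCode": 200,
        "body": json.dumps({"predicted_price": round(price, 2)}),
    }


# Local usage example with a fake event; no cloud resources are needed.
sample_event = {"body": json.dumps({"accommodates": 2, "is_private_room": 1})}
response = lambda_handler(sample_event, None)
print(response["body"])
```

A linear model suits this style of deployment well, because a prediction needs only the coefficients and a few multiplications, with no heavy libraries at request time.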

Conclusion

Thanks for reading.

We have walked through a CRISP-DM cycle from understanding a business problem to running a real-life model by mining a real dataset.

I am certainly not satisfied with 0.6 R-squared. It would be an excellent opportunity to improve this model when I get more data and try more experiments.

If you have any questions, feel free to contact me.

Happy mining.

Alan Weiming Chen / Oct 16th 2021

Written by Weiming Chen