Inside Airbnb Boston 2021: Building a Machine Learning Price Predictor model for Boston Listings

Building in Boston. Taken @ 2019.12.12

In this post, I conduct a data mining effort on the Airbnb Boston Listings 2021 dataset, following the CRISP-DM standard. I build a listing price estimator model with Machine Learning and answer some business questions related to this dataset.

  • For people who are interested in how to build a price estimator or predictor model.
  • For Airbnb users (hosts, end users, travelers) who want to better understand how to price their rooms, compare and select rooms within Boston, or even negotiate with hosts.
  • For agencies that want to start a Boston-based Airbnb business.
  • For real estate people who want to discover further patterns in the Boston property domain.

This post is intended as a non-technical overview.

To keep it simple, I abstract away most of the technical parts.

I host the project code on GitHub; you can check out the full Jupyter notebook here:

Data can be found here:

Let’s get started.

To organize this post, we follow the CRISP-DM cycle:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modelling
  5. Evaluation
  6. Deployment

Let’s define the questions we want to answer:

1) What are the most popular property types in Boston listings 2021?

2) What are the most popular / in-demand neighborhoods in Boston listings 2021?

3) What are the standard amenities hosts are offering in Boston listings 2021?

4) How are prices distributed in Boston listings 2021?

5) Boston Pricing Model. What are some top factors that contribute to price?

For example, suppose you were a property investor. By understanding these questions, you might be able to develop a strategy: offer rooms in high-demand neighborhoods with popular property types, make sure the essential amenities are provided, and set a reasonable price checked against the model, all aiming at a better Airbnb rental income.

Covid-19 has changed the tourism and rental industries dramatically. Things have changed a lot and are more dynamic, so I tried to use the latest data available. At the time of writing, the dataset used is from Inside Airbnb’s September 2021 scrape.

I wouldn’t combine it with historical Boston data, especially pre-Covid-19 data.

The dataset has 3,123 listings, with 74 variables.

Some variables have missing values, such as:

‘neighbourhood_group_cleansed’, ‘bathrooms’, ‘calendar_updated’, ‘license’, ‘neighborhood_overview’, ‘neighbourhood’, ‘host_about’, ‘review_scores_value’, ‘beds’, ‘bedrooms’…

Note that Airbnb does not encode the number of bathrooms numerically, so the ‘bathrooms’ column is entirely empty. Instead, the information is stored as text in another column, ‘bathrooms_text’, which I treat as a categorical variable, with values such as: ‘1 bath’, ‘1.5 shared bath’, ‘half private bath’

I drop columns that contain a significant amount of missing/empty values (e.g. > 75% missing).
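The column-dropping rule above can be sketched in pandas roughly as follows. This is a minimal illustration on a toy table, not the notebook's actual code; the sample values and the names `listings`, `MISSING_THRESHOLD` and `cleaned` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the listings table (the real one has 74 columns).
listings = pd.DataFrame({
    "bathrooms": [np.nan, np.nan, np.nan, np.nan],         # entirely empty
    "calendar_updated": [np.nan, np.nan, np.nan, np.nan],  # entirely empty
    "beds": [1.0, 2.0, np.nan, 1.0],                       # 25% missing
    "price": ["$100.00", "$85.00", "$240.00", "$60.00"],
})

MISSING_THRESHOLD = 0.75  # drop columns with more than 75% missing values

# Fraction of missing values per column.
missing_share = listings.isna().mean()

# Keep only the columns at or below the threshold.
columns_to_drop = missing_share[missing_share > MISSING_THRESHOLD].index
cleaned = listings.drop(columns=columns_to_drop)

print(list(cleaned.columns))
```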

For categorical columns, I fill missing values with the most common value in that column.

For some numeric columns, I fill missing values with the median rather than the mean. (For example, we can’t have 1.37 beds, so the median, which here is a whole number, is the better choice.) For a missing review rating score, filling with 0 as an initial value is more meaningful; it would not be fair to assign the average ‘4.xx’ rating to a listing that has no rating.

So the way to fill missing/empty data is always case by case: it depends on what the column represents.

Developing a fillna strategy per variable helps improve the model we build later.
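As a rough sketch of the three fill strategies described above (most common value, median, and 0), assuming a pandas DataFrame with a few of the dataset's column names and made-up sample values:

```python
import numpy as np
import pandas as pd

# Toy data; values are invented for illustration.
listings = pd.DataFrame({
    "host_is_superhost": ["t", "f", None, "f"],
    "beds": [1.0, 2.0, np.nan, 4.0],
    "review_scores_rating": [4.8, np.nan, 4.5, np.nan],
})

filled = listings.copy()

# Categorical column: fill with the most common value (the mode).
most_common = filled["host_is_superhost"].mode()[0]
filled["host_is_superhost"] = filled["host_is_superhost"].fillna(most_common)

# Count-like numeric column: fill with the median rather than the mean.
filled["beds"] = filled["beds"].fillna(filled["beds"].median())

# Review score: a listing without a rating starts at 0 instead of the average.
filled["review_scores_rating"] = filled["review_scores_rating"].fillna(0)

print(filled)
```

Which columns get which strategy is a judgment call per variable; the assignment here only mirrors the examples given in the text.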

The dependent variable ‘price’ is stored in U.S. currency format, e.g. ‘$1,234.00’. Before computation, we need to convert it to a pure numeric format: ‘$1,234.00’ => 1234.00
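One way to do this conversion in pandas is to strip the dollar sign and thousands separators and cast to float. A minimal sketch, with sample values made up for the example:

```python
import pandas as pd

raw_price = pd.Series(["$1,234.00", "$85.00", "$240.50"])

# Remove "$" and "," and convert the remaining text to a float.
price = raw_price.str.replace(r"[$,]", "", regex=True).astype(float)

print(price.tolist())
```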

There are quite a few categorical variables. One strategy is One-Hot Encoding: we generate a new dummy column for each value of a variable and encode it as 1 or 0 to mark whether a listing has that value.

  1. Subset the numeric columns.
  2. Subset the categorical columns and flatten them into a one-hot set.
  3. Re-combine the numeric set and the categorical (encoded) set.


(Image made with OmniGraffle)
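The three steps above can be sketched with pandas as follows. This is an illustration on a toy table; using `pd.get_dummies` is one common choice and an assumption here, and the variable names are made up.

```python
import pandas as pd

# Toy data using a few of the dataset's column names.
listings = pd.DataFrame({
    "accommodates": [2, 4, 1],
    "beds": [1.0, 2.0, 1.0],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room"],
    "bathrooms_text": ["1 bath", "1.5 shared bath", "1 bath"],
})

# 1. Subset the numeric columns.
numeric = listings.select_dtypes(include="number")

# 2. Subset the categorical columns and flatten them into 0/1 dummy columns.
categorical = listings.select_dtypes(exclude="number")
encoded = pd.get_dummies(categorical, dtype=int)

# 3. Re-combine the numeric set and the encoded set.
model_ready = pd.concat([numeric, encoded], axis=1)

print(model_ready.columns.tolist())
```

Each distinct value becomes its own column (for example `room_type_Private room`), which is why the column count grows quickly on the full dataset.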

Now we have a ready-to-roll cleaned dataset to feed to Machine Learning.

Now we come to the exciting part: Modeling, Mining.

Before we start building the model, let me quickly answer the questions above.

Most popular neighborhoods (number of listings):

  1. Dorchester 418
  2. Downtown 293
  3. Roxbury 263
  4. South End 235
  5. Brighton 232
  6. Jamaica Plain 215
  7. Back Bay 195
  8. East Boston 180
  9. Allston 158
  10. Beacon Hill 137
  11. South Boston 135
  12. Fenway 110
  13. North End 80
  14. Charlestown 70
  15. Roslindale 56
  16. Mission Hill 55
  17. Bay Village 50
  18. South Boston Waterfront 47
  19. Chinatown 43
  20. West End 38

(“Location, Location, Location”)

Most popular property types (number of listings):

  1. Entire rental unit 1283
  2. Private room in rental unit 526
  3. Private room in residential home 378
  4. Entire condominium (condo) 264
  5. Entire serviced apartment 184
  6. Entire residential home 126
  7. Private room in condominium (condo) 71
  8. Entire guest suite 47
  9. Private room in townhouse 38
  10. Room in boutique hotel 36

(Entire units and private rooms of all kinds are in top demand. Entire units seem especially popular in Boston, while other cities favor rooms more.)

Most common amenities (number of listings offering each):

  1. Smoke alarm 3017
  2. Wifi 2996
  3. Long term stays allowed 2960
  4. Carbon monoxide alarm 2813
  5. Heating 2806
  6. Essentials 2804
  7. Kitchen 2795
  8. Hangers 2728
  9. Hair dryer 2581
  10. Air conditioning 2553
  11. Iron 2505
  12. Shampoo 2387
  13. Hot water 2291
  14. Microwave 2172
  15. Refrigerator 2151
  16. Washer 2097
  17. Dedicated workspace 2096
  18. Dryer 2048
  19. Coffee maker 1991
  20. Dishes and silverware 1916

(Safety-related items are at the top: smoke alarm and carbon monoxide alarm. Safety first, as always. Wifi, long-term stays, and heating follow, then air conditioning, etc. I visited Boston two years ago, and I remember how freezing it was in winter! Anyway, if you plan to be an Airbnb host in Boston, it may help to check your room against this list.)
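Counts like the ones above can be produced by tallying every amenity across all listings. The sketch below assumes each listing stores its amenities as a JSON-style list in a single text field; that format and the sample values are assumptions for illustration.

```python
import json
from collections import Counter

# Toy stand-in for the amenities column, one string per listing.
amenities_column = [
    '["Smoke alarm", "Wifi", "Heating"]',
    '["Wifi", "Kitchen"]',
    '["Smoke alarm", "Wifi"]',
]

counts = Counter()
for raw in amenities_column:
    # Parse the text into a Python list and count each amenity once per listing.
    counts.update(json.loads(raw))

top_amenities = counts.most_common(2)
print(top_amenities)
```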

Price distribution:

  • The distribution is right-skewed.
  • $60 to $80 is the most common price range.
  • Overall, $80 to $160 is popular.
  • $180 to $220 is also quite common, since Boston is a relatively expensive city.
  • The number of listings drops significantly above $220.

Now, we enter the dragon: Machine Learning.

Lobster in a Boston restaurant (Union Oyster House). Taken @ 2019.12.12

I drop the listings priced above $800, since they are rare in Boston and I treat them as outliers, exceptional cases. Otherwise, they would distort the fitted model.

“Special cases aren’t special enough to break the rules.” (Zen of Python)
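The outlier filter is a one-line operation in pandas. A minimal sketch, assuming ‘price’ has already been converted to a number; the sample prices and the name `PRICE_CAP` are made up for the example:

```python
import pandas as pd

listings = pd.DataFrame({"price": [60.0, 120.0, 799.0, 800.0, 1500.0]})

PRICE_CAP = 800  # listings priced above this are treated as outliers

# Keep only the listings at or below the cap.
trimmed = listings[listings["price"] <= PRICE_CAP]

print(trimmed["price"].tolist())
```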

We then need to split data into

  1. Independent Variables (All the variables that would impact Dependent Variable)
  2. Dependent Variable (In this case, the ‘price’)

For independent variables, we merge the numeric subset and categorical subset (One-Hot encoded).

For feature selection, I select only part of the variables in the numeric subset, e.g.:

‘accommodates’, ‘bedrooms’, ‘beds’, ‘number_of_reviews’, ‘review_scores_rating’. For example, several variables are related to reviews, so I picked the overall review rating to represent them.

For the categorical variables, I selected ‘host_is_superhost’, ‘host_identity_verified’, ‘neighbourhood_cleansed’, ‘property_type’, ‘room_type’, ‘bathrooms_text’ (encoded versions).

(Flattening all categorical variables produces a very large number of columns. If we used all of them for training, the model might overfit. So to keep the model simpler, I selected only the encoded categorical variables above.)
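A rough sketch of this selection step: pick the numeric features by name, pick the encoded columns by their prefix, and separate ‘price’ as the dependent variable. The toy table, the shortened feature lists, and the prefix-matching approach are assumptions for illustration; ‘host_id’ and ‘host_verifications_email’ merely stand in for columns that are left out.

```python
import pandas as pd

# Toy stand-in for the cleaned, one-hot encoded table.
model_ready = pd.DataFrame({
    "accommodates": [2, 4, 1],
    "bedrooms": [1.0, 2.0, 1.0],
    "host_id": [11, 22, 33],                  # numeric but not selected
    "room_type_Entire home/apt": [1, 1, 0],
    "room_type_Private room": [0, 0, 1],
    "host_verifications_email": [1, 0, 1],    # encoded but not selected
    "price": [150.0, 240.0, 70.0],
})

numeric_features = ["accommodates", "bedrooms"]
categorical_prefixes = ("room_type_",)  # one prefix per selected categorical variable

# Encoded columns are named "<variable>_<value>", so match on the prefix.
encoded_features = [c for c in model_ready.columns if c.startswith(categorical_prefixes)]

X = model_ready[numeric_features + encoded_features]  # independent variables
y = model_ready["price"]                              # dependent variable

print(X.columns.tolist())
```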

I chose a Linear Regression algorithm for this case. The actual fitting is done with LinearRegression from sklearn.linear_model.

(For details of project code and libraries used, you can check my code of this project in GitHub)

After fitting, we have a model.

  • R-squared on training data: 0.62
  • R-squared on test data: 0.60

The model explains about 60% of the variance in price. It could serve as a bare-bones working model in real life.
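For readers who want to see the shape of the fitting step, here is a minimal sketch on synthetic data. The data, the 70/30 split, and the random seed are all assumptions for the example, so the scores it prints are unrelated to the 0.62 / 0.60 reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: price depends on two features plus noise.
rng = np.random.default_rng(42)
n = 500
accommodates = rng.integers(1, 9, size=n)
bedrooms = rng.integers(1, 5, size=n)
X = np.column_stack([accommodates, bedrooms])
y = 40 + 25 * accommodates + 30 * bedrooms + rng.normal(0, 40, size=n)

# Hold out part of the data to check how the model generalizes.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# .score() returns R-squared for a regressor.
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
print(f"train R-squared: {train_r2:.2f}, test R-squared: {test_r2:.2f}")
```

A small gap between the training and test scores, as in the results above, suggests the model is not overfitting badly.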

However, for better predictive reliability (aiming for R-squared > 0.7), I might need to train on more past Boston data and experiment with different feature selection combinations. (The model used about 3,000 rows after cleaning and one-hot encoding the categorical variables.) I might aim for 6,000 working (cleaned) samples to train on.

Below are the leading coefficients from the model.

This table shows some factors to consider when estimating Boston Airbnb prices. Most of the top ones are bathroom counts, neighborhoods, and property types. Interestingly, bathroom-related features appear more often than the others near the top. This might suggest that people place a high value on bathroom quality and on having a private bathroom when considering an Airbnb in Boston.

Anyway, the coefficients serve as a reference. Remember: ‘Correlation does not imply causation’.
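A coefficient ranking of this kind can be produced by pairing each fitted coefficient with its feature name and sorting by absolute size. The sketch below uses a tiny made-up dataset with two dummy features, so the numbers are illustrative only and are not the model's real coefficients.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: price = 100 + 80 * (2 baths) - 40 * (private room), exactly.
X = pd.DataFrame({
    "bathrooms_text_2 baths": [0, 1, 0, 1],
    "room_type_Private room": [0, 0, 1, 1],
})
y = [100.0, 180.0, 60.0, 140.0]

model = LinearRegression().fit(X, y)

# Pair coefficients with feature names and rank by absolute magnitude.
coefficients = pd.Series(model.coef_, index=X.columns)
ranked = coefficients.sort_values(key=lambda s: s.abs(), ascending=False)

print(ranked)
```

With dummy variables, each coefficient reads as the estimated price difference for listings that have that value, holding the other features fixed.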

Boom! We have the first version of the Boston Airbnb Price Predictor model ready.

We can continuously improve the model by feeding more accurate data and experimenting with more feature selection and cleaning strategies, cycling the loop of CRISP-DM to enhance your business.

You can apply the same approach to other cities you favor, e.g. Airbnb in New York, Tokyo, or Shanghai, and compare the nuances.

We haven’t deployed the model to the cloud yet. However, the model was tested and run locally. All the technical parts and the complete code are available on my GitHub.

For a deployment like this one, I would recommend putting it on a serverless FaaS (Function as a Service) platform. FaaS is my recent favorite for deployment. AWS Lambda could be a choice.
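To give a feel for what such a function could look like, here is a hypothetical sketch of a Lambda-style handler in Python. Nothing here is deployed or taken from the project: the intercept and coefficients are placeholder values, not the fitted model's, and the request format (a JSON string under "body") is an assumption.

```python
import json

# Placeholder values for illustration only; not the fitted model's coefficients.
INTERCEPT = 50.0
COEFFICIENTS = {"accommodates": 20.0, "bedrooms": 30.0}


def lambda_handler(event, context):
    """Return a price estimate for the features sent in the request body."""
    body = json.loads(event.get("body") or "{}")

    # A linear model is just intercept + sum(coefficient * feature value).
    price = INTERCEPT + sum(
        weight * float(body.get(name, 0)) for name, weight in COEFFICIENTS.items()
    )
    return {
        "statusCode": 200,
        "body": json.dumps({"predicted_price": round(price, 2)}),
    }


# Local usage example with a fake request.
response = lambda_handler(
    {"body": json.dumps({"accommodates": 2, "bedrooms": 1})}, None
)
print(response["body"])
```

Because a fitted linear model reduces to an intercept and a set of coefficients, the function needs no heavy dependencies at prediction time, which suits a small serverless function well.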

Thanks for reading.

Street of Boston. Taken @ 2019.12.12

We have walked through a CRISP-DM cycle from understanding a business problem to running a real-life model by mining a real dataset.

I am certainly not satisfied with 0.6 R-squared. It would be an excellent opportunity to improve this model when I get more data and try more experiments.

If you have any questions, feel free to contact me.

Happy mining.

Alan Weiming Chen / Oct 16th 2021

(aka gilzero). I occasionally write about Software Development, Web Development, Machine Learning.