Pandas Data Cleaning: Remove rows with empty data or missing values
A trivia task that for data cleaning or machine learning model preparation is to remove rows or columns that have empty data or missing values.
Example of removing rows.
Here are two quick ways to do it.
- Use dropna() that when you don’t care which rows, just drop them.
df.dropna(axis='rows')
2. Use drop() with a list of row numbers. Sometimes you do care which rows. Keep a copy of row numbers that has NaN. So we define a custom empty_rows() helper function first.
def get_empty_rows(df):
indices = []
for i, s in df.iterrows():
for v in s.values:
if str(v).lower() == 'nan':
if i not in indices:
indices.append(i)
return indices
If readability is really not what you concern, I also make a one liner function for you as below: (Remeber ‘Readability counts’ from Zen? I would say it is ok for a quick helper function as long as you documented properly)
def get_empty_rows(dataset):
# Nested 'for' loop and unique value list of 'set'
return list(set([i for i, s in dataset.iterrows() for v in s.values if str(v) == 'nan']))
(if the dataset is hugh, runing in iterrows() loops may cause performance issue. You should consider subset to chunk then proceed and merge end results)
Then we use drop() to take list of row number to remove. We keep a copy of the list, in case we want to audit it later or manually alter it. For example, some rows are still valuable even they were missing some data. (Always case by case, depends on your business logic)
list_of_empty_rows = get_empty_rows(df)
df.drop(list_of_empty_rows)
A more ‘pandas’ way is to to subset based on that condition, then get index list of that subset. .e.g
index_list = df[df['price'] < 20].indexindex_list = df[df[‘name’] == np.nan].indexindex_list = df[df['email'].isnull()].indexdf.drop(index_list)
This approach utilizes numpy vectorization feature. It actually performs much faster than above manual iteration. Recommended.
And another one-liner to subset any row(s) contain missing / empty value:
df[df.isnull().any(axis=1)]
Drop rows that with particular colums contain missing value:
df.dropna(subset=['costPrice', 'SuggesedPrice'])
It will remove rows whose ‘costPrice’ and/or ‘SuggessedPrice’ missing.
Dropping is quick, always bear in mind how to drop, whether to drop, should I drop.
Unnecessry dropping may impact your ultimate model test data and outcome.