Data Cleaning

Dhruv Khanna
3 min read · Jun 24, 2021

This tutorial is part one of a series on the sub-processes involved in a data science project. I will try to answer the questions that come to the mind of a data science rookie, covering the why, how, and what along the way. We will use the pandas library to work with the different datasets.

What is Data?

Data is “anything and everything”: any collection of recorded facts.

Every click in the browser, every purchase record tied to a customer, images, texts, even videos of penguins: anything that can be stored on our hard drives is data. The concept is simple, but data is sometimes mistakenly treated as a synonym for information. The two are different. Information is the useful part that is extracted from data. Every piece of information is data, but the reverse is not true.

What is Cleaning?

Cleaning is the removal of unwanted parts to make a space clean and tidy so that it can be used easily.

Data Cleaning

It is the process of refining the source data by applying various techniques to make it ready for further analysis and processing. It is one of the first steps of a data science project, much like the preparation done before cooking.

Data cleaning gets rid of unwanted data that causes errors and biases. Source data is often inconsistent, so we cannot assume it is tailor-made for the later steps. Even when it looks clean, checking it is good practice.

Data cleaning is used in fields like data analytics, machine learning, and data science. It is one of the most crucial steps in all of them, and often one of the most time-consuming. It takes time because each feature in the data has to be treated on its own terms, and different features can respond differently to the same technique.

Here you will learn about the following:

  1. Handling Missing Values
  2. Scaling and Normalization
  3. Parsing Dates
  4. Character Encodings
  5. Dealing with Inconsistent Data Entry

Handling Missing Values

What is a missing value?

It is a data point where a value was supposed to be present but is empty for some reason.
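To make this concrete, here is a minimal sketch of how missing values can be spotted with pandas. The toy dataset and its column names are made up for illustration; only `isna()` and `sum()` are standard pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; the columns are invented for this example.
df = pd.DataFrame({
    "city": ["Delhi", None, "Mumbai", "Pune"],
    "age": [25, 31, np.nan, 40],
})

# isna() marks every missing cell as True; sum() then counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Share of all cells in the table that are missing.
missing_fraction = df.isna().sum().sum() / df.size
print(f"{missing_fraction:.1%} of cells are missing")
```

Counting missing values per column first is a cheap way to see which features need attention before choosing how to treat them.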

What is the easiest way of dealing with missing values?

Simple: just delete the corresponding record. But deleting a record also throws away every valid value in it, so this is the approach of the common man, not ours.
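A small sketch shows how costly the delete-everything approach can be. The data below is hypothetical; `dropna()` is the pandas call that removes every row containing at least one missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical data: each of the last three rows is missing exactly one value.
df = pd.DataFrame({
    "city": ["Delhi", None, "Mumbai", "Pune"],
    "age": [25, 31, np.nan, 40],
    "income": [50000, 62000, 58000, np.nan],
})

# dropna() with default arguments drops any row that has a missing cell.
cleaned = df.dropna()

rows_lost = len(df) - len(cleaned)
print(f"Dropped {rows_lost} of {len(df)} rows")
```

In this toy case only one row survives, even though most of the individual cells were perfectly valid.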

How does a Data Scientist work with missing values?

Before going further, we must know the main types of data you might work with and how to deal with missing values in each.

Data can be broadly categorized into three types:

  1. Categorical Data
  2. Numerical Data (Sometimes what you think is numerical data might be categorical data)
  3. Other (This includes images, texts, and all other formats of data that are unstructured)
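As an illustration of the note on numerical data that is really categorical, consider a postal code column: it is stored as numbers, but averaging it means nothing. The sketch below uses a made-up dataset and assumes you have decided that `pin_code` is a label and not a quantity.

```python
import pandas as pd

# Hypothetical data: pin_code looks numerical but is really a label.
df = pd.DataFrame({
    "pin_code": [110001, 400001, 110001, 411001],
    "price": [250.0, 300.0, 275.0, 320.0],
})

# Converting to the pandas "category" dtype makes the intent explicit.
df["pin_code"] = df["pin_code"].astype("category")
print(df.dtypes)

# Number of distinct categories in the column.
n_categories = df["pin_code"].cat.categories.size
print(n_categories)
```

A quick test is to ask whether arithmetic on the column makes sense. If it does not, the column is probably categorical.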

Here we will look at just the first two:

Dealing with Categorical Data

There are many ways of dealing with categorical data, but here we will look at some of the more common ones. If you want to take a deeper dive, you can check it here.
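As a rough sketch, two widely used options for missing categorical values are filling with the most frequent category and adding an explicit "Unknown" category. These two are chosen here as an assumption for illustration, and the `color` column is hypothetical.

```python
import pandas as pd

# Hypothetical categorical column with two missing entries.
df = pd.DataFrame({"color": ["red", "blue", None, "red", None]})

# Option 1: replace missing entries with the most frequent category (the mode).
most_frequent = df["color"].mode()[0]
filled_mode = df["color"].fillna(most_frequent)

# Option 2: treat "missing" as a category of its own.
filled_unknown = df["color"].fillna("Unknown")

print(filled_mode.tolist())
print(filled_unknown.tolist())
```

Filling with the mode keeps the set of categories unchanged but can inflate the dominant class, while an "Unknown" label preserves the fact that the value was missing.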
