Data Preprocessing: Understanding the most time-consuming process.

Akshar Rastogi
3 min read · Jun 15, 2021

Data exists only after the data collection process is complete. Data collection is not necessarily an easy or consistent process: varied sources and varied parameters introduce errors into the collected data, which is then known as raw data. This does not mean that all our data arrives with errors, but the errors that do appear cannot be ignored, as they will drag down the accuracy and efficiency of our model.

What is Data Preprocessing?

The operations performed on raw data to make it model-ready are known as Data Preprocessing.

Almost every practical, real-world dataset has to be preprocessed in some way. Some datasets even require heavy preprocessing, which accounts for a major chunk of a project's time.

What is done in Data Preprocessing?

  • Knowing Our Data
  • Formatting Data Types
  • Dropping Unnecessary Features
  • Handling Missing Values
  • Training & Test Data Splitting
  • Visualization to understand the data

Datasets Used

  • Covid Dataset
  • NSE Data

Knowing Our Data

With huge datasets it is not feasible to look for missing values by scrolling through the data. The same can be done with predefined pandas functions: info() and describe().

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2964 entries, 0 to 2963
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   Date       2964 non-null   object
 1   Open       2932 non-null   float64
 2   High       2932 non-null   float64
 3   Low        2932 non-null   float64
 4   Adj Close  2932 non-null   float64
dtypes: float64(4), object(1)
memory usage: 115.9+ KB

df.describe()
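Alongside info() and describe(), a quick way to count missing values per column is isnull().sum(). A minimal sketch on a toy DataFrame (the frame below is hypothetical sample data, not the NSE dataset above):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for a raw dataset with a few gaps
df = pd.DataFrame({
    "Date": ["2021-06-01", "2021-06-02", "2021-06-03"],
    "Open": [100.0, np.nan, 102.5],
    "Close": [101.0, 99.5, np.nan],
})

df.info()                 # dtypes and non-null counts per column
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # missing-value count per column
```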

Formatting Data Types

If the data is not in the desired format, or is in some other format that does not work with our operations, we can change the data format. The most common example of this is date features.

w_d['date'] = pd.to_datetime(w_d['date'] , format='%Y-%m-%d')
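A minimal runnable sketch of the same conversion on a small hypothetical frame (w_d here is sample data, not the COVID dataset used in this article):

```python
import pandas as pd

w_d = pd.DataFrame({"date": ["2021-06-13", "2021-06-14", "2021-06-15"]})
print(w_d["date"].dtype)  # object: the dates are still plain strings

# Parse the strings into proper datetime64 values
w_d["date"] = pd.to_datetime(w_d["date"], format="%Y-%m-%d")
print(w_d["date"].dtype)  # datetime64[ns]

# Datetime columns unlock date arithmetic and the .dt accessor
print(w_d["date"].dt.day_name())
```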

Dropping Unnecessary Features

If some features in our dataset are irrelevant to our model, or just add unwanted computation or noise, we can drop those columns. One approach is to check the dependencies between features with a correlation matrix.

w_d = w_d.drop(
    ['continent', 'female_smokers', 'male_smokers', 'handwashing_facilities',
     'new_cases_smoothed', 'new_deaths_smoothed', 'new_cases_smoothed_per_million',
     'new_deaths_smoothed_per_million', 'new_tests_smoothed',
     'new_tests_smoothed_per_thousand', 'stringency_index', 'population',
     'population_density', 'median_age', 'aged_65_older', 'aged_70_older',
     'extreme_poverty', 'cardiovasc_death_rate', 'diabetes_prevalence',
     'hospital_beds_per_thousand', 'life_expectancy', 'human_development_index'],
    axis=1)
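The correlation check mentioned above can be sketched like this, on synthetic data with hypothetical column names (not the COVID dataset):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
base = rng.normal(size=1000)
df = pd.DataFrame({
    "target": base,
    "useful": base + rng.normal(scale=0.1, size=1000),  # strongly correlated
    "noise": rng.normal(size=1000),                     # unrelated feature
})

# Correlation of every feature with the target
corr = df.corr()["target"].drop("target")
print(corr)

# Drop features whose absolute correlation with the target is weak
weak = corr[corr.abs() < 0.2].index
df = df.drop(columns=weak)
print(df.columns.tolist())
```

The 0.2 threshold is arbitrary; in practice you would pick a cutoff (and the whole feature-selection strategy) based on the model and domain.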

Handling Missing Values

It’s very common to encounter missing values in your dataset. These missing values will cause errors when fed through the model.

There are two approaches to handling missing values:

  • Eliminating the rows with missing values, by dropping them. This can eliminate a lot of crucial data, which creates a problem of its own.
  • Imputing the missing values, by replacing them with statistical substitutes such as zeros, the mean, median, or mode.
world_data = world_data.dropna(axis='columns')
y_pred2 = y_pred2.fillna(y_pred2.mean())
#filling missing values with the column mean
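A minimal sketch of both strategies side by side on toy data (the frame below is hypothetical, not one of the datasets above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})

# Strategy 1: drop every row containing any missing value
dropped = df.dropna()
print(dropped)  # only the first row survives

# Strategy 2: impute missing values with each column's mean
imputed = df.fillna(df.mean())
print(imputed)  # NaNs replaced by 2.0 (mean of a) and 4.5 (mean of b)
```

Note that fillna(df.mean()) needs the call parentheses on mean(); passing the unbound df.mean method would silently fill nothing useful.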

Training & Test Data

The data has to be split into training and test sets, so that we can train our model on the former and check its accuracy on the latter.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
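The same split, runnable end to end on toy arrays (hypothetical data standing in for the features and target used in the article):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (10 samples, 2 features) and target vector
x = np.arange(20).reshape(10, 2)
y = np.arange(10)

# 30% of the rows go to the test set; random_state fixes the shuffle
# so the split is reproducible across runs
X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)  # (7, 2) (3, 2)
```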

Visualization to understand the data

Just looking at the numbers makes statistical inference hard, as it is very rare to be able to spot patterns in a feature by reading raw values alone. Here statistical plotting plays an important role; for example, distplot() in seaborn shows the distribution of the data. (In newer seaborn versions, distplot() is deprecated in favour of histplot() and displot().)

sns.distplot(x=dfd['Close'])

There are numerous other approaches and methodologies for data preprocessing.

Notebooks Links-
