Just Basic Pandas (Bonus Included!)

Jun 10, 2021

Every machine learning task begins with data. The data we obtain is rarely ready to feed to a model; it first has to go through a preparation step known as data preprocessing.

Pandas is a versatile and powerful data handling library in Python. It is a must-have skill for data scientists and machine learning engineers.

Pandas is built around the DataFrame, a two-dimensional labeled table of rows and columns.

import pandas as pd
import numpy as np

Reading Data From a File

Pandas provides functions to read files in many formats, such as CSV, JSON, Excel and HTML.

df = pd.read_csv('/content/sample_data/california_housing_test.csv')
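To see how a reader function behaves without needing a file on disk, here is a minimal sketch that parses a small CSV string held in memory. The CSV text and its column names are made up for illustration; pd.read_csv accepts either a path or a file-like object such as io.StringIO.

```python
import io
import pandas as pd

# Hypothetical CSV content, used here in place of a file on disk.
csv_text = "city,population,area\nParis,2.1,105\nRome,2.8,1285\n"

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)             # (number of rows, number of columns)
print(sample.columns.tolist())  # column names taken from the header line
```

The other readers mentioned above (pd.read_json, pd.read_excel, pd.read_html) are called the same way and also return DataFrames.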

Creating a Dataframe

df = pd.DataFrame({'a':np.random.rand(10),                # random floats in [0, 1)
                   'b':np.random.randint(10, size=10),    # random integers from 0 to 9
                   'c':[True,True,True,False,False,np.nan,np.nan,
                        False,True,True],                 # booleans with missing values
                   'city':['London','Paris','New York','Istanbul',
                           'Liverpool','Berlin',np.nan,'Madrid',
                           'Rome',np.nan],                # strings with missing values
                   'd':[3,4,5,1,5,2,2,np.nan,np.nan,0],
                   'e':[1,4,5,3,3,3,3,8,8,4]})
df

Understanding the Data

df.head()
df.tail()

df.head() and df.tail() show the first and last five rows, which gives us an idea of what the DataFrame looks like.
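Both methods also take an optional row count. A small sketch, using a made-up ten-row frame named demo:

```python
import pandas as pd

# Hypothetical frame with a single column holding 0..9.
demo = pd.DataFrame({"n": range(10)})

first_three = demo.head(3)  # first 3 rows instead of the default 5
last_two = demo.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```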

ip: df.shape
    df.info()

op: <class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           3000 non-null   float64
 1   latitude            3000 non-null   float64
 2   housing_median_age  3000 non-null   float64
 3   total_rooms         3000 non-null   float64
 4   total_bedrooms      3000 non-null   float64
 5   population          3000 non-null   float64
 6   households          3000 non-null   float64
 7   median_income       3000 non-null   float64
 8   median_house_value  3000 non-null   float64
dtypes: float64(9)
memory usage: 211.1 KB

df.info() shows the data type of each column and how many non-null values it contains.
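If you want the same information as values you can work with rather than a printed summary, a rough equivalent is to combine shape, dtypes and isna(). The frame below is a made-up example, not the housing data.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in each column.
demo = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": ["a", "b", None]})

print(demo.shape)   # (rows, columns)
print(demo.dtypes)  # data type of each column

# isna() marks missing cells; summing counts them per column.
null_counts = demo.isna().sum()
print(null_counts)
```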

What are Null Values?

Null values are missing values that can create anomalies in the data and distort calculations. In pandas they are represented by NaN (Not a Number).
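Here is a small sketch of how a missing value can affect a calculation, using a made-up Series. By default pandas skips NaN when aggregating, but with skipna=False the NaN propagates into the result.

```python
import numpy as np
import pandas as pd

# Hypothetical series with one missing value.
s = pd.Series([1.0, np.nan, 3.0])

# NaN never compares equal to anything, including itself.
print(np.nan == np.nan)

mean_skip = s.mean()              # NaN is skipped: (1 + 3) / 2
mean_keep = s.mean(skipna=False)  # NaN propagates, so the result is NaN

print(mean_skip, mean_keep)
```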

Dealing with Null Values

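df.dropna() #drops all rows that contain a null value
df.fillna(0) #fills NaN values with zeros
df.fillna(df.mean(numeric_only=True)) #fills NaN values with each column's mean
df.replace(np.nan, 0) #replaces NaN with zero
df.replace(np.nan, df['column'].mean()) #replaces NaN with the mean of one column ('column' is a placeholder name)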
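To make the difference between these options concrete, here is a minimal sketch on a made-up two-column frame. Note that each call returns a new DataFrame and leaves the original untouched unless you assign the result back.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one NaN in each column.
demo = pd.DataFrame({"d": [3.0, np.nan, 5.0], "e": [1.0, 4.0, np.nan]})

dropped = demo.dropna()                 # keeps only rows without any NaN
zero_filled = demo.fillna(0)            # every NaN becomes 0
mean_filled = demo.fillna(demo.mean())  # each column is filled with its own mean

print(dropped)
print(zero_filled)
print(mean_filled)
```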

Index and Location Based Selection

iloc and loc allow selecting part of a DataFrame.

iloc: Select by position
loc: Select by label

iloc

df.iloc[1]
longitude               -118.300
latitude                  34.260
housing_median_age        43.000
total_rooms             1510.000
total_bedrooms           310.000
population               809.000
households               277.000
median_income              3.599
median_house_value    176500.000
Name: 1, dtype: float64

loc

df.loc[:2,'total_rooms']
0    3885.0 
1    1510.0 
2    3589.0
Name: total_rooms, dtype: float64
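Notice that the loc slice :2 above returned three rows. Label slices include the end label, while position slices with iloc exclude the end position. A small sketch with a made-up frame and string labels makes the difference easier to see:

```python
import pandas as pd

# Hypothetical frame with string labels so labels and positions cannot be confused.
demo = pd.DataFrame({"rooms": [10, 20, 30, 40]}, index=["a", "b", "c", "d"])

by_label = demo.loc[:"c", "rooms"]  # labels 'a' through 'c', end label included
by_position = demo.iloc[:2, 0]      # positions 0 and 1, end position excluded

print(by_label)
print(by_position)

# Selecting a single cell both ways gives the same value.
print(demo.loc["b", "rooms"], demo.iloc[1, 0])
```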

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Bonus Library: Pandas Profiling

It's always about time efficiency, and Pandas Profiling is there to help. On a personal note, once you are comfortable with pandas, you should start using it. (Lazy learners, avoid this!)

Pandas Profiling generates a report on your data that helps you understand it in depth. It gives your data a voice to narrate itself.

!pip install pandas-profiling

Generate a report

from pandas_profiling import ProfileReport
prof = ProfileReport(df)
prof.to_file(output_file='output.html')

But every coin has two sides: the shortcoming of pandas_profiling is that it becomes slow on large datasets.
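One possible workaround, offered here as my own assumption rather than anything the article's data required, is to profile a random sample of the rows instead of the full DataFrame. The frame below is made up, and the profiling call is left commented out so the snippet runs with pandas alone.

```python
import numpy as np
import pandas as pd

# Hypothetical large frame standing in for a real dataset.
big = pd.DataFrame({"x": np.arange(100_000), "y": np.arange(100_000) % 7})

# Draw a reproducible random sample of rows to profile instead of the full frame.
sampled = big.sample(n=1_000, random_state=0)
print(sampled.shape)

# With pandas-profiling installed, the report could then be built from the sample.
# minimal=True is the library's lighter mode that skips the more expensive computations.
# from pandas_profiling import ProfileReport
# ProfileReport(sampled, minimal=True).to_file(output_file='sample_report.html')
```

A sample trades completeness for speed, so rare values may not show up in the report.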


Written by Akshar Rastogi