Just Basic Pandas (Bonus Included!)

Akshar Rastogi
3 min read · Jun 10, 2021

Every machine learning task begins with data. The data we obtain is rarely ready to feed into a model; it first has to go through data preparation, known as data preprocessing.

Pandas is a versatile and powerful data-handling library for Python. It is a must-have skill for data scientists and machine learning engineers.

Pandas is built around the DataFrame, a two-dimensional table with labeled rows and columns.

import pandas as pd
import numpy as np

Reading Data From a File

Pandas provides functions to read files in many formats, such as CSV, JSON, Excel, and HTML.

df = pd.read_csv('/content/sample_data/california_housing_test.csv')
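As a small, self-contained sketch of the same reader functions, the snippet below parses CSV and JSON text held in memory instead of files on disk. The column names and values are made up for illustration; `pd.read_csv` and `pd.read_json` accept a file path or a file-like object.

```python
import io
import pandas as pd

# Hypothetical two-row dataset, written out as CSV text.
csv_text = "city,population\nParis,2100000\nRome,2800000\n"
csv_df = pd.read_csv(io.StringIO(csv_text))

# The same records as JSON (a list of objects, one per row).
json_text = (
    '[{"city": "Paris", "population": 2100000},'
    ' {"city": "Rome", "population": 2800000}]'
)
json_df = pd.read_json(io.StringIO(json_text))

print(csv_df)
print(json_df)
```

Both calls should produce a DataFrame with the same two columns, which is the point: once the data is loaded, the source format no longer matters.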

Creating a Dataframe

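# Named demo_df so it does not overwrite the housing data loaded above.
# Every key must be unique: a repeated key silently replaces the earlier column.
demo_df = pd.DataFrame({'a':np.random.rand(10),
                        'b':np.random.randint(10, size=10),
                        'c':[True,True,True,False,False,np.nan,np.nan,
                             False,True,True],
                        'city':['London','Paris','New York','Istanbul',
                                'Liverpool','Berlin',np.nan,'Madrid',
                                'Rome',np.nan],
                        'd':[3,4,5,1,5,2,2,np.nan,np.nan,0],
                        'e':[1,4,5,3,3,3,3,8,8,4]})
demo_df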

Understanding the Data

df.head()
df.tail()

df.head() and df.tail() show the first and last 5 rows, which gives us an idea of what the DataFrame looks like.

ip: df.shape
df.info()
op:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3000 entries, 0 to 2999
Data columns (total 9 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   longitude           3000 non-null   float64
 1   latitude            3000 non-null   float64
 2   housing_median_age  3000 non-null   float64
 3   total_rooms         3000 non-null   float64
 4   total_bedrooms      3000 non-null   float64
 5   population          3000 non-null   float64
 6   households          3000 non-null   float64
 7   median_income       3000 non-null   float64
 8   median_house_value  3000 non-null   float64
dtypes: float64(9)
memory usage: 211.1 KB

df.info() shows the data type of each column and how many non-null values it contains.
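If only part of that summary is needed, the same information can be pulled out piece by piece. The sketch below uses a small made-up frame (not the housing data) with the standard `shape` and `dtypes` attributes and the `isna()` method.

```python
import numpy as np
import pandas as pd

# Made-up frame with one missing value in each column.
demo = pd.DataFrame({
    "rooms": [3.0, np.nan, 5.0],
    "city": ["Paris", "Rome", None],
})

print(demo.shape)    # (rows, columns)
print(demo.dtypes)   # data type of each column

# isna() marks missing cells as True; summing counts them per column.
null_counts = demo.isna().sum()
print(null_counts)
```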

What are Null Values?

Null values are missing values that can distort the data and alter calculations. In pandas they are usually represented by NaN (Not a Number).
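To make the effect on calculations concrete, here is a minimal sketch with made-up numbers. pandas reductions skip NaN by default (`skipna=True`), so the result quietly describes fewer rows than the data contains; with `skipna=False` the NaN propagates into the result instead.

```python
import numpy as np
import pandas as pd

# Made-up values; the third entry is missing.
s = pd.Series([10.0, 20.0, np.nan, 30.0])

mean_skipping_nan = s.mean()         # averages only the 3 known values
sum_with_nan = s.sum(skipna=False)   # NaN propagates, so the sum is NaN
count_non_null = s.count()           # counts only non-null entries

print(mean_skipping_nan, sum_with_nan, count_non_null)
```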

Dealing with Null Values

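df.dropna() #drops every row that contains a null value
df.fillna(0) #fills NaN values with zeros
df.fillna(df.mean(numeric_only=True)) #fills NaN values with each column's mean
df.replace(np.nan, 0) #replaces NaN with zeros
df.replace(np.nan, df['column'].mean()) #replaces NaN with the mean of one column ('column' is a placeholder name)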
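These methods return a new DataFrame rather than modifying `df` in place, so the result has to be assigned to a variable. A minimal sketch on a made-up frame (the column names are placeholders):

```python
import numpy as np
import pandas as pd

# Made-up frame: 'price' has one missing value, 'rooms' has none.
demo = pd.DataFrame({"price": [100.0, np.nan, 300.0], "rooms": [2, 3, 4]})

dropped = demo.dropna()        # keeps only the rows without any NaN
zero_filled = demo.fillna(0)   # NaN becomes 0

# A dict maps each column to its own fill value; here the column mean.
mean_filled = demo.fillna({"price": demo["price"].mean()})

print(dropped)
print(zero_filled)
print(mean_filled)
print(demo)  # the original frame still contains the NaN
```

Filling with the mean keeps the column average unchanged, whereas filling with zeros pulls it down, so the choice depends on what the missing value is supposed to mean.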

Index and Location Based Selection

iloc and loc allow selecting part of a DataFrame.

  • iloc: Select by position
  • loc: Select by label
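The difference is easiest to see when the index labels are not the default 0, 1, 2 positions. A small sketch with a made-up frame indexed by city name:

```python
import pandas as pd

# Made-up frame whose index labels are strings, not positions.
demo = pd.DataFrame(
    {"population": [2.1, 2.8, 3.6], "country": ["France", "Italy", "Germany"]},
    index=["Paris", "Rome", "Berlin"],
)

by_position = demo.iloc[0]           # first row, whatever its label is
by_label = demo.loc["Paris"]         # row whose index label is 'Paris'
cell = demo.loc["Rome", "country"]   # single value by row and column label

print(by_position)
print(by_label)
print(cell)
```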

iloc

df.iloc[1]
longitude               -118.300
latitude                  34.260
housing_median_age        43.000
total_rooms             1510.000
total_bedrooms           310.000
population               809.000
households               277.000
median_income              3.599
median_house_value    176500.000
Name: 1, dtype: float64

loc

df.loc[:2,'total_rooms']
0 3885.0
1 1510.0
2 3589.0
Name: total_rooms, dtype: float64
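Note that the `loc` slice above returned three rows: label slices include the end label, while `iloc` slices follow the usual Python rule and exclude the end position. A minimal sketch with made-up values:

```python
import pandas as pd

# Made-up column with the default integer index 0..4.
demo = pd.DataFrame({"total_rooms": [10.0, 20.0, 30.0, 40.0, 50.0]})

label_slice = demo.loc[:2, "total_rooms"]   # labels 0, 1 and 2 (end included)
position_slice = demo.iloc[:2, 0]           # positions 0 and 1 (end excluded)

print(len(label_slice), len(position_slice))
```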

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Bonus Library for Pandas

Time efficiency always matters, and Pandas Profiling helps with that. On a personal note, once you are comfortable with pandas, you should start using it. (Lazy learners, avoid this!!!)

Pandas-Profiling generates a report of your data that helps you understand it in depth. It gives your data a voice to narrate itself.

!pip install pandas-profiling

Generate a report

from pandas_profiling import ProfileReport
prof = ProfileReport(df)
prof.to_file(output_file='output.html')

But every coin has two sides: the shortcoming of pandas_profiling is that generating a report for a large dataset can take a long time.
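One common workaround, sketched here as an assumption rather than something from the workflow above, is to profile a random sample of the rows instead of the full dataset. The frame size and sample size below are arbitrary, and `big_df` is a made-up stand-in for a real dataset.

```python
import numpy as np
import pandas as pd

# Made-up "large" frame standing in for a real dataset.
rng = np.random.default_rng(0)
big_df = pd.DataFrame({
    "x": rng.random(100_000),
    "y": rng.integers(0, 10, size=100_000),
})

# Keep a reproducible random sample of 5,000 rows for the report.
sample_df = big_df.sample(n=5_000, random_state=0)

print(big_df.shape, sample_df.shape)
# sample_df can then be passed to ProfileReport in place of the full frame.
```

A sample gives only an approximate picture, so rare values may be missed. The pandas-profiling documentation also describes a `minimal=True` argument to `ProfileReport` that switches off the most expensive computations, which may be worth trying first.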
