Day 5: Indexing, Manipulation and Visualization, Part 3



We are now comfortable with the merge and sort functions, so let's look at data aggregation.

For today's exercises we'll use a new dataset, which you can download from here. [source: Analytics Vidhya]

Pandas offers several functions that are useful for aggregating data.

Some of them are:
  • groupby
  • crosstab
  • pivot_table
import pandas as pd
import numpy as np
# read the dataset
data_BM = pd.read_csv('bigmart_data.csv')
# drop the null values
data_BM = data_BM.dropna(how="any")
# reset index after dropping
data_BM = data_BM.reset_index(drop=True)
# view the top results
data_BM.head()

Let's find the mean price for each value in the 'Item_Type' column.
To do that, call the groupby function and pass 'Item_Type' as the parameter.


# group price based on item type
price_by_item = data_BM.groupby('Item_Type')

# display first few rows
price_by_item.first()

Now that the data is grouped by the Item_Type column, the next step is to calculate the mean with mean().

# mean price by item 
price_by_item.Item_MRP.mean()
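If you want to see what the grouping does without downloading the file, here is a minimal sketch on a tiny made-up table. The rows and prices are invented for illustration; only the column names Item_Type and Item_MRP come from the dataset above.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the BigMart dataset
toy = pd.DataFrame({
    "Item_Type": ["Dairy", "Dairy", "Snacks", "Snacks", "Snacks"],
    "Item_MRP": [100.0, 200.0, 30.0, 60.0, 90.0],
})

# Split the rows by Item_Type, then average Item_MRP inside each group
mean_price = toy.groupby("Item_Type")["Item_MRP"].mean()
print(mean_price)
```

The result is a Series indexed by item type, so mean_price["Dairy"] gives the average price of the dairy rows.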

As with other pandas functions, we can group on multiple columns by passing a list of column names in square brackets. The example uses only the first 10 rows (data_BM[:10]).

 
# group on multiple columns
multiple_groups = data_BM[:10].groupby(['Item_Type', 'Item_Fat_Content'])
multiple_groups.first()
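Grouping on two columns produces one group per pair of values. The sketch below shows this on invented rows (the values are assumptions, the column names are the ones used above) and how to read one pair back out of the result.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the BigMart dataset
toy = pd.DataFrame({
    "Item_Type": ["Dairy", "Dairy", "Dairy", "Snacks"],
    "Item_Fat_Content": ["Low Fat", "Regular", "Low Fat", "Regular"],
    "Item_MRP": [100.0, 200.0, 120.0, 50.0],
})

# One group for every (Item_Type, Item_Fat_Content) pair that occurs in the data
grouped = toy.groupby(["Item_Type", "Item_Fat_Content"])
mean_by_pair = grouped["Item_MRP"].mean()
print(mean_by_pair)

# The result has a two-level index, so a pair is looked up with a tuple
low_fat_dairy = mean_by_pair.loc[("Dairy", "Low Fat")]
print(low_fat_dairy)
```

Only the pairs that actually appear in the rows get a group here, which is why the toy table yields three groups and not four.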

If you want to know more about groupby(), nothing stops you from reading this documentation.

crosstab sits somewhere between visualization and aggregation. It does not draw a graph or chart, but the table of counts it produces lets us compare two factors at a glance. Passing margins=True adds row and column totals, labelled All.

 
# generate crosstab of Outlet_Size and Outlet_Location_Type
pd.crosstab(data_BM["Outlet_Size"],data_BM["Outlet_Location_Type"],margins=True)
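To make the counting concrete, here is a small sketch of the same call on five invented outlets. The rows are assumptions for illustration; the column names match the ones used above.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the BigMart dataset
toy = pd.DataFrame({
    "Outlet_Size": ["Small", "Small", "Medium", "High", "Medium"],
    "Outlet_Location_Type": ["Tier 1", "Tier 2", "Tier 3", "Tier 3", "Tier 1"],
})

# Count how many rows fall into each (size, location) combination;
# margins=True appends an "All" row and column holding the totals
counts = pd.crosstab(toy["Outlet_Size"], toy["Outlet_Location_Type"], margins=True)
print(counts)
```

Combinations that never occur, such as a High outlet in Tier 1 here, show up as 0 in the table.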

Interesting. Now let's find out how the average sales differ by outlet establishment year, using pivot tables.

Pandas pivot tables are richer than Excel's.
You can build one in a single line of code.
Apart from the data and the index, pivot_table falls back to its defaults unless you override them; the default aggregation function is the mean.
# create pivot table
pd.pivot_table(data_BM, index=['Outlet_Establishment_Year'], values= "Item_Outlet_Sales")

# create pivot table with several index levels
pd.pivot_table(data_BM, index=['Outlet_Establishment_Year', 'Outlet_Location_Type', 'Outlet_Size'], values= "Item_Outlet_Sales")

# apply several aggregation functions at once
# (string names avoid the warning newer pandas versions raise for np.mean and friends)
pd.pivot_table(data_BM, index=['Outlet_Establishment_Year', 'Outlet_Location_Type', 'Outlet_Size'], values= "Item_Outlet_Sales", aggfunc=['mean', 'median', 'min', 'max', 'std'])
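The sketch below repeats the idea on a handful of invented sales figures, so the numbers can be checked by hand. The values are assumptions; the column names are the ones used above. It passes the aggregation functions as strings, which is one way to write them.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the BigMart dataset
toy = pd.DataFrame({
    "Outlet_Establishment_Year": [1999, 1999, 2004, 2004, 2004],
    "Item_Outlet_Sales": [100.0, 300.0, 50.0, 150.0, 400.0],
})

# With no aggfunc given, pivot_table averages the values for each index entry
default_pivot = pd.pivot_table(
    toy, index="Outlet_Establishment_Year", values="Item_Outlet_Sales"
)
print(default_pivot)

# A list of functions gives one column per function
stats = pd.pivot_table(
    toy,
    index="Outlet_Establishment_Year",
    values="Item_Outlet_Sales",
    aggfunc=["mean", "median", "max"],
)
print(stats)
```

With a list of functions the columns become two-level, so a single cell is addressed as stats.loc[2004, ("median", "Item_Outlet_Sales")].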

That's it. We have now finished the manipulation and indexing sections.

Let's get into the 🐼 🐼 best and most used feature: visualization. Sorry, pandas! Matplotlib works regardless of pandas, so you can use it with any dataset, whichever library reads it.

We'll use two new libraries: matplotlib and seaborn.

I hope you have already installed matplotlib.
Seaborn is not difficult to install either (for example, with pip install seaborn).

Matplotlib
Matplotlib is chosen for its widespread use and high flexibility. Making plots that are easy to interpret is one of the most important needs in data analysis.
We'll go through the main bars, charts and plots available in matplotlib.

By the end of today's post we will have created these charts:
  • Line chart
  • Bar chart
  • Histogram
  • Box Plot
  • Violin Plot
  • Scatter Plot
  • Bubble Plot
Import matplotlib's pyplot module under a short alias.
import matplotlib.pyplot as plt
# The line below renders plots inside the notebook; without it they may open in a separate window.
%matplotlib inline

Here are the first few lines of code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

# Create two lists
height =[150, 160, 165, 185]
weight = [70, 80, 90, 100]
# Draw the plot
plt.plot(height,weight)

We passed two lists to plot(): the first is drawn along the x axis and the second along the y axis.
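One way to confirm which list went to which axis is to inspect the line object that plot() returns. This is a small sketch, not part of the original walkthrough; the Agg backend line is an assumption added so that it also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed inside a notebook
import matplotlib.pyplot as plt

height = [150, 160, 165, 185]
weight = [70, 80, 90, 100]

plt.figure()  # start from a fresh figure
lines = plt.plot(height, weight)  # plot() returns a list of Line2D objects

# Read back the data stored for each axis of the first (and only) line
x_values = [int(v) for v in lines[0].get_xdata()]
y_values = [int(v) for v in lines[0].get_ydata()]
print("x axis:", x_values)
print("y axis:", y_values)
```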

Isn't that interesting? We can make the chart more informative by adding a title, axis labels and a legend.
plt.title("Relationship between height and weight")
# Label for x axis
plt.xlabel("Height")
# label for y axis
plt.ylabel("Weight")
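The snippet above sets the title and the axis labels. A legend needs one more step: give the line a label and call plt.legend(). Here is a minimal sketch that puts the pieces together; the label text is made up for illustration, and the Agg backend line is an assumption so that the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, not needed inside a notebook
import matplotlib.pyplot as plt

height = [150, 160, 165, 185]
weight = [70, 80, 90, 100]

plt.figure()  # start from a fresh figure
# The label is what the legend will display for this line (hypothetical text)
plt.plot(height, weight, label="weight by height")
plt.title("Relationship between height and weight")
plt.xlabel("Height")
plt.ylabel("Weight")
legend = plt.legend()  # builds the legend from the labelled lines

ax = plt.gca()  # the axes that the calls above drew on
print(ax.get_title())
```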

Time doesn't permit more right now; the remaining charts will be added by 5 pm today.
