Indexing, Manipulation and Visualization - Machine Learning Day 3

Day 3 Indexing, Manipulation and Visualization of data - Part 1

Indexing means accessing specific rows, columns or values of a data frame in Python. It is done with square brackets and works much like string slicing in Python.

No special dataset is needed for this. We will keep using the dataset that was provided on day 2.

import pandas as pd
df = pd.read_csv("data.csv")
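# print the values of a single column, 'Fare'
print(df['Fare'])
# print the values of multiple columns, 'Name' and 'Cabin'
print(df[['Name', 'Cabin']])
# iloc is integer-position based: the first argument selects rows, the second selects columns
# print rows 2 to 4 (the end of an iloc slice is excluded)
print(df.iloc[2:5])
# print rows 0 to 4
print(df.iloc[0:5])
# print rows 0 to 4 for all the columns
print(df.iloc[0:5, :])
# print the rows whose value in a particular column matches a condition
# (the threshold 50 is only an illustration)
print(df[df['Fare'] > 50])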
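The indexing rules described in the comments above can be tried on a small made-up frame. This is only a sketch: the column names mirror the ones used in this post, but the values in `toy` are invented. It also shows one detail that is easy to miss: an `iloc` slice excludes its end position, while a `loc` slice includes its end label.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the tutorial.
toy = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cara", "Dev", "Eli", "Fay"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 8.46],
    "Cabin": ["C1", "C85", "C2", "C123", "E4", "E5"],
})

by_position = toy.iloc[2:4]        # positions 2 and 3 (4 is excluded)
by_label = toy.loc[2:4]            # labels 2, 3 and 4 (4 is included)
block = toy.iloc[0:3, 0:2]         # rows 0-2 of the first two columns
one_value = toy.iloc[1, 1]         # row 1, column 1: Bob's fare
filtered = toy[toy["Fare"] > 50]   # rows where the condition is True

print(by_position)
print(by_label)
print(filtered)
```

With the default 0, 1, 2, ... row index the positions and labels look the same, which is why the two slices are easy to confuse.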
Change the index values and explore the different ways to filter the data.
Try these:
1. Access the columns up to column 4
2. Access the values of rows 1 to 7 for columns 3, 4 and 5
3. Access the values of any one column using indexing

Data Manipulation 

Data manipulation, in simple words, means organizing the data into the required structure. We are going to use some new functions to manipulate the data, covering operations such as sorting and merging. We will also look at functions that are used for finding missing data.

First, go through the code below and read the comments for the explanation.

dropna(), isna(), notna() and fillna() are the functions used for detecting and handling missing values in rows or columns.
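As a quick illustration of these four functions, here is a minimal sketch on a made-up frame. The column names and values are assumptions chosen for the example, and the fill values (the column mean and the text "Unknown") are just one possible choice.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 28.0],
    "Cabin": ["C85", "B20", None, "E46"],
})

missing_mask = toy.isna()                # True where a value is missing
present_mask = toy.notna()               # the exact opposite of isna()
missing_per_column = toy.isna().sum()    # number of missing values per column

# Replace missing values column by column.
filled = toy.fillna({"Age": toy["Age"].mean(), "Cabin": "Unknown"})

# Remove every row that still contains a missing value.
dropped = toy.dropna()

print(missing_per_column)
print(filled)
print(dropped)
```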

Here we use dropna(), which drops the rows (or columns) that contain null values. It takes several parameters, such as axis, how, thresh and subset.

The code below passes how="any", which drops a row if any of its values is missing; how="all" would drop a row only when all of its values are missing. Compare this with the image above.
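The effect of the `how` parameter can be seen on a small made-up frame. This is a sketch with invented column names `A` and `B`; `subset` is shown as well because it is often useful to look at only some of the columns.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: row 1 is partly missing, row 2 is entirely missing.
toy = pd.DataFrame({
    "A": [1.0, np.nan, np.nan, 4.0],
    "B": [10.0, 20.0, np.nan, 40.0],
})

any_dropped = toy.dropna(how="any")         # drops rows 1 and 2
all_dropped = toy.dropna(how="all")         # drops only row 2
subset_dropped = toy.dropna(subset=["B"])   # only looks at column B

print(any_dropped)
print(all_dropped)
print(subset_dropped)
```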

Sorting in pandas uses two main functions:
  • sort_values()
  • sort_index()

# Sorting a single column

import pandas as pd

import numpy as np

data_file = pd.read_csv("data.csv")

# drop the null values

data_file = data_file.dropna(how="any")

# view the top results

print(data_file.head())

# sort by the 'Fare' column (ascending by default)

sorted_data = data_file.sort_values(by='Fare')

# print sorted data

print(sorted_data.head())

# sort in place and in descending order
# (this assumes the dataset has an 'Outlet_Establishment_Year' column)

data_file.sort_values(by='Outlet_Establishment_Year', ascending=False, inplace=True)


# inplace=True updates the original data frame itself; by default inplace is False and a sorted copy is returned

# Sorting by multiple columns

# data_file was sorted in place above (inplace=True), so we read the csv file again

data_file = pd.read_csv('data.csv')

data_file.sort_values(by=['Outlet_Establishment_Year', 'Item_Outlet_Sales'], ascending=False)[:5]

# change the order of the columns: the first column in the list is the primary sort key

data_file.sort_values(by=['Item_Outlet_Sales', 'Outlet_Establishment_Year'], ascending=False, inplace=True)

data_file[:5]

# compare this output with the previous one to see the difference

# Sort using the row index

# sort by index

data_file.sort_index(inplace=True)
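To see the sorting behaviour without the csv file, here is a minimal sketch on a made-up frame. The columns `Year` and `Sales` and their values are assumptions for illustration; they stand in for the year and sales columns sorted above.

```python
import pandas as pd

# Hypothetical data; the row index is the default 0, 1, 2, 3.
toy = pd.DataFrame({
    "Year": [1999, 2009, 1999, 2004],
    "Sales": [250.0, 120.0, 400.0, 180.0],
})

# Without inplace=True a sorted copy is returned and toy is left unchanged.
by_sales = toy.sort_values(by="Sales")

# The first column in the list is the primary key, the second breaks ties.
year_first = toy.sort_values(by=["Year", "Sales"], ascending=False)
sales_first = toy.sort_values(by=["Sales", "Year"], ascending=False)

# sort_index() puts the rows back in the order of their index labels.
restored = sales_first.sort_index()

print(year_first)
print(sales_first)
print(restored)
```

The two multi-column results differ only because the order of the column names in `by` changed.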



Merging Data frames

concat() and merge() are the two most useful functions for combining data frames.
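merge() is not used in the examples that follow, so here is a minimal sketch of how it differs from concat(). The two frames and the key column `PassengerId` are made up for illustration: merge() matches rows on a shared key column, whereas concat() simply stacks the frames.

```python
import pandas as pd

# Hypothetical frames that share the key column 'PassengerId'.
passengers = pd.DataFrame({"PassengerId": [1, 2, 3],
                           "Name": ["Ann", "Bob", "Cara"]})
fares = pd.DataFrame({"PassengerId": [1, 2, 4],
                      "Fare": [7.25, 71.28, 8.05]})

# Inner join (the default): only keys present in both frames are kept.
inner = pd.merge(passengers, fares, on="PassengerId")

# Left join: every passenger is kept; a missing fare becomes NaN.
left = pd.merge(passengers, fares, on="PassengerId", how="left")

print(inner)
print(left)
```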

Let us create three data frames ourselves and combine them.

Below is the syntax for creating a data frame; we use it to create three data frames.

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3'],
                     'C': ['C0', 'C1', 'C2', 'C3'],
                     'D': ['D0', 'D1', 'D2', 'D3']},
                    index=[0, 1, 2, 3])

df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                     'B': ['B4', 'B5', 'B6', 'B7'],
                     'C': ['C4', 'C5', 'C6', 'C7'],
                     'D': ['D4', 'D5', 'D6', 'D7']},
                    index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'],
                     'B': ['B8', 'B9', 'B10', 'B11'],
                     'C': ['C8', 'C9', 'C10', 'C11'],
                     'D': ['D8', 'D9', 'D10', 'D11']},
                    index=[8, 9, 10, 11])

Now let's combine the data into a single data frame using concat().

# combine dataframes

result = pd.concat([df1, df2, df3])

# result is the combined data frame

As with dictionaries, we can attach a key (label) to each data frame that we combine,
for instance:

# combine dataframes with labels
result = pd.concat([df1, df2, df3], keys=['A', 'B', 'C'])

Now the labels appear on the left-hand side of the data frame.
Using the loc attribute, we can easily access the rows that belong to a given key.

# This code would print the values linked with the key 'A'

result.loc[ 'A' ]
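The same idea can be checked on two very small frames. This is a sketch with made-up frames `first` and `second` and arbitrary keys `x` and `y`; it shows that keys add an outer index level, and that `ignore_index=True` is an alternative when you just want a fresh 0, 1, 2, ... index.

```python
import pandas as pd

# Hypothetical frames, smaller versions of df1 and df2 above.
first = pd.DataFrame({"A": ["A0", "A1"], "B": ["B0", "B1"]}, index=[0, 1])
second = pd.DataFrame({"A": ["A2", "A3"], "B": ["B2", "B3"]}, index=[2, 3])

# keys adds an outer index level, so the result has a two-level index.
labelled = pd.concat([first, second], keys=["x", "y"])

# loc with a key returns only the rows that came from that frame.
from_second = labelled.loc["y"]

# ignore_index=True discards the old labels and numbers the rows afresh.
renumbered = pd.concat([first, second], ignore_index=True)

print(labelled)
print(from_second)
print(renumbered)
```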

