Introducing Pandas - Machine Learning Day 2

Introduction to Pandas

Pandas is a powerful Python data analysis library for reading, filtering, manipulating, visualizing, and exporting data.

Pandas is widely used by data scientists and IT professionals to analyze data.

Pandas has alternatives, but we use pandas here because it offers a broad set of features for working with tabular data.

It has a large base of contributors and strong community support, and anyone can use it because it is an open-source library. It is built on top of NumPy, the Python package for numerical arrays.
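To make the link between pandas and NumPy concrete, here is a minimal sketch. The column names and values are made up for illustration; the point is only that the data in a DataFrame can be handed to NumPy directly.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "height_cm": [150.0, 160.0, 170.0],
    "weight_kg": [50.0, 60.0, 70.0],
})

# The values of a DataFrame can be exposed as a NumPy array
values = df.to_numpy()
print(type(values))   # <class 'numpy.ndarray'>
print(values.shape)   # (3, 2)

# NumPy functions also accept pandas columns directly
mean_height = np.mean(df["height_cm"])
print(mean_height)    # 160.0
```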

Pandas can read data in many formats, including CSV and JSON files.

Filtering, selecting, and manipulating data are all straightforward.
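As a rough sketch of what filtering, selecting, and manipulating look like in practice, the example below works on a small hand-made table. The names, ages, and cities are invented for the example and do not come from the tutorial's dataset.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({
    "name": ["Asha", "Ben", "Chen", "Dina"],
    "age": [23, 35, 41, 29],
    "city": ["Pune", "Leeds", "Wuhan", "Cairo"],
})

# Filtering: keep only the rows where age is greater than 30
over_30 = df[df["age"] > 30]
print(over_30["name"].tolist())   # ['Ben', 'Chen']

# Selecting: keep only two of the columns
names_and_ages = df[["name", "age"]]
print(names_and_ages.columns.tolist())   # ['name', 'age']

# Manipulating: add a new column derived from an existing one
df["age_in_months"] = df["age"] * 12
print(df["age_in_months"].tolist())   # [276, 420, 492, 348]
```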

The table below lists the file types pandas can read and write, along with the matching reader and writer functions.

Format Type   Data Description        Reader           Writer
text          CSV                     read_csv         to_csv
text          Fixed-Width Text File   read_fwf         -
text          JSON                    read_json        to_json
text          HTML                    read_html        to_html
text          Local clipboard         read_clipboard   to_clipboard
binary        MS Excel                read_excel       to_excel
binary        OpenDocument            read_excel       -
binary        HDF5 Format             read_hdf         to_hdf
binary        Feather Format          read_feather     to_feather
binary        Parquet Format          read_parquet     to_parquet
binary        ORC Format              read_orc         -
binary        Msgpack                 read_msgpack     to_msgpack
binary        Stata                   read_stata       to_stata
binary        SAS                     read_sas         -
binary        SPSS                    read_spss        -
binary        Python Pickle Format    read_pickle      to_pickle
SQL           SQL                     read_sql         to_sql
SQL           Google BigQuery         read_gbq         to_gbq

Note: read_msgpack and to_msgpack were removed in pandas 1.0, so the Msgpack row applies only to older versions.
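The reader and writer functions come in pairs, so data written with a to_* method can usually be read back with the matching read_* function. The sketch below assumes a tiny made-up table and uses in-memory text (io.StringIO) instead of real files, so it runs without any dataset on disk.

```python
import io

import pandas as pd

# Hypothetical sample data, used only for illustration
df = pd.DataFrame({"id": [1, 2, 3], "score": [9.5, 7.0, 8.25]})

# Writer: with no file path, to_csv returns the CSV text as a string
csv_text = df.to_csv(index=False)

# Reader: read_csv accepts a file path or any file-like object
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv.equals(df))   # True

# The same round trip with the JSON reader/writer pair
json_text = df.to_json(orient="records")
df_from_json = pd.read_json(io.StringIO(json_text))
print(df_from_json["id"].tolist())
```

With real files, the calls look the same except that a file name such as "data.csv" is passed in place of the in-memory object.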

We will mostly work with CSV and Excel files.

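Step 0: Installing the Excel dependency

Reading Excel files requires an additional dependency, xlrd.

In a command prompt window, type:
pip install xlrd==1.2.0

Note that xlrd 1.2.0 is the last xlrd release that can read .xlsx files. Recent pandas versions read .xlsx files through openpyxl instead (pip install openpyxl).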

Now download the dataset files from here. [Source: Analytics Vidhya]

Step 1: Reading datasets with Pandas

     1. Open Jupyter Notebook by typing jupyter notebook in a command prompt window.
     2. Upload the dataset files to Jupyter Notebook.
     3. Create a new Python 3 notebook and run each of the following commands in a separate cell.

# Import the pandas library under the conventional alias pd
import pandas as pd

# Read the CSV dataset into a DataFrame named df
df = pd.read_csv("data.csv")

# head() returns the first 5 rows by default
df.head()

# Read the Excel dataset into a second DataFrame named df1
df1 = pd.read_excel("data.xlsx")

A data frame is a two-dimensional, table-like data structure with labeled rows and columns. The pandas DataFrame and SFrame are both implementations of this idea, and each has its own set of functions.
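To show what that structure looks like, here is a small sketch that builds a pandas DataFrame by hand. The product names, prices, and row labels are invented for the example.

```python
import pandas as pd

# Hypothetical sample data: each dictionary key becomes a column
df = pd.DataFrame(
    {"product": ["pen", "book", "bag"], "price": [1.5, 12.0, 30.0]},
    index=["a", "b", "c"],  # row labels
)

print(df.index.tolist())     # ['a', 'b', 'c']
print(df.columns.tolist())   # ['product', 'price']

# Each column of a DataFrame is a pandas Series
price = df["price"]
print(type(price).__name__)  # Series

# A single cell is addressed by its row label and column name
print(df.loc["b", "price"])  # 12.0
```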

Data frames support many operations. Some of them are:
  • df.shape gives the dimensions as (rows, columns); it is an attribute, so it takes no parentheses
  • df.head() returns the first rows of the data frame (5 by default)
  • df.tail() returns the last rows of the data frame (5 by default)
  • df.columns lists all column names
  • df["column_name"] accesses the data in a single column
  • df[["column1", "column2"]] accesses the data in multiple columns; note the double brackets

Try these operations to explore and filter the data in different ways.
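The operations listed above can be tried on a small stand-in table before running them on the real dataset. This sketch assumes three made-up columns named column1, column2, and column3. Note that shape is read as an attribute (no parentheses) and that selecting several columns takes a list inside the brackets.

```python
import pandas as pd

# Hypothetical stand-in data with 10 rows and 3 columns
df = pd.DataFrame({
    "column1": list(range(1, 11)),
    "column2": [x * x for x in range(1, 11)],
    "column3": list("abcdefghij"),
})

# Dimensions as (rows, columns)
print(df.shape)              # (10, 3)

# First 5 rows (the default) and last 3 rows
top = df.head()
bottom = df.tail(3)
print(len(top), len(bottom)) # 5 3

# All column names
print(df.columns.tolist())   # ['column1', 'column2', 'column3']

# One column (a Series) and several columns (a DataFrame)
one = df["column1"]
two = df[["column1", "column2"]]
print(two.shape)             # (10, 2)
```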

