An Introduction to Pandas

5.1. An Introduction to Pandas#

Pandas is an open-source Python library that offers robust data structures and tools for data analysis. It is designed to simplify and streamline data manipulation and analysis tasks. The name “Pandas” is derived from “Panel Data,” referring to multi-dimensional structured datasets commonly used in econometrics and finance [Pandas Developers, 2023].

Pandas provides a diverse set of functions and methods for data handling, including reading and writing data from various file formats like CSV, Excel, and SQL databases. It facilitates data cleaning, transformation, aggregation, and addressing missing values. Additionally, Pandas supports merging and joining datasets and offers functionality for time series operations [Pandas Developers, 2023].

Widely adopted in data science, machine learning, finance, economics, and numerous other fields, Pandas is favored for its ability to simplify and expedite data processing and analysis workflows. Its clear and intuitive syntax makes it accessible to both novices and experienced data scientists [Pandas Developers, 2023].

>>> pip install pandas

5.1.1. Data Structures:#

Series: The Series is akin to a one-dimensional labeled array, similar to a NumPy array but with an associated index. This index provides meaningful labels for each element in the Series, allowing for effortless data alignment and retrieval [Pandas Developers, 2023].
DataFrame: The DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet or SQL table. It consists of rows and columns, where each column can accommodate various data types. DataFrames provide a versatile and potent method for working with structured data, enabling operations such as filtering, joining, grouping, and more [Pandas Developers, 2023].

Example - Series: A Pandas Series object can be instantiated through the implementation of the pd.Series constructor.

import pandas as pd

# Create a Pandas Series with custom index
data = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])

# Print the Pandas Series
print("Pandas Series:")
print(data)

Pandas Series:
A    10
B    20
C    30
D    40
dtype: int64

Example - DataFrame: A Pandas DataFrame object can be created by utilizing the pd.DataFrame constructor.

import pandas as pd

# Create a DataFrame from a dictionary
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22],
        'City': ['Calgary', 'Edmonton', 'Red Deer']
        }

df = pd.DataFrame(data)

# Display the DataFrame
print("DataFrame:")
display(df)

DataFrame:

	Name	Age	City
0	Alice	25	Calgary
1	Bob	30	Edmonton
2	Charlie	22	Red Deer

5.1.2. Baisc Pandas DataFrame Commands#

Understanding basic Pandas commands is essential for efficient data manipulation and analysis. Below is a summary of common DataFrame commands and their descriptions:

pd.DataFrame(data): Creates a DataFrame from data such as a dictionary, array, or list.
data.info(): Displays basic information about the DataFrame, including data types and non-null counts.
data.head(n): Displays the first n rows of the DataFrame (default is 5).
data.tail(n): Displays the last n rows of the DataFrame (default is 5).
data.describe(): Displays summary statistics of numerical columns (count, mean, std, min, max, quartiles).
data.shape: Returns the number of rows and columns in the DataFrame as a tuple.
data.columns: Accesses the column labels of the DataFrame.

Table 5.1 Summary of Common Pandas DataFrame Commands#
Command	Description
`pd.DataFrame(data)`	Create a DataFrame from data like a dictionary, array, or list.
`data.info()`	Display basic information about the DataFrame, including data types and non-null counts.
`data.head(n)`	Display the first n rows of the DataFrame (default is 5).
`data.tail(n)`	Display the last n rows of the DataFrame (default is 5).
`data.describe()`	Display summary statistics of numerical columns (count, mean, std, min, max, quartiles).
`data.shape`	Returns the number of rows and columns in the DataFrame as a tuple.
`data.columns`	Access the column labels of the DataFrame.

Example:

import pandas as pd
import numpy as np
 
# Create the DataFrame
data = pd.DataFrame({'A': np.arange(0, 100),
                     'B': np.arange(1000, 900, -1)})

# Display basic DataFrame information
print("Displaying DataFrame Information:")
display(data)

# Get DataFrame information using data.info()
print("\nDataFrame Info:")
data.info()

# Display the first 10 rows
n = 10
print(f"\nDisplaying First {n} Rows:")
display(data.head(n))

# Display the last 10 rows
print(f"\nDisplaying Last {n} Rows:")
display(data.tail(n))

# Display summary statistics
print("\nSummary Statistics:")
display(data.describe())

# Get the shape of the DataFrame
print("\nDataFrame Shape:")
print("Number of rows and columns:", data.shape)

# Get column labels
print("\nColumn Labels:")
print("Column labels:", data.columns)

Displaying DataFrame Information:

	A	B
0	0	1000
1	1	999
2	2	998
3	3	997
4	4	996
...	...	...
95	95	905
96	96	904
97	97	903
98	98	902
99	99	901

100 rows × 2 columns

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       100 non-null    int32
 1   B       100 non-null    int32
dtypes: int32(2)
memory usage: 932.0 bytes

Displaying First 10 Rows:

	A	B
0	0	1000
1	1	999
2	2	998
3	3	997
4	4	996
5	5	995
6	6	994
7	7	993
8	8	992
9	9	991

Displaying Last 10 Rows:

	A	B
90	90	910
91	91	909
92	92	908
93	93	907
94	94	906
95	95	905
96	96	904
97	97	903
98	98	902
99	99	901

Summary Statistics:

	A	B
count	100.000000	100.000000
mean	49.500000	950.500000
std	29.011492	29.011492
min	0.000000	901.000000
25%	24.750000	925.750000
50%	49.500000	950.500000
75%	74.250000	975.250000
max	99.000000	1000.000000

DataFrame Shape:
Number of rows and columns: (100, 2)

Column Labels:
Column labels: Index(['A', 'B'], dtype='object')

Note

By default, the head() and tail() functions display the first or last 5 rows of the DataFrame. However, you can specify a different number of rows to display by passing an argument to these functions. You can change the number of rows displayed by passing an integer argument to the head() and tail() functions. For example, data.head(10) would display the first 10 rows of the DataFrame.

These are just a few basic commands in Pandas. The library offers a wide range of functions for data manipulation, exploration, and analysis. You can refer to the official Pandas documentation for more details and examples: Pandas Documentation.

5.1.3. Pandas Inputs/Outputs#

Pandas is a powerful Python library that provides data manipulation and analysis tools. It can read data from various file formats like CSV, Excel, SQL databases, and more [Pandas Developers, 2023].

5.1.3.1. Pickling#

Pickling is a way to serialize (convert to a byte stream) and deserialize (convert back to an object) Python objects. This is useful for saving complex data structures like DataFrames. Serialization allows you to save the state of an object to a file, which can be loaded back into memory later, preserving the object’s state.

Why Use Pickling?

Efficiency: Pickling is faster than other serialization formats like CSV or JSON because it directly converts Python objects into byte streams.
Convenience: It preserves the data types and structure of the DataFrame, making it easy to reload without additional parsing or type conversion.

Example:

import pandas as pd

# Create a DataFrame about Calgary temperatures
df = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar'], 'Temperature': [-1, 0, 5]})

# Save to a pickle file
df.to_pickle('calgary_temps.pkl')

# Load from a pickle file
df_loaded = pd.read_pickle('calgary_temps.pkl')
display(df_loaded)

	Month	Temperature
0	Jan	-1
1	Feb	0
2	Mar	5

In this example:

Creating a DataFrame: We create a simple DataFrame with months and corresponding temperatures.
Saving to a Pickle File: The to_pickle method serializes the DataFrame and saves it to a file named ‘calgary_temps.pkl’.
Loading from a Pickle File: The pd.read_pickle method deserializes the byte stream from the file back into a DataFrame.

5.1.3.2. Flat File#

Flat files are simple text files that store data in a plain text format. Common formats include CSV and TSV. These files are easy to read and write, making them a popular choice for data storage and transfer [Pandas Developers, 2023].

Why Use Flat Files?

Simplicity: Flat files are straightforward to create and manage.
Compatibility: They can be easily shared and opened with various software, including text editors and spreadsheet programs.
Efficiency: For small to medium-sized datasets, flat files are efficient and quick to process.

Example:

The data here is sourced from citypopulation.de.

import pandas as pd

# Create a DataFrame about Alberta population
df = pd.DataFrame({'City': ['Calgary', 'Edmonton', 'Red Deer'], 'Population': [1481466, 1128811, 109234]})

# Save to a CSV file
df.to_csv('alberta_population.csv', index=False)

# Load from a CSV file
df_loaded = pd.read_csv('alberta_population.csv')
display(df)

	City	Population
0	Calgary	1481466
1	Edmonton	1128811
2	Red Deer	109234

In this example:

Creating a DataFrame: We create a DataFrame with cities in Alberta and their populations.
Saving to a CSV File: The to_csv method saves the DataFrame to a file named ‘alberta_population.csv’.

Additional Options

When reading or writing data, Pandas offers various options to customize the behavior:

Reading CSV Files: You can specify delimiter, encoding, header presence, and more using read_csv().
Writing CSV Files: You can control the delimiter, whether to include headers, and more using the to_csv() method’s parameters.
Excel Files: For Excel files, you can specify the sheet name using the sheet_name parameter with to_excel() and read_excel().

For more details, refer to the Pandas IO documentation.

Keep in mind that Pandas is just one tool in the data manipulation and analysis ecosystem. Depending on your needs, other libraries like NumPy, Matplotlib/Seaborn for visualization, and scikit-learn for machine learning might be useful as well.

Example:

import pandas as pd

Link = 'https://download.microsoft.com/download/4/C/8/4C830C0C-101F-4BF2-8FCB-32D9A8BA906A/Import_User_Sample_en.csv'
# Read a CSV file into a DataFrame
data = pd.read_csv(Link)

# Display the DataFrame
print("The DataFrame:")
display(data)

The DataFrame:

	User Name	First Name	Last Name	Display Name	Job Title	Department	Office Number	Office Phone	Mobile Phone	Fax	Address	City	State or Province	ZIP or Postal Code	Country or Region
0	chris@contoso.com	Chris	Green	Chris Green	IT Manager	Information Technology	123451	123-555-1211	123-555-6641	123-555-9821	1 Microsoft way	Redmond	Wa	98052	United States
1	ben@contoso.com	Ben	Andrews	Ben Andrews	IT Manager	Information Technology	123452	123-555-1212	123-555-6642	123-555-9822	1 Microsoft way	Redmond	Wa	98052	United States
2	david@contoso.com	David	Longmuir	David Longmuir	IT Manager	Information Technology	123453	123-555-1213	123-555-6643	123-555-9823	1 Microsoft way	Redmond	Wa	98052	United States
3	cynthia@contoso.com	Cynthia	Carey	Cynthia Carey	IT Manager	Information Technology	123454	123-555-1214	123-555-6644	123-555-9824	1 Microsoft way	Redmond	Wa	98052	United States
4	melissa@contoso.com	Melissa	MacBeth	Melissa MacBeth	IT Manager	Information Technology	123455	123-555-1215	123-555-6645	123-555-9825	1 Microsoft way	Redmond	Wa	98052	United States

In this example:

Reading a CSV File: We read a CSV file from a URL into a DataFrame using the pd.read_csv method.
Displaying the DataFrame: The display function is used to show the DataFrame.

5.1.3.3. Clipboard#

You can copy data to and from the system clipboard using pandas. This is useful for quickly transferring data between applications [Pandas Developers, 2023].

Why Use Clipboard Operations?

Convenience: Easily transfer data between different applications without saving to a file.
Speed: Quickly copy and paste data for immediate use in other programs like spreadsheets or text editors.

Example:

import pandas as pd

# Create a DataFrame about Canadian provinces
df = pd.DataFrame({'Province': ['Alberta', 'British Columbia', 'Ontario'], 'Capital': ['Edmonton', 'Victoria', 'Toronto']})

# Copy to clipboard
df.to_clipboard(index=False)
display(df)

	Province	Capital
0	Alberta	Edmonton
1	British Columbia	Victoria
2	Ontario	Toronto

In this example:

Creating a DataFrame: We create a DataFrame with Canadian provinces and their capitals.
Copying to Clipboard: The to_clipboard method copies the DataFrame to the system clipboard, making it available for pasting into other applications.

5.1.3.4. Excel#

Pandas can read from and write to Excel files using the read_excel and to_excel functions. This functionality is particularly useful for data analysis and reporting, as Excel is a widely used format for data storage and exchange [Pandas Developers, 2023].

Why Use Excel with Pandas?

Versatility: Excel files can store multiple sheets, making it easy to organize and manage different datasets within a single file.
Compatibility: Excel is a common format in many industries, making it easy to share data with others who may not use Python.
Functionality: Pandas provides extensive options for reading and writing Excel files, allowing for customization and flexibility.

Example:

import pandas as pd

# Create a DataFrame about Calgary sports teams
df = pd.DataFrame({'Team': ['Flames', 'Stampeders', 'Roughnecks'], 'Sport': ['Hockey', 'Football', 'Lacrosse']})

# Save to an Excel file
df.to_excel('calgary_sports_teams.xlsx', index=False)
display(df)

	Team	Sport
0	Flames	Hockey
1	Stampeders	Football
2	Roughnecks	Lacrosse

In this example:

Creating a DataFrame: We create a DataFrame with Calgary sports teams and their respective sports.
Saving to an Excel File: The to_excel method saves the DataFrame to a file named ‘calgary_sports_teams.xlsx’.

Example:

import pandas as pd

link_to_file = r'https://open.alberta.ca/dataset/c45f2f78-0d6c-4a7e-98bc-313fbb232040/resource/1b346528-5188-47ee-8fef-b3d41e3db7c0/download/population-estimates-ab-quarterly-1951-to-current.xlsx'
# Load from an Excel file
df_loaded = pd.read_excel(link_to_file, skiprows=2, nrows=74)
display(df_loaded)

	Year	January 1	April 1	July 1	October 1
0	1951	--	--	943000.0	950000.0
1	1952	959000	965000	977000.0	985000.0
2	1953	994000	1002000	1016000.0	1029000.0
3	1954	1039000	1047000	1061000.0	1070000.0
4	1955	1078000	1084000	1094000.0	1102000.0
...	...	...	...	...	...
69	2020	4392958	4405455	4407495.0	4412013.0
70	2021	4418338	4425617	4431531.0	4450442.0
71	2022	4465537	4480956	4510891.0	4561350.0
72	2023	4598444	4645229	4695290.0	4756408.0
73	2024	4800768	4849906	NaN	NaN

74 rows × 5 columns

In this example:

Reading an Excel File: We read data from an Excel file located at a URL into a DataFrame using the pd.read_excel method.
Customizing the Read Operation: The skiprows parameter skips the first two rows, and the nrows parameter limits the read to 74 rows.

For more details, you can refer to the Pandas IO documentation and the specific methods read_excel and to_excel.

5.1.3.5. JSON#

JSON (JavaScript Object Notation) is a lightweight data interchange format. Pandas can read from and write to JSON files. JSON is commonly used for transmitting data in web applications [Pandas Developers, 2023].

Why Use JSON?

Human-Readable: JSON is easy to read and write for humans.
Interoperable: JSON is language-independent, making it a great choice for data interchange between different systems.

Example: The numbers are fictional.

import pandas as pd

# Create a DataFrame about Alberta parks
df = pd.DataFrame({'Park': ['Banff', 'Jasper', 'Elk Island'], 'Visitors': [4000000, 2000000, 500000]})

# Save to a JSON file
df.to_json('alberta_parks.json')

# Load from a JSON file
df_loaded = pd.read_json('alberta_parks.json')
display(df_loaded)

	Park	Visitors
0	Banff	4000000
1	Jasper	2000000
2	Elk Island	500000

In this example:

Creating a DataFrame: We create a DataFrame with parks in Alberta and their visitor numbers.
Saving to a JSON File: The to_json method saves the DataFrame to a file named ‘alberta_parks.json’.
Loading from a JSON File: The pd.read_json method reads the data from the JSON file back into a DataFrame.

For more details, you can refer to the Pandas JSON documentation.

5.1.3.6. HTML#

Pandas can read tables from HTML files and write DataFrames to HTML. This is useful for web scraping and exporting data to be displayed on web pages.

Why Use HTML?

Web Integration: HTML is the standard language for creating web pages, making it easy to integrate data with web content.
Versatility: You can extract data from any HTML table on the web.

Example:

# Import the Pandas library and alias it as 'pd'
import pandas as pd

# Use pd.read_html to extract tables from the specified URL
# [0] selects the first table on the webpage since read_html returns a list of DataFrames
Grades_UofC = pd.read_html('https://conted.ucalgary.ca/info/grades.jsp')[0]

# Display the DataFrame containing grades information from the University of Calgary
display(Grades_UofC)

Grades_UofC.to_html('grades_uofc.html')

	0	1	2
0	A+	95 – 100%	Outstanding
1	A	90 – 94%	Excellent Superior performance, showing compre...
2	A-	85 – 89%	Approaching Excellent
3	B+	80 – 84%	Exceeding Good
4	B	75 – 79%	Good Clearly above average performance with kn...
5	B-	70 – 74%	Approaching Good
6	C+	67 – 69%	Exceeding Satisfactory
7	C	64 – 66%	Satisfactory (minimal pass) Basic understandin...
8	C-	60 – 63%	Approaching Satisfactory Receipt of a C- or le...
9	D+	55 – 59%	Marginal Performance
10	D	50 – 54%	Minimal Performance
11	F	0 – 49%	Fail
12	AU	NaN	Course Audit No course credit. Permission to A...
13	CR	NaN	Completed Requirements
14	NC	NaN	Not Complete The NC grade is assigned to stude...
15	AT	NaN	Attended
16	NS	NaN	Not Subject to Grading

In this example:

Reading HTML Tables: The pd.read_html method extracts tables from a URL into a list of DataFrames.
Displaying the DataFrame: The display function shows the DataFrame.
Saving to an HTML File: The to_html method saves the DataFrame to an HTML file.

For more details, you can refer to the Pandas HTML documentation.

5.1.3.7. XML#

XML (eXtensible Markup Language) is a markup language that defines a set of rules for encoding documents. Pandas can read from and write to XML files, which are commonly used for data interchange and storage.

Why Use XML?

Structured Data: XML is designed to store and transport data, with a focus on what data is.
Flexibility: XML can represent complex data structures.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Save to an XML file
df.to_xml('data.xml')

# Load from an XML file
df_loaded = pd.read_xml('data.xml')
display(df_loaded)

	index	A	B
0	0	1	4
1	1	2	5
2	2	3	6

In this example:

Creating a DataFrame: We create a simple DataFrame.
Saving to an XML File: The to_xml method saves the DataFrame to a file named ‘data.xml’.
Loading from an XML File: The pd.read_xml method reads the data from the XML file back into a DataFrame.

For more details, you can refer to the Pandas XML documentation.

5.1.3.8. LaTeX#

LaTeX is a typesetting system commonly used for technical and scientific documentation. Pandas can convert DataFrames to LaTeX format, which is useful for including tables in LaTeX documents.

Why Use LaTeX?

Professional Formatting: LaTeX produces high-quality typeset documents.
Scientific Use: Widely used in academia for papers, theses, and books.

Example:

import pandas as pd

# Create a DataFrame about Alberta universities
df = pd.DataFrame({'University': ['University of Calgary', 'University of Alberta', 'Mount Royal University'],
                   'City': ['Calgary', 'Edmonton', 'Calgary']})

# Convert to LaTeX format
latex_str = df.to_latex()
print(latex_str)

\begin{tabular}{lll}
\toprule
 & University & City \\
\midrule
0 & University of Calgary & Calgary \\
1 & University of Alberta & Edmonton \\
2 & Mount Royal University & Calgary \\
\bottomrule
\end{tabular}

In this example:

Creating a DataFrame: We create a DataFrame with universities in Alberta and their cities.
Converting to LaTeX Format: The to_latex method converts the DataFrame to a LaTeX formatted string.

For more details, you can refer to the Pandas LaTeX documentation.

5.1.3.9. Other Inputs/Outputs#

SQL
- Database Integration: Standard language for relational databases.
- Efficiency: Efficiently handles large datasets.
- Pandas SQL documentation
HDFStore: PyTables (HDF5)
- High Performance: Optimized for fast I/O operations.
- Scalability: Suitable for large datasets.
- Pandas HDFStore documentation
Feather
- Speed: Designed for fast read and write operations.
- Interoperability: Usable with both Python and R.
- Pandas Feather documentation
Parquet
- Efficiency: Columnar storage is efficient for analytical queries.
- Compression: Highly compressed files.
- Pandas Parquet documentation
ORC
- Performance: High performance in big data environments.
- Compression: Efficiently compresses data.
- Pandas ORC documentation
SAS
- Advanced Analytics: Widely used for statistical analysis.
- Data Management: Efficiently handles large datasets.
- Pandas SAS documentation
SPSS
- Statistical Analysis: Designed for complex statistical analyses.
- Ease of Use: User-friendly interface.
- Pandas SPSS documentation
Google BigQuery
- Scalability: Handles large-scale data analysis.
- Serverless: Fully-managed data warehouse.
- Pandas BigQuery documentation

Table 5.2 summarizes the key inputs and outputs supported by Pandas for data manipulation and analysis.

Table 5.2 A summary table of the various Pandas inputs and outputs.#
Input/Output	Description	Example
Pickling	Serialization of DataFrames to byte streams.	`df.to_pickle('file.pkl')`
Flat File	Reading/writing CSV, TSV files.	`df.to_csv('file.csv')`
Clipboard	Copying data to/from clipboard.	`df.to_clipboard()`
Excel	Reading/writing Excel files.	`df.to_excel('file.xlsx')`
JSON	Reading/writing JSON files.	`df.to_json('file.json')`
HTML	Reading/writing HTML tables.	`df.to_html('file.html')`
XML	Reading/writing XML files.	`df.to_xml('file.xml')`
SQL	Reading/writing SQL databases.	`pd.read_sql('query', conn)`
HDFStore	Storing DataFrames in HDF5 format.	`df.to_hdf('file.h5', 'key')`
Feather	Fast binary file format for DataFrames.	`df.to_feather('file.feather')`
Parquet	Columnar storage file format.	`df.to_parquet('file.parquet')`
ORC	Optimized Row Columnar format.	`df.to_orc('file.orc')`
SAS	Reading SAS files.	`pd.read_sas('file.sas7bdat')`
SPSS	Reading SPSS files.	`pd.read_spss('file.sav')`
BigQuery	Reading/writing Google BigQuery.	`pd.read_gbq('query')`
STATA	Reading/writing STATA files.	`pd.read_stata('file.dta')`

	A	B
90	90	910
91	91	909
92	92	908
93	93	907
94	94	906
95	95	905
96	96	904
97	97	903
98	98	902
99	99	901

	A	B
90	90	910
91	91	909
92	92	908
93	93	907
94	94	906
95	95	905
96	96	904
97	97	903
98	98	902
99	99	901

An Introduction to Pandas

Contents

5.1. An Introduction to Pandas#

5.1.1. Data Structures:#

5.1.2. Baisc Pandas DataFrame Commands#

5.1.3. Pandas Inputs/Outputs#

5.1.3.1. Pickling#

5.1.3.2. Flat File#

5.1.3.3. Clipboard#

5.1.3.4. Excel#

5.1.3.5. JSON#

5.1.3.6. HTML#

5.1.3.7. XML#

5.1.3.8. LaTeX#

5.1.3.9. Other Inputs/Outputs#

	A	B
90	90	910
91	91	909
92	92	908
93	93	907
94	94	906
95	95	905
96	96	904
97	97	903
98	98	902
99	99	901