5.1. An Introduction to Pandas#

Pandas is an open-source Python library that offers robust data structures and tools for data analysis. It is designed to simplify and streamline data manipulation and analysis tasks. The name “Pandas” is derived from “Panel Data,” referring to multi-dimensional structured datasets commonly used in econometrics and finance [Pandas Developers, 2023].

Pandas provides a diverse set of functions and methods for data handling, including reading and writing data from various file formats like CSV, Excel, and SQL databases. It facilitates data cleaning, transformation, aggregation, and addressing missing values. Additionally, Pandas supports merging and joining datasets and offers functionality for time series operations [Pandas Developers, 2023].

Widely adopted in data science, machine learning, finance, economics, and numerous other fields, Pandas is favored for its ability to simplify and expedite data processing and analysis workflows. Its clear and intuitive syntax makes it accessible to both novices and experienced data scientists [Pandas Developers, 2023].

>>> pip install pandas

5.1.1. Data Structures:#

  1. Series: The Series is akin to a one-dimensional labeled array, similar to a NumPy array but with an associated index. This index provides meaningful labels for each element in the Series, allowing for effortless data alignment and retrieval [Pandas Developers, 2023].

  2. DataFrame: The DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet or SQL table. It consists of rows and columns, where each column can accommodate various data types. DataFrames provide a versatile and potent method for working with structured data, enabling operations such as filtering, joining, grouping, and more [Pandas Developers, 2023].

Example - Series: A Pandas Series object can be instantiated through the implementation of the pd.Series constructor.

import pandas as pd

# Create a Pandas Series with custom index
data = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])

# Print the Pandas Series
print("Pandas Series:")
print(data)
Pandas Series:
A    10
B    20
C    30
D    40
dtype: int64

Example - DataFrame: A Pandas DataFrame object can be created by utilizing the pd.DataFrame constructor.

import pandas as pd

# Create a DataFrame from a dictionary
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 22],
        'City': ['Calgary', 'Edmonton', 'Red Deer']
        }

df = pd.DataFrame(data)

# Display the DataFrame
print("DataFrame:")
display(df)
DataFrame:
Name Age City
0 Alice 25 Calgary
1 Bob 30 Edmonton
2 Charlie 22 Red Deer

5.1.2. Baisc Pandas DataFrame Commands#

Understanding basic Pandas commands is essential for efficient data manipulation and analysis. Below is a summary of common DataFrame commands and their descriptions:

  1. pd.DataFrame(data): Creates a DataFrame from data such as a dictionary, array, or list.

  2. data.info(): Displays basic information about the DataFrame, including data types and non-null counts.

  3. data.head(n): Displays the first n rows of the DataFrame (default is 5).

  4. data.tail(n): Displays the last n rows of the DataFrame (default is 5).

  5. data.describe(): Displays summary statistics of numerical columns (count, mean, std, min, max, quartiles).

  6. data.shape: Returns the number of rows and columns in the DataFrame as a tuple.

  7. data.columns: Accesses the column labels of the DataFrame.

Table 5.1 Summary of Common Pandas DataFrame Commands#

Command

Description

pd.DataFrame(data)

Create a DataFrame from data like a dictionary, array, or list.

data.info()

Display basic information about the DataFrame, including data types and non-null counts.

data.head(n)

Display the first n rows of the DataFrame (default is 5).

data.tail(n)

Display the last n rows of the DataFrame (default is 5).

data.describe()

Display summary statistics of numerical columns (count, mean, std, min, max, quartiles).

data.shape

Returns the number of rows and columns in the DataFrame as a tuple.

data.columns

Access the column labels of the DataFrame.

Example:

import pandas as pd
import numpy as np
 
# Create the DataFrame
data = pd.DataFrame({'A': np.arange(0, 100),
                     'B': np.arange(1000, 900, -1)})

# Display basic DataFrame information
print("Displaying DataFrame Information:")
display(data)

# Get DataFrame information using data.info()
print("\nDataFrame Info:")
data.info()

# Display the first 10 rows
n = 10
print(f"\nDisplaying First {n} Rows:")
display(data.head(n))

# Display the last 10 rows
print(f"\nDisplaying Last {n} Rows:")
display(data.tail(n))

# Display summary statistics
print("\nSummary Statistics:")
display(data.describe())

# Get the shape of the DataFrame
print("\nDataFrame Shape:")
print("Number of rows and columns:", data.shape)

# Get column labels
print("\nColumn Labels:")
print("Column labels:", data.columns)
Displaying DataFrame Information:
A B
0 0 1000
1 1 999
2 2 998
3 3 997
4 4 996
... ... ...
95 95 905
96 96 904
97 97 903
98 98 902
99 99 901

100 rows × 2 columns

DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   A       100 non-null    int32
 1   B       100 non-null    int32
dtypes: int32(2)
memory usage: 932.0 bytes

Displaying First 10 Rows:
A B
0 0 1000
1 1 999
2 2 998
3 3 997
4 4 996
5 5 995
6 6 994
7 7 993
8 8 992
9 9 991
Displaying Last 10 Rows:
A B
90 90 910
91 91 909
92 92 908
93 93 907
94 94 906
95 95 905
96 96 904
97 97 903
98 98 902
99 99 901
Summary Statistics:
A B
count 100.000000 100.000000
mean 49.500000 950.500000
std 29.011492 29.011492
min 0.000000 901.000000
25% 24.750000 925.750000
50% 49.500000 950.500000
75% 74.250000 975.250000
max 99.000000 1000.000000
DataFrame Shape:
Number of rows and columns: (100, 2)

Column Labels:
Column labels: Index(['A', 'B'], dtype='object')

Note

By default, the head() and tail() functions display the first or last 5 rows of the DataFrame. However, you can specify a different number of rows to display by passing an argument to these functions. You can change the number of rows displayed by passing an integer argument to the head() and tail() functions. For example, data.head(10) would display the first 10 rows of the DataFrame.

These are just a few basic commands in Pandas. The library offers a wide range of functions for data manipulation, exploration, and analysis. You can refer to the official Pandas documentation for more details and examples: Pandas Documentation.

5.1.3. Pandas Inputs/Outputs#

Pandas is a powerful Python library that provides data manipulation and analysis tools. It can read data from various file formats like CSV, Excel, SQL databases, and more [Pandas Developers, 2023].

5.1.3.1. Pickling#

Pickling is a way to serialize (convert to a byte stream) and deserialize (convert back to an object) Python objects. This is useful for saving complex data structures like DataFrames. Serialization allows you to save the state of an object to a file, which can be loaded back into memory later, preserving the object’s state.

Why Use Pickling?

  • Efficiency: Pickling is faster than other serialization formats like CSV or JSON because it directly converts Python objects into byte streams.

  • Convenience: It preserves the data types and structure of the DataFrame, making it easy to reload without additional parsing or type conversion.

Example:

import pandas as pd

# Create a DataFrame about Calgary temperatures
df = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar'], 'Temperature': [-1, 0, 5]})

# Save to a pickle file
df.to_pickle('calgary_temps.pkl')

# Load from a pickle file
df_loaded = pd.read_pickle('calgary_temps.pkl')
display(df_loaded)
Month Temperature
0 Jan -1
1 Feb 0
2 Mar 5

In this example:

  1. Creating a DataFrame: We create a simple DataFrame with months and corresponding temperatures.

  2. Saving to a Pickle File: The to_pickle method serializes the DataFrame and saves it to a file named ‘calgary_temps.pkl’.

  3. Loading from a Pickle File: The pd.read_pickle method deserializes the byte stream from the file back into a DataFrame.

5.1.3.2. Flat File#

Flat files are simple text files that store data in a plain text format. Common formats include CSV and TSV. These files are easy to read and write, making them a popular choice for data storage and transfer [Pandas Developers, 2023].

Why Use Flat Files?

  • Simplicity: Flat files are straightforward to create and manage.

  • Compatibility: They can be easily shared and opened with various software, including text editors and spreadsheet programs.

  • Efficiency: For small to medium-sized datasets, flat files are efficient and quick to process.

Example:

The data here is sourced from citypopulation.de.

import pandas as pd

# Create a DataFrame about Alberta population
df = pd.DataFrame({'City': ['Calgary', 'Edmonton', 'Red Deer'], 'Population': [1481466, 1128811, 109234]})

# Save to a CSV file
df.to_csv('alberta_population.csv', index=False)

# Load from a CSV file
df_loaded = pd.read_csv('alberta_population.csv')
display(df)
City Population
0 Calgary 1481466
1 Edmonton 1128811
2 Red Deer 109234

In this example:

  1. Creating a DataFrame: We create a DataFrame with cities in Alberta and their populations.

  2. Saving to a CSV File: The to_csv method saves the DataFrame to a file named ‘alberta_population.csv’.

Additional Options

When reading or writing data, Pandas offers various options to customize the behavior:

  • Reading CSV Files: You can specify delimiter, encoding, header presence, and more using read_csv().

  • Writing CSV Files: You can control the delimiter, whether to include headers, and more using the to_csv() method’s parameters.

  • Excel Files: For Excel files, you can specify the sheet name using the sheet_name parameter with to_excel() and read_excel().

For more details, refer to the Pandas IO documentation.

Keep in mind that Pandas is just one tool in the data manipulation and analysis ecosystem. Depending on your needs, other libraries like NumPy, Matplotlib/Seaborn for visualization, and scikit-learn for machine learning might be useful as well.

Example:

import pandas as pd

Link = 'https://download.microsoft.com/download/4/C/8/4C830C0C-101F-4BF2-8FCB-32D9A8BA906A/Import_User_Sample_en.csv'
# Read a CSV file into a DataFrame
data = pd.read_csv(Link)

# Display the DataFrame
print("The DataFrame:")
display(data)
The DataFrame:
User Name First Name Last Name Display Name Job Title Department Office Number Office Phone Mobile Phone Fax Address City State or Province ZIP or Postal Code Country or Region
0 chris@contoso.com Chris Green Chris Green IT Manager Information Technology 123451 123-555-1211 123-555-6641 123-555-9821 1 Microsoft way Redmond Wa 98052 United States
1 ben@contoso.com Ben Andrews Ben Andrews IT Manager Information Technology 123452 123-555-1212 123-555-6642 123-555-9822 1 Microsoft way Redmond Wa 98052 United States
2 david@contoso.com David Longmuir David Longmuir IT Manager Information Technology 123453 123-555-1213 123-555-6643 123-555-9823 1 Microsoft way Redmond Wa 98052 United States
3 cynthia@contoso.com Cynthia Carey Cynthia Carey IT Manager Information Technology 123454 123-555-1214 123-555-6644 123-555-9824 1 Microsoft way Redmond Wa 98052 United States
4 melissa@contoso.com Melissa MacBeth Melissa MacBeth IT Manager Information Technology 123455 123-555-1215 123-555-6645 123-555-9825 1 Microsoft way Redmond Wa 98052 United States

In this example:

  1. Reading a CSV File: We read a CSV file from a URL into a DataFrame using the pd.read_csv method.

  2. Displaying the DataFrame: The display function is used to show the DataFrame.

5.1.3.3. Clipboard#

You can copy data to and from the system clipboard using pandas. This is useful for quickly transferring data between applications [Pandas Developers, 2023].

Why Use Clipboard Operations?

  • Convenience: Easily transfer data between different applications without saving to a file.

  • Speed: Quickly copy and paste data for immediate use in other programs like spreadsheets or text editors.

Example:

import pandas as pd

# Create a DataFrame about Canadian provinces
df = pd.DataFrame({'Province': ['Alberta', 'British Columbia', 'Ontario'], 'Capital': ['Edmonton', 'Victoria', 'Toronto']})

# Copy to clipboard
df.to_clipboard(index=False)
display(df)
Province Capital
0 Alberta Edmonton
1 British Columbia Victoria
2 Ontario Toronto

In this example:

  1. Creating a DataFrame: We create a DataFrame with Canadian provinces and their capitals.

  2. Copying to Clipboard: The to_clipboard method copies the DataFrame to the system clipboard, making it available for pasting into other applications.

5.1.3.4. Excel#

Pandas can read from and write to Excel files using the read_excel and to_excel functions. This functionality is particularly useful for data analysis and reporting, as Excel is a widely used format for data storage and exchange [Pandas Developers, 2023].

Why Use Excel with Pandas?

  • Versatility: Excel files can store multiple sheets, making it easy to organize and manage different datasets within a single file.

  • Compatibility: Excel is a common format in many industries, making it easy to share data with others who may not use Python.

  • Functionality: Pandas provides extensive options for reading and writing Excel files, allowing for customization and flexibility.

Example:

import pandas as pd

# Create a DataFrame about Calgary sports teams
df = pd.DataFrame({'Team': ['Flames', 'Stampeders', 'Roughnecks'], 'Sport': ['Hockey', 'Football', 'Lacrosse']})

# Save to an Excel file
df.to_excel('calgary_sports_teams.xlsx', index=False)
display(df)
Team Sport
0 Flames Hockey
1 Stampeders Football
2 Roughnecks Lacrosse

In this example:

  1. Creating a DataFrame: We create a DataFrame with Calgary sports teams and their respective sports.

  2. Saving to an Excel File: The to_excel method saves the DataFrame to a file named ‘calgary_sports_teams.xlsx’.

Example:

import pandas as pd

link_to_file = r'https://open.alberta.ca/dataset/c45f2f78-0d6c-4a7e-98bc-313fbb232040/resource/1b346528-5188-47ee-8fef-b3d41e3db7c0/download/population-estimates-ab-quarterly-1951-to-current.xlsx'
# Load from an Excel file
df_loaded = pd.read_excel(link_to_file, skiprows=2, nrows=74)
display(df_loaded)
Year January 1 April 1 July 1 October 1
0 1951 -- -- 943000.0 950000.0
1 1952 959000 965000 977000.0 985000.0
2 1953 994000 1002000 1016000.0 1029000.0
3 1954 1039000 1047000 1061000.0 1070000.0
4 1955 1078000 1084000 1094000.0 1102000.0
... ... ... ... ... ...
69 2020 4392958 4405455 4407495.0 4412013.0
70 2021 4418338 4425617 4431531.0 4450442.0
71 2022 4465537 4480956 4510891.0 4561350.0
72 2023 4598444 4645229 4695290.0 4756408.0
73 2024 4800768 4849906 NaN NaN

74 rows × 5 columns

In this example:

  1. Reading an Excel File: We read data from an Excel file located at a URL into a DataFrame using the pd.read_excel method.

  2. Customizing the Read Operation: The skiprows parameter skips the first two rows, and the nrows parameter limits the read to 74 rows.

For more details, you can refer to the Pandas IO documentation and the specific methods read_excel and to_excel.

5.1.3.5. JSON#

JSON (JavaScript Object Notation) is a lightweight data interchange format. Pandas can read from and write to JSON files. JSON is commonly used for transmitting data in web applications [Pandas Developers, 2023].

Why Use JSON?

  • Human-Readable: JSON is easy to read and write for humans.

  • Interoperable: JSON is language-independent, making it a great choice for data interchange between different systems.

Example: The numbers are fictional.

import pandas as pd

# Create a DataFrame about Alberta parks
df = pd.DataFrame({'Park': ['Banff', 'Jasper', 'Elk Island'], 'Visitors': [4000000, 2000000, 500000]})

# Save to a JSON file
df.to_json('alberta_parks.json')

# Load from a JSON file
df_loaded = pd.read_json('alberta_parks.json')
display(df_loaded)
Park Visitors
0 Banff 4000000
1 Jasper 2000000
2 Elk Island 500000

In this example:

  1. Creating a DataFrame: We create a DataFrame with parks in Alberta and their visitor numbers.

  2. Saving to a JSON File: The to_json method saves the DataFrame to a file named ‘alberta_parks.json’.

  3. Loading from a JSON File: The pd.read_json method reads the data from the JSON file back into a DataFrame.

For more details, you can refer to the Pandas JSON documentation.

5.1.3.6. HTML#

Pandas can read tables from HTML files and write DataFrames to HTML. This is useful for web scraping and exporting data to be displayed on web pages.

Why Use HTML?

  • Web Integration: HTML is the standard language for creating web pages, making it easy to integrate data with web content.

  • Versatility: You can extract data from any HTML table on the web.

Example:

# Import the Pandas library and alias it as 'pd'
import pandas as pd

# Use pd.read_html to extract tables from the specified URL
# [0] selects the first table on the webpage since read_html returns a list of DataFrames
Grades_UofC = pd.read_html('https://conted.ucalgary.ca/info/grades.jsp')[0]

# Display the DataFrame containing grades information from the University of Calgary
display(Grades_UofC)

Grades_UofC.to_html('grades_uofc.html')
0 1 2
0 A+ 95 – 100% Outstanding
1 A 90 – 94% Excellent Superior performance, showing compre...
2 A- 85 – 89% Approaching Excellent
3 B+ 80 – 84% Exceeding Good
4 B 75 – 79% Good Clearly above average performance with kn...
5 B- 70 – 74% Approaching Good
6 C+ 67 – 69% Exceeding Satisfactory
7 C 64 – 66% Satisfactory (minimal pass) Basic understandin...
8 C- 60 – 63% Approaching Satisfactory Receipt of a C- or le...
9 D+ 55 – 59% Marginal Performance
10 D 50 – 54% Minimal Performance
11 F 0 – 49% Fail
12 AU NaN Course Audit No course credit. Permission to A...
13 CR NaN Completed Requirements
14 NC NaN Not Complete The NC grade is assigned to stude...
15 AT NaN Attended
16 NS NaN Not Subject to Grading

In this example:

  1. Reading HTML Tables: The pd.read_html method extracts tables from a URL into a list of DataFrames.

  2. Displaying the DataFrame: The display function shows the DataFrame.

  3. Saving to an HTML File: The to_html method saves the DataFrame to an HTML file.

For more details, you can refer to the Pandas HTML documentation.

5.1.3.7. XML#

XML (eXtensible Markup Language) is a markup language that defines a set of rules for encoding documents. Pandas can read from and write to XML files, which are commonly used for data interchange and storage.

Why Use XML?

  • Structured Data: XML is designed to store and transport data, with a focus on what data is.

  • Flexibility: XML can represent complex data structures.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Save to an XML file
df.to_xml('data.xml')

# Load from an XML file
df_loaded = pd.read_xml('data.xml')
display(df_loaded)
index A B
0 0 1 4
1 1 2 5
2 2 3 6

In this example:

  1. Creating a DataFrame: We create a simple DataFrame.

  2. Saving to an XML File: The to_xml method saves the DataFrame to a file named ‘data.xml’.

  3. Loading from an XML File: The pd.read_xml method reads the data from the XML file back into a DataFrame.

For more details, you can refer to the Pandas XML documentation.

5.1.3.8. LaTeX#

LaTeX is a typesetting system commonly used for technical and scientific documentation. Pandas can convert DataFrames to LaTeX format, which is useful for including tables in LaTeX documents.

Why Use LaTeX?

  • Professional Formatting: LaTeX produces high-quality typeset documents.

  • Scientific Use: Widely used in academia for papers, theses, and books.

Example:

import pandas as pd

# Create a DataFrame about Alberta universities
df = pd.DataFrame({'University': ['University of Calgary', 'University of Alberta', 'Mount Royal University'],
                   'City': ['Calgary', 'Edmonton', 'Calgary']})

# Convert to LaTeX format
latex_str = df.to_latex()
print(latex_str)
\begin{tabular}{lll}
\toprule
 & University & City \\
\midrule
0 & University of Calgary & Calgary \\
1 & University of Alberta & Edmonton \\
2 & Mount Royal University & Calgary \\
\bottomrule
\end{tabular}

In this example:

  1. Creating a DataFrame: We create a DataFrame with universities in Alberta and their cities.

  2. Converting to LaTeX Format: The to_latex method converts the DataFrame to a LaTeX formatted string.

For more details, you can refer to the Pandas LaTeX documentation.

5.1.3.9. Other Inputs/Outputs#

  • SQL

    • Database Integration: Standard language for relational databases.

    • Efficiency: Efficiently handles large datasets.

    • Pandas SQL documentation

  • HDFStore: PyTables (HDF5)

  • Feather

  • Parquet

  • ORC

    • Performance: High performance in big data environments.

    • Compression: Efficiently compresses data.

    • Pandas ORC documentation

  • SAS

    • Advanced Analytics: Widely used for statistical analysis.

    • Data Management: Efficiently handles large datasets.

    • Pandas SAS documentation

  • SPSS

    • Statistical Analysis: Designed for complex statistical analyses.

    • Ease of Use: User-friendly interface.

    • Pandas SPSS documentation

  • Google BigQuery

Table 5.2 summarizes the key inputs and outputs supported by Pandas for data manipulation and analysis.

Table 5.2 A summary table of the various Pandas inputs and outputs.#

Input/Output

Description

Example

Pickling

Serialization of DataFrames to byte streams.

df.to_pickle('file.pkl')

Flat File

Reading/writing CSV, TSV files.

df.to_csv('file.csv')

Clipboard

Copying data to/from clipboard.

df.to_clipboard()

Excel

Reading/writing Excel files.

df.to_excel('file.xlsx')

JSON

Reading/writing JSON files.

df.to_json('file.json')

HTML

Reading/writing HTML tables.

df.to_html('file.html')

XML

Reading/writing XML files.

df.to_xml('file.xml')

SQL

Reading/writing SQL databases.

pd.read_sql('query', conn)

HDFStore

Storing DataFrames in HDF5 format.

df.to_hdf('file.h5', 'key')

Feather

Fast binary file format for DataFrames.

df.to_feather('file.feather')

Parquet

Columnar storage file format.

df.to_parquet('file.parquet')

ORC

Optimized Row Columnar format.

df.to_orc('file.orc')

SAS

Reading SAS files.

pd.read_sas('file.sas7bdat')

SPSS

Reading SPSS files.

pd.read_spss('file.sav')

BigQuery

Reading/writing Google BigQuery.

pd.read_gbq('query')

STATA

Reading/writing STATA files.

pd.read_stata('file.dta')