5.1. An Introduction to Pandas#
Pandas is an open-source Python library that offers robust data structures and tools for data analysis. It is designed to simplify and streamline data manipulation and analysis tasks. The name “Pandas” is derived from “Panel Data,” referring to multi-dimensional structured datasets commonly used in econometrics and finance [Pandas Developers, 2023].
Pandas provides a diverse set of functions and methods for data handling, including reading and writing data from various file formats like CSV, Excel, and SQL databases. It facilitates data cleaning, transformation, aggregation, and addressing missing values. Additionally, Pandas supports merging and joining datasets and offers functionality for time series operations [Pandas Developers, 2023].
Widely adopted in data science, machine learning, finance, economics, and numerous other fields, Pandas is favored for its ability to simplify and expedite data processing and analysis workflows. Its clear and intuitive syntax makes it accessible to both novices and experienced data scientists [Pandas Developers, 2023].
>>> pip install pandas
5.1.1. Data Structures:#
Series: The Series is akin to a one-dimensional labeled array, similar to a NumPy array but with an associated index. This index provides meaningful labels for each element in the Series, allowing for effortless data alignment and retrieval [Pandas Developers, 2023].
DataFrame: The DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet or SQL table. It consists of rows and columns, where each column can accommodate various data types. DataFrames provide a versatile and potent method for working with structured data, enabling operations such as filtering, joining, grouping, and more [Pandas Developers, 2023].
Example - Series: A Pandas Series object can be instantiated through the implementation of the pd.Series constructor.
import pandas as pd
# Create a Pandas Series with custom index
data = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])
# Print the Pandas Series
print("Pandas Series:")
print(data)
Pandas Series:
A 10
B 20
C 30
D 40
dtype: int64
Example - DataFrame: A Pandas DataFrame object can be created by utilizing the pd.DataFrame constructor.
import pandas as pd
# Create a DataFrame from a dictionary
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22],
'City': ['Calgary', 'Edmonton', 'Red Deer']
}
df = pd.DataFrame(data)
# Display the DataFrame
print("DataFrame:")
display(df)
DataFrame:
Name | Age | City | |
---|---|---|---|
0 | Alice | 25 | Calgary |
1 | Bob | 30 | Edmonton |
2 | Charlie | 22 | Red Deer |
5.1.2. Baisc Pandas DataFrame Commands#
Understanding basic Pandas commands is essential for efficient data manipulation and analysis. Below is a summary of common DataFrame commands and their descriptions:
pd.DataFrame(data)
: Creates a DataFrame from data such as a dictionary, array, or list.data.info()
: Displays basic information about the DataFrame, including data types and non-null counts.data.head(n)
: Displays the firstn
rows of the DataFrame (default is 5).data.tail(n)
: Displays the lastn
rows of the DataFrame (default is 5).data.describe()
: Displays summary statistics of numerical columns (count, mean, std, min, max, quartiles).data.shape
: Returns the number of rows and columns in the DataFrame as a tuple.data.columns
: Accesses the column labels of the DataFrame.
Command |
Description |
---|---|
|
Create a DataFrame from data like a dictionary, array, or list. |
|
Display basic information about the DataFrame, including data types and non-null counts. |
|
Display the first n rows of the DataFrame (default is 5). |
|
Display the last n rows of the DataFrame (default is 5). |
|
Display summary statistics of numerical columns (count, mean, std, min, max, quartiles). |
|
Returns the number of rows and columns in the DataFrame as a tuple. |
|
Access the column labels of the DataFrame. |
Example:
import pandas as pd
import numpy as np
# Create the DataFrame
data = pd.DataFrame({'A': np.arange(0, 100),
'B': np.arange(1000, 900, -1)})
# Display basic DataFrame information
print("Displaying DataFrame Information:")
display(data)
# Get DataFrame information using data.info()
print("\nDataFrame Info:")
data.info()
# Display the first 10 rows
n = 10
print(f"\nDisplaying First {n} Rows:")
display(data.head(n))
# Display the last 10 rows
print(f"\nDisplaying Last {n} Rows:")
display(data.tail(n))
# Display summary statistics
print("\nSummary Statistics:")
display(data.describe())
# Get the shape of the DataFrame
print("\nDataFrame Shape:")
print("Number of rows and columns:", data.shape)
# Get column labels
print("\nColumn Labels:")
print("Column labels:", data.columns)
Displaying DataFrame Information:
A | B | |
---|---|---|
0 | 0 | 1000 |
1 | 1 | 999 |
2 | 2 | 998 |
3 | 3 | 997 |
4 | 4 | 996 |
... | ... | ... |
95 | 95 | 905 |
96 | 96 | 904 |
97 | 97 | 903 |
98 | 98 | 902 |
99 | 99 | 901 |
100 rows × 2 columns
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 100 non-null int32
1 B 100 non-null int32
dtypes: int32(2)
memory usage: 932.0 bytes
Displaying First 10 Rows:
A | B | |
---|---|---|
0 | 0 | 1000 |
1 | 1 | 999 |
2 | 2 | 998 |
3 | 3 | 997 |
4 | 4 | 996 |
5 | 5 | 995 |
6 | 6 | 994 |
7 | 7 | 993 |
8 | 8 | 992 |
9 | 9 | 991 |
Displaying Last 10 Rows:
A | B | |
---|---|---|
90 | 90 | 910 |
91 | 91 | 909 |
92 | 92 | 908 |
93 | 93 | 907 |
94 | 94 | 906 |
95 | 95 | 905 |
96 | 96 | 904 |
97 | 97 | 903 |
98 | 98 | 902 |
99 | 99 | 901 |
Summary Statistics:
A | B | |
---|---|---|
count | 100.000000 | 100.000000 |
mean | 49.500000 | 950.500000 |
std | 29.011492 | 29.011492 |
min | 0.000000 | 901.000000 |
25% | 24.750000 | 925.750000 |
50% | 49.500000 | 950.500000 |
75% | 74.250000 | 975.250000 |
max | 99.000000 | 1000.000000 |
DataFrame Shape:
Number of rows and columns: (100, 2)
Column Labels:
Column labels: Index(['A', 'B'], dtype='object')
Note
By default, the head()
and tail()
functions display the first or last 5 rows of the DataFrame. However, you can specify a different number of rows to display by passing an argument to these functions. You can change the number of rows displayed by passing an integer argument to the head()
and tail()
functions. For example, data.head(10)
would display the first 10 rows of the DataFrame.
These are just a few basic commands in Pandas. The library offers a wide range of functions for data manipulation, exploration, and analysis. You can refer to the official Pandas documentation for more details and examples: Pandas Documentation.
5.1.3. Pandas Inputs/Outputs#
Pandas is a powerful Python library that provides data manipulation and analysis tools. It can read data from various file formats like CSV, Excel, SQL databases, and more [Pandas Developers, 2023].
5.1.3.1. Pickling#
Pickling is a way to serialize (convert to a byte stream) and deserialize (convert back to an object) Python objects. This is useful for saving complex data structures like DataFrames. Serialization allows you to save the state of an object to a file, which can be loaded back into memory later, preserving the object’s state.
Why Use Pickling?
Efficiency: Pickling is faster than other serialization formats like CSV or JSON because it directly converts Python objects into byte streams.
Convenience: It preserves the data types and structure of the DataFrame, making it easy to reload without additional parsing or type conversion.
Example:
import pandas as pd
# Create a DataFrame about Calgary temperatures
df = pd.DataFrame({'Month': ['Jan', 'Feb', 'Mar'], 'Temperature': [-1, 0, 5]})
# Save to a pickle file
df.to_pickle('calgary_temps.pkl')
# Load from a pickle file
df_loaded = pd.read_pickle('calgary_temps.pkl')
display(df_loaded)
Month | Temperature | |
---|---|---|
0 | Jan | -1 |
1 | Feb | 0 |
2 | Mar | 5 |
In this example:
Creating a DataFrame: We create a simple DataFrame with months and corresponding temperatures.
Saving to a Pickle File: The
to_pickle
method serializes the DataFrame and saves it to a file named ‘calgary_temps.pkl’.Loading from a Pickle File: The
pd.read_pickle
method deserializes the byte stream from the file back into a DataFrame.
5.1.3.2. Flat File#
Flat files are simple text files that store data in a plain text format. Common formats include CSV and TSV. These files are easy to read and write, making them a popular choice for data storage and transfer [Pandas Developers, 2023].
Why Use Flat Files?
Simplicity: Flat files are straightforward to create and manage.
Compatibility: They can be easily shared and opened with various software, including text editors and spreadsheet programs.
Efficiency: For small to medium-sized datasets, flat files are efficient and quick to process.
Example:
The data here is sourced from citypopulation.de.
import pandas as pd
# Create a DataFrame about Alberta population
df = pd.DataFrame({'City': ['Calgary', 'Edmonton', 'Red Deer'], 'Population': [1481466, 1128811, 109234]})
# Save to a CSV file
df.to_csv('alberta_population.csv', index=False)
# Load from a CSV file
df_loaded = pd.read_csv('alberta_population.csv')
display(df)
City | Population | |
---|---|---|
0 | Calgary | 1481466 |
1 | Edmonton | 1128811 |
2 | Red Deer | 109234 |
In this example:
Creating a DataFrame: We create a DataFrame with cities in Alberta and their populations.
Saving to a CSV File: The
to_csv
method saves the DataFrame to a file named ‘alberta_population.csv’.
Additional Options
When reading or writing data, Pandas offers various options to customize the behavior:
Reading CSV Files: You can specify delimiter, encoding, header presence, and more using
read_csv()
.Writing CSV Files: You can control the delimiter, whether to include headers, and more using the
to_csv()
method’s parameters.Excel Files: For Excel files, you can specify the sheet name using the
sheet_name
parameter withto_excel()
andread_excel()
.
For more details, refer to the Pandas IO documentation.
Keep in mind that Pandas is just one tool in the data manipulation and analysis ecosystem. Depending on your needs, other libraries like NumPy, Matplotlib/Seaborn for visualization, and scikit-learn for machine learning might be useful as well.
Example:
import pandas as pd
Link = 'https://download.microsoft.com/download/4/C/8/4C830C0C-101F-4BF2-8FCB-32D9A8BA906A/Import_User_Sample_en.csv'
# Read a CSV file into a DataFrame
data = pd.read_csv(Link)
# Display the DataFrame
print("The DataFrame:")
display(data)
The DataFrame:
User Name | First Name | Last Name | Display Name | Job Title | Department | Office Number | Office Phone | Mobile Phone | Fax | Address | City | State or Province | ZIP or Postal Code | Country or Region | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chris@contoso.com | Chris | Green | Chris Green | IT Manager | Information Technology | 123451 | 123-555-1211 | 123-555-6641 | 123-555-9821 | 1 Microsoft way | Redmond | Wa | 98052 | United States |
1 | ben@contoso.com | Ben | Andrews | Ben Andrews | IT Manager | Information Technology | 123452 | 123-555-1212 | 123-555-6642 | 123-555-9822 | 1 Microsoft way | Redmond | Wa | 98052 | United States |
2 | david@contoso.com | David | Longmuir | David Longmuir | IT Manager | Information Technology | 123453 | 123-555-1213 | 123-555-6643 | 123-555-9823 | 1 Microsoft way | Redmond | Wa | 98052 | United States |
3 | cynthia@contoso.com | Cynthia | Carey | Cynthia Carey | IT Manager | Information Technology | 123454 | 123-555-1214 | 123-555-6644 | 123-555-9824 | 1 Microsoft way | Redmond | Wa | 98052 | United States |
4 | melissa@contoso.com | Melissa | MacBeth | Melissa MacBeth | IT Manager | Information Technology | 123455 | 123-555-1215 | 123-555-6645 | 123-555-9825 | 1 Microsoft way | Redmond | Wa | 98052 | United States |
In this example:
Reading a CSV File: We read a CSV file from a URL into a DataFrame using the
pd.read_csv
method.Displaying the DataFrame: The
display
function is used to show the DataFrame.
5.1.3.3. Clipboard#
You can copy data to and from the system clipboard using pandas. This is useful for quickly transferring data between applications [Pandas Developers, 2023].
Why Use Clipboard Operations?
Convenience: Easily transfer data between different applications without saving to a file.
Speed: Quickly copy and paste data for immediate use in other programs like spreadsheets or text editors.
Example:
import pandas as pd
# Create a DataFrame about Canadian provinces
df = pd.DataFrame({'Province': ['Alberta', 'British Columbia', 'Ontario'], 'Capital': ['Edmonton', 'Victoria', 'Toronto']})
# Copy to clipboard
df.to_clipboard(index=False)
display(df)
Province | Capital | |
---|---|---|
0 | Alberta | Edmonton |
1 | British Columbia | Victoria |
2 | Ontario | Toronto |
In this example:
Creating a DataFrame: We create a DataFrame with Canadian provinces and their capitals.
Copying to Clipboard: The
to_clipboard
method copies the DataFrame to the system clipboard, making it available for pasting into other applications.
5.1.3.4. Excel#
Pandas can read from and write to Excel files using the read_excel
and to_excel
functions. This functionality is particularly useful for data analysis and reporting, as Excel is a widely used format for data storage and exchange [Pandas Developers, 2023].
Why Use Excel with Pandas?
Versatility: Excel files can store multiple sheets, making it easy to organize and manage different datasets within a single file.
Compatibility: Excel is a common format in many industries, making it easy to share data with others who may not use Python.
Functionality: Pandas provides extensive options for reading and writing Excel files, allowing for customization and flexibility.
Example:
import pandas as pd
# Create a DataFrame about Calgary sports teams
df = pd.DataFrame({'Team': ['Flames', 'Stampeders', 'Roughnecks'], 'Sport': ['Hockey', 'Football', 'Lacrosse']})
# Save to an Excel file
df.to_excel('calgary_sports_teams.xlsx', index=False)
display(df)
Team | Sport | |
---|---|---|
0 | Flames | Hockey |
1 | Stampeders | Football |
2 | Roughnecks | Lacrosse |
In this example:
Creating a DataFrame: We create a DataFrame with Calgary sports teams and their respective sports.
Saving to an Excel File: The
to_excel
method saves the DataFrame to a file named ‘calgary_sports_teams.xlsx’.
Example:
import pandas as pd
link_to_file = r'https://open.alberta.ca/dataset/c45f2f78-0d6c-4a7e-98bc-313fbb232040/resource/1b346528-5188-47ee-8fef-b3d41e3db7c0/download/population-estimates-ab-quarterly-1951-to-current.xlsx'
# Load from an Excel file
df_loaded = pd.read_excel(link_to_file, skiprows=2, nrows=74)
display(df_loaded)
Year | January 1 | April 1 | July 1 | October 1 | |
---|---|---|---|---|---|
0 | 1951 | -- | -- | 943000.0 | 950000.0 |
1 | 1952 | 959000 | 965000 | 977000.0 | 985000.0 |
2 | 1953 | 994000 | 1002000 | 1016000.0 | 1029000.0 |
3 | 1954 | 1039000 | 1047000 | 1061000.0 | 1070000.0 |
4 | 1955 | 1078000 | 1084000 | 1094000.0 | 1102000.0 |
... | ... | ... | ... | ... | ... |
69 | 2020 | 4392958 | 4405455 | 4407495.0 | 4412013.0 |
70 | 2021 | 4418338 | 4425617 | 4431531.0 | 4450442.0 |
71 | 2022 | 4465537 | 4480956 | 4510891.0 | 4561350.0 |
72 | 2023 | 4598444 | 4645229 | 4695290.0 | 4756408.0 |
73 | 2024 | 4800768 | 4849906 | NaN | NaN |
74 rows × 5 columns
In this example:
Reading an Excel File: We read data from an Excel file located at a URL into a DataFrame using the
pd.read_excel
method.Customizing the Read Operation: The
skiprows
parameter skips the first two rows, and thenrows
parameter limits the read to 74 rows.
For more details, you can refer to the Pandas IO documentation and the specific methods read_excel and to_excel.
5.1.3.5. JSON#
JSON (JavaScript Object Notation) is a lightweight data interchange format. Pandas can read from and write to JSON files. JSON is commonly used for transmitting data in web applications [Pandas Developers, 2023].
Why Use JSON?
Human-Readable: JSON is easy to read and write for humans.
Interoperable: JSON is language-independent, making it a great choice for data interchange between different systems.
Example: The numbers are fictional.
import pandas as pd
# Create a DataFrame about Alberta parks
df = pd.DataFrame({'Park': ['Banff', 'Jasper', 'Elk Island'], 'Visitors': [4000000, 2000000, 500000]})
# Save to a JSON file
df.to_json('alberta_parks.json')
# Load from a JSON file
df_loaded = pd.read_json('alberta_parks.json')
display(df_loaded)
Park | Visitors | |
---|---|---|
0 | Banff | 4000000 |
1 | Jasper | 2000000 |
2 | Elk Island | 500000 |
In this example:
Creating a DataFrame: We create a DataFrame with parks in Alberta and their visitor numbers.
Saving to a JSON File: The
to_json
method saves the DataFrame to a file named ‘alberta_parks.json’.Loading from a JSON File: The
pd.read_json
method reads the data from the JSON file back into a DataFrame.
For more details, you can refer to the Pandas JSON documentation.
5.1.3.6. HTML#
Pandas can read tables from HTML files and write DataFrames to HTML. This is useful for web scraping and exporting data to be displayed on web pages.
Why Use HTML?
Web Integration: HTML is the standard language for creating web pages, making it easy to integrate data with web content.
Versatility: You can extract data from any HTML table on the web.
Example:
# Import the Pandas library and alias it as 'pd'
import pandas as pd
# Use pd.read_html to extract tables from the specified URL
# [0] selects the first table on the webpage since read_html returns a list of DataFrames
Grades_UofC = pd.read_html('https://conted.ucalgary.ca/info/grades.jsp')[0]
# Display the DataFrame containing grades information from the University of Calgary
display(Grades_UofC)
Grades_UofC.to_html('grades_uofc.html')
0 | 1 | 2 | |
---|---|---|---|
0 | A+ | 95 – 100% | Outstanding |
1 | A | 90 – 94% | Excellent Superior performance, showing compre... |
2 | A- | 85 – 89% | Approaching Excellent |
3 | B+ | 80 – 84% | Exceeding Good |
4 | B | 75 – 79% | Good Clearly above average performance with kn... |
5 | B- | 70 – 74% | Approaching Good |
6 | C+ | 67 – 69% | Exceeding Satisfactory |
7 | C | 64 – 66% | Satisfactory (minimal pass) Basic understandin... |
8 | C- | 60 – 63% | Approaching Satisfactory Receipt of a C- or le... |
9 | D+ | 55 – 59% | Marginal Performance |
10 | D | 50 – 54% | Minimal Performance |
11 | F | 0 – 49% | Fail |
12 | AU | NaN | Course Audit No course credit. Permission to A... |
13 | CR | NaN | Completed Requirements |
14 | NC | NaN | Not Complete The NC grade is assigned to stude... |
15 | AT | NaN | Attended |
16 | NS | NaN | Not Subject to Grading |
In this example:
Reading HTML Tables: The
pd.read_html
method extracts tables from a URL into a list of DataFrames.Displaying the DataFrame: The
display
function shows the DataFrame.Saving to an HTML File: The
to_html
method saves the DataFrame to an HTML file.
For more details, you can refer to the Pandas HTML documentation.
5.1.3.7. XML#
XML (eXtensible Markup Language) is a markup language that defines a set of rules for encoding documents. Pandas can read from and write to XML files, which are commonly used for data interchange and storage.
Why Use XML?
Structured Data: XML is designed to store and transport data, with a focus on what data is.
Flexibility: XML can represent complex data structures.
import pandas as pd
# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
# Save to an XML file
df.to_xml('data.xml')
# Load from an XML file
df_loaded = pd.read_xml('data.xml')
display(df_loaded)
index | A | B | |
---|---|---|---|
0 | 0 | 1 | 4 |
1 | 1 | 2 | 5 |
2 | 2 | 3 | 6 |
In this example:
Creating a DataFrame: We create a simple DataFrame.
Saving to an XML File: The
to_xml
method saves the DataFrame to a file named ‘data.xml’.Loading from an XML File: The
pd.read_xml
method reads the data from the XML file back into a DataFrame.
For more details, you can refer to the Pandas XML documentation.
5.1.3.8. LaTeX#
LaTeX is a typesetting system commonly used for technical and scientific documentation. Pandas can convert DataFrames to LaTeX format, which is useful for including tables in LaTeX documents.
Why Use LaTeX?
Professional Formatting: LaTeX produces high-quality typeset documents.
Scientific Use: Widely used in academia for papers, theses, and books.
Example:
import pandas as pd
# Create a DataFrame about Alberta universities
df = pd.DataFrame({'University': ['University of Calgary', 'University of Alberta', 'Mount Royal University'],
'City': ['Calgary', 'Edmonton', 'Calgary']})
# Convert to LaTeX format
latex_str = df.to_latex()
print(latex_str)
\begin{tabular}{lll}
\toprule
& University & City \\
\midrule
0 & University of Calgary & Calgary \\
1 & University of Alberta & Edmonton \\
2 & Mount Royal University & Calgary \\
\bottomrule
\end{tabular}
In this example:
Creating a DataFrame: We create a DataFrame with universities in Alberta and their cities.
Converting to LaTeX Format: The
to_latex
method converts the DataFrame to a LaTeX formatted string.
For more details, you can refer to the Pandas LaTeX documentation.
5.1.3.9. Other Inputs/Outputs#
SQL
Database Integration: Standard language for relational databases.
Efficiency: Efficiently handles large datasets.
HDFStore: PyTables (HDF5)
High Performance: Optimized for fast I/O operations.
Scalability: Suitable for large datasets.
Feather
Speed: Designed for fast read and write operations.
Interoperability: Usable with both Python and R.
Parquet
Efficiency: Columnar storage is efficient for analytical queries.
Compression: Highly compressed files.
ORC
Performance: High performance in big data environments.
Compression: Efficiently compresses data.
SAS
Advanced Analytics: Widely used for statistical analysis.
Data Management: Efficiently handles large datasets.
SPSS
Statistical Analysis: Designed for complex statistical analyses.
Ease of Use: User-friendly interface.
Google BigQuery
Scalability: Handles large-scale data analysis.
Serverless: Fully-managed data warehouse.
Table 5.2 summarizes the key inputs and outputs supported by Pandas for data manipulation and analysis.
Input/Output |
Description |
Example |
---|---|---|
Pickling |
Serialization of DataFrames to byte streams. |
|
Flat File |
Reading/writing CSV, TSV files. |
|
Clipboard |
Copying data to/from clipboard. |
|
Excel |
Reading/writing Excel files. |
|
JSON |
Reading/writing JSON files. |
|
HTML |
Reading/writing HTML tables. |
|
XML |
Reading/writing XML files. |
|
SQL |
Reading/writing SQL databases. |
|
HDFStore |
Storing DataFrames in HDF5 format. |
|
Feather |
Fast binary file format for DataFrames. |
|
Parquet |
Columnar storage file format. |
|
ORC |
Optimized Row Columnar format. |
|
SAS |
Reading SAS files. |
|
SPSS |
Reading SPSS files. |
|
BigQuery |
Reading/writing Google BigQuery. |
|
STATA |
Reading/writing STATA files. |
|