6.1. An Introduction to Pandas#
Pandas is a Python library, open-source in nature, that offers robust data structures and tools for data analysis. It is specifically designed to simplify and streamline data manipulation and analysis tasks, making them more efficient and straightforward. The name “Pandas” is derived from “Panel Data,” which refers to multi-dimensional structured datasets commonly used in econometrics and finance [Pandas Developers, 2023]. Pandas comes equipped with a diverse set of functions and methods for data handling, including reading and writing data from various file formats like CSV, Excel, and SQL databases. It facilitates data cleaning, transformation, aggregation, and addressing missing values. Furthermore, Pandas supports merging and joining datasets and offers functionality for time series operations [Pandas Developers, 2023]. Widely adopted in data science, machine learning, finance, economics, and numerous other fields, Pandas is favored for its ability to simplify and expedite data processing and analysis workflows. Its clear and intuitive syntax makes it accessible to both novices and experienced data scientists [Pandas Developers, 2023].
>>> pip install pandas
Here is a brief summary of key Python Pandas features:
6.1.1. Data Structures:#
Series: The Series is akin to a one-dimensional labeled array, similar to a NumPy array but with an associated index. This index provides meaningful labels for each element in the Series, allowing for effortless data alignment and retrieval [Pandas Developers, 2023].
DataFrame: The DataFrame is a two-dimensional labeled data structure, similar to a spreadsheet or SQL table. It consists of rows and columns, where each column can accommodate various data types. DataFrames provide a versatile and potent method for working with structured data, enabling operations such as filtering, joining, grouping, and more [Pandas Developers, 2023].
Example - Series: A Pandas Series object can be instantiated through the implementation of the pd.Series constructor.
import pandas as pd
# Create a Pandas Series with custom index
data = pd.Series([10, 20, 30, 40], index=['A', 'B', 'C', 'D'])
# Print the Pandas Series
print("Pandas Series:")
print(data)
Pandas Series:
A 10
B 20
C 30
D 40
dtype: int64
Example - DataFrame: A Pandas DataFrame object can be created by utilizing the pd.DataFrame constructor.
import pandas as pd
# Create a DataFrame from a dictionary
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22],
'City': ['Calgary', 'Edmonton', 'Red Deer']
}
df = pd.DataFrame(data)
# Display the DataFrame
print("DataFrame:")
display(df)
DataFrame:
Name | Age | City | |
---|---|---|---|
0 | Alice | 25 | Calgary |
1 | Bob | 30 | Edmonton |
2 | Charlie | 22 | Red Deer |
6.1.2. Reading and Writing Data:#
Pandas is a powerful Python library that provides data manipulation and analysis tools. It’s widely used for tasks like reading and writing data in various formats. Here’s how you can use Pandas to read and write data [Pandas Developers, 2023]:
6.1.2.1. Reading Data#
Pandas can read data from various file formats like CSV, Excel, SQL databases, and more. The most commonly used method is pandas.read_csv()
for reading CSV files.
import pandas as pd
Link = 'https://download.microsoft.com/download/4/C/8/4C830C0C-101F-4BF2-8FCB-32D9A8BA906A/Import_User_Sample_en.csv'
# Read a CSV file into a DataFrame
data = pd.read_csv(Link)
# Display the DataFrame
print("The DataFrame:")
display(data)
The DataFrame:
User Name | First Name | Last Name | Display Name | Job Title | Department | Office Number | Office Phone | Mobile Phone | Fax | Address | City | State or Province | ZIP or Postal Code | Country or Region | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | chris@contoso.com | Chris | Green | Chris Green | IT Manager | Information Technology | 123451 | 123-555-1211 | 123-555-6641 | 123-555-9821 | 1 Microsoft way | Redmond | Wa | 98052 | United States |
1 | ben@contoso.com | Ben | Andrews | Ben Andrews | IT Manager | Information Technology | 123452 | 123-555-1212 | 123-555-6642 | 123-555-9822 | 1 Microsoft way | Redmond | Wa | 98052 | United States |
2 | david@contoso.com | David | Longmuir | David Longmuir | IT Manager | Information Technology | 123453 | 123-555-1213 | 123-555-6643 | 123-555-9823 | 1 Microsoft way | Redmond | Wa | 98052 | United States |
3 | cynthia@contoso.com | Cynthia | Carey | Cynthia Carey | IT Manager | Information Technology | 123454 | 123-555-1214 | 123-555-6644 | 123-555-9824 | 1 Microsoft way | Redmond | Wa | 98052 | United States |
4 | melissa@contoso.com | Melissa | MacBeth | Melissa MacBeth | IT Manager | Information Technology | 123455 | 123-555-1215 | 123-555-6645 | 123-555-9825 | 1 Microsoft way | Redmond | Wa | 98052 | United States |
You can also read Excel files using pandas.read_excel()
:
import pandas as pd
# Read an Excel file into a DataFrame
# Here the excel file is hosted on the web
data_excel = pd.read_excel('https://go.microsoft.com/fwlink/?LinkID=521962', sheet_name='Sheet1')
# Display the first five rows of the DataFrame
print("First five rows of the DataFrame:")
display(data_excel.head())
# The default behavior of Pandas' `head()` method is to display the first 5 rows of a DataFrame.
First five rows of the DataFrame:
Segment | Country | Product | Discount Band | Units Sold | Manufacturing Price | Sale Price | Gross Sales | Discounts | Sales | COGS | Profit | Date | Month Number | Month Name | Year | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Government | Canada | Carretera | NaN | 1618.5 | 3 | 20 | 32370.0 | 0.0 | 32370.0 | 16185.0 | 16185.0 | 2014-01-01 | 1 | January | 2014 |
1 | Government | Germany | Carretera | NaN | 1321.0 | 3 | 20 | 26420.0 | 0.0 | 26420.0 | 13210.0 | 13210.0 | 2014-01-01 | 1 | January | 2014 |
2 | Midmarket | France | Carretera | NaN | 2178.0 | 3 | 15 | 32670.0 | 0.0 | 32670.0 | 21780.0 | 10890.0 | 2014-06-01 | 6 | June | 2014 |
3 | Midmarket | Germany | Carretera | NaN | 888.0 | 3 | 15 | 13320.0 | 0.0 | 13320.0 | 8880.0 | 4440.0 | 2014-06-01 | 6 | June | 2014 |
4 | Midmarket | Mexico | Carretera | NaN | 2470.0 | 3 | 15 | 37050.0 | 0.0 | 37050.0 | 24700.0 | 12350.0 | 2014-06-01 | 6 | June | 2014 |
6.1.2.2. Writing and Exporting Data#
Pandas provides a versatile set of tools for exporting data to a variety of formats. One of the frequently employed techniques is using the DataFrame.to_csv()
method, which facilitates the export of data to a CSV (Comma-Separated Values) file:
Example:
import pandas as pd
# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
'Age': [25, 30, 22],
'City': ['Calgary', 'Edmonton', 'Red Deer']
}
df = pd.DataFrame(data)
# Write the DataFrame to a CSV file
csv_filename = 'data.csv'
df.to_csv(csv_filename, index=False)
# Print a message indicating that the data has been written
print(f"Data written to {csv_filename}")
Data written to data.csv
For writing to Excel files, you can use DataFrame.to_excel()
:
# Write DataFrame to an Excel file
df.to_excel('new_data.xlsx', index=False)
Pandas can also write data to various other formats like SQL databases, JSON, and more [Pandas Developers, 2023].
Additional Options:
When reading or writing data, Pandas offers various options to customize the behavior [Pandas Developers, 2023]:
You can specify delimiter, encoding, header presence, and more when reading CSV files using
read_csv()
.When writing CSV files, you can control the delimiter, whether to include headers, and more using the
to_csv()
method’s parameters.For Excel files, you can specify the sheet name using the
sheet_name
parameter withto_excel()
andread_excel()
.
Remember that Pandas provides extensive documentation with examples for these methods, so you can refer to the official documentation for more details: Pandas IO documentation
Keep in mind that Pandas is just one tool in the data manipulation and analysis ecosystem. Depending on your needs, other libraries like NumPy, Matplotlib/Seaborn for visualization, and scikit-learn for machine learning might be useful as well.
6.1.2.3. Pandas Basics#
Some basic Pandas commands along with their descriptions:
pd.DataFrame(data)
: Create a DataFrame from data like a dictionary, array, or list.data.info()
: Display basic information about the DataFrame, including data types and non-null counts.data.head(n)
: Display the firstn
rows of the DataFrame (default is 5).data.tail(n)
: Display the lastn
rows of the DataFrame (default is 5).data.describe()
: Display summary statistics of numerical columns (count, mean, std, min, max, quartiles).data.shape
: Returns the number of rows and columns in the DataFrame as a tuple.data.columns
: Access the column labels of the DataFrame.
Command |
Description |
---|---|
|
Create a DataFrame from data like a dictionary, array, or list. |
|
Display basic information about the DataFrame, including data types and non-null counts. |
|
Display the first n rows of the DataFrame (default is 5). |
|
Display the last n rows of the DataFrame (default is 5). |
|
Display summary statistics of numerical columns (count, mean, std, min, max, quartiles). |
|
Returns the number of rows and columns in the DataFrame as a tuple. |
|
Access the column labels of the DataFrame. |
Example:
import pandas as pd
import numpy as np
# Create the DataFrame
data = pd.DataFrame({'A': np.arange(0, 100),
'B': np.arange(1000, 900, -1)})
# Display basic DataFrame information
print("Displaying DataFrame Information:")
display(data)
# Get DataFrame information using data.info()
print("\nDataFrame Info:")
data.info()
# Display the first 10 rows
n = 10
print(f"\nDisplaying First {n} Rows:")
display(data.head(n))
# Display the last 10 rows
print(f"\nDisplaying Last {n} Rows:")
display(data.tail(n))
# Display summary statistics
print("\nSummary Statistics:")
display(data.describe())
# Get the shape of the DataFrame
print("\nDataFrame Shape:")
print("Number of rows and columns:", data.shape)
# Get column labels
print("\nColumn Labels:")
print("Column labels:", data.columns)
Displaying DataFrame Information:
A | B | |
---|---|---|
0 | 0 | 1000 |
1 | 1 | 999 |
2 | 2 | 998 |
3 | 3 | 997 |
4 | 4 | 996 |
... | ... | ... |
95 | 95 | 905 |
96 | 96 | 904 |
97 | 97 | 903 |
98 | 98 | 902 |
99 | 99 | 901 |
100 rows × 2 columns
DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 100 non-null int32
1 B 100 non-null int32
dtypes: int32(2)
memory usage: 932.0 bytes
Displaying First 10 Rows:
A | B | |
---|---|---|
0 | 0 | 1000 |
1 | 1 | 999 |
2 | 2 | 998 |
3 | 3 | 997 |
4 | 4 | 996 |
5 | 5 | 995 |
6 | 6 | 994 |
7 | 7 | 993 |
8 | 8 | 992 |
9 | 9 | 991 |
Displaying Last 10 Rows:
A | B | |
---|---|---|
90 | 90 | 910 |
91 | 91 | 909 |
92 | 92 | 908 |
93 | 93 | 907 |
94 | 94 | 906 |
95 | 95 | 905 |
96 | 96 | 904 |
97 | 97 | 903 |
98 | 98 | 902 |
99 | 99 | 901 |
Summary Statistics:
A | B | |
---|---|---|
count | 100.000000 | 100.000000 |
mean | 49.500000 | 950.500000 |
std | 29.011492 | 29.011492 |
min | 0.000000 | 901.000000 |
25% | 24.750000 | 925.750000 |
50% | 49.500000 | 950.500000 |
75% | 74.250000 | 975.250000 |
max | 99.000000 | 1000.000000 |
DataFrame Shape:
Number of rows and columns: (100, 2)
Column Labels:
Column labels: Index(['A', 'B'], dtype='object')
Note
By default, the head()
and tail()
functions display the first or last 5 rows of the DataFrame. However, you can specify a different number of rows to display by passing an argument to these functions. You can change the number of rows displayed by passing an integer argument to the head()
and tail()
functions. For example, data.head(10)
would display the first 10 rows of the DataFrame.
These are just a few basic commands in Pandas. The library offers a wide range of functions for data manipulation, exploration, and analysis. You can refer to the official Pandas documentation for more details and examples: Pandas Documentation.
6.1.3. Read html with Pandas#
pandas.read_html
is a handy function that comes with the Pandas library in Python. It makes it easy to pull tables out of web pages in HTML format and turn them into Pandas DataFrames. This is super useful for people who work with data because it lets them grab structured information from websites without a lot of hassle. If you want to dig into all the details and options, you can check out the full syntax here.
Examples:
# Import the Pandas library and alias it as 'pd'
import pandas as pd
# Use pd.read_html to extract tables from the specified URL
# [0] selects the first table on the webpage since read_html returns a list of DataFrames
Player_Stats = pd.read_html('https://www.premierleague.com/stats/top/players/goals?se=274')[0]
# Display the DataFrame containing player statistics
display(Player_Stats)
Rank | Player | Club | Nationality | Stat | Unnamed: 5 | |
---|---|---|---|---|---|---|
0 | 1.0 | Erling Haaland | Manchester City | Norway | 8 | NaN |
1 | 2.0 | Son Heung-Min | Tottenham Hotspur | South Korea | 6 | NaN |
2 | 3.0 | Jarrod Bowen | West Ham United | England | 5 | NaN |
3 | 4.0 | Odsonne Édouard | Crystal Palace | France | 4 | NaN |
4 | 4.0 | Evan Ferguson | Brighton & Hove Albion | Ireland | 4 | NaN |
5 | 4.0 | Hwang Hee-Chan | Wolverhampton Wanderers | South Korea | 4 | NaN |
6 | 4.0 | Alexander Isak | Newcastle United | Sweden | 4 | NaN |
7 | 4.0 | Bryan Mbeumo | Brentford | Cameroon | 4 | NaN |
8 | 4.0 | Bukayo Saka | Arsenal | England | 4 | NaN |
9 | 4.0 | Ollie Watkins | Aston Villa | England | 4 | NaN |
# Import the Pandas library and alias it as 'pd'
import pandas as pd
# Use pd.read_html to extract tables from the specified URL
# [0] selects the first table on the webpage since read_html returns a list of DataFrames
Grades_UofC = pd.read_html('https://conted.ucalgary.ca/info/grades.jsp')[0]
# Display the DataFrame containing grades information from the University of Calgary
display(Grades_UofC)
0 | 1 | 2 | |
---|---|---|---|
0 | A+ | 95 – 100% | Outstanding |
1 | A | 90 – 94% | Excellent Superior performance, showing compre... |
2 | A- | 85 – 89% | Approaching Excellent |
3 | B+ | 80 – 84% | Exceeding Good |
4 | B | 75 – 79% | Good Clearly above average performance with kn... |
5 | B- | 70 – 74% | Approaching Good |
6 | C+ | 67 – 69% | Exceeding Satisfactory |
7 | C | 64 – 66% | Satisfactory (minimal pass) Basic understandin... |
8 | C- | 60 – 63% | Approaching Satisfactory Receipt of a C- or le... |
9 | D+ | 55 – 59% | Marginal Performance |
10 | D | 50 – 54% | Minimal Performance |
11 | F | 0 – 49% | Fail |
12 | AU | NaN | Course Audit No course credit. Permission to A... |
13 | CR | NaN | Completed Requirements |
14 | NC | NaN | Not Complete The NC grade is assigned to stude... |
15 | AT | NaN | Attended |
16 | NS | NaN | Not Subject to Grading |
Example - Daily Weather Data Report for October 2023 (Calgary Int’l Airport, Alberta):
import pandas as pd
# Read the HTML data from the specified URL and store it as a list of DataFrames.
df_list = pd.read_html('https://climate.weather.gc.ca/climate_data/daily_data_e.html?StationID=50430')
# Select the first DataFrame from the list, which contains the climate data.
df = df_list[0]
# Display the DataFrame containing the climate data for analysis.
display(df)
DAY | Max Temp Definition °C | Min Temp Definition °C | Mean Temp Definition °C | Heat Deg Days Definition | Cool Deg Days Definition | Total Rain Definition mm | Total Snow Definition cm | Total Precip Definition mm | Snow on Grnd Definition cm | Dir of Max Gust Definition 10's deg | Spd of Max Gust Definition km/h | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01 | 12.5 | 3.3 | 7.9 | 10.1 | 0.0 | 0.4 | 0.0 | 0.4 | NaN | 18 | 38 |
1 | 02 | 18.0 | 1.5 | 9.8 | 8.2 | 0.0 | LegendTT | 0.0 | LegendTT | NaN | 31 | 33 |
2 | 03 | 15.1 | 1.2 | 8.2 | 9.8 | 0.0 | 0.0 | 0.0 | 0.0 | NaN | LegendMM | LegendMM |
3 | 04 | 13.1 | 3.6 | 8.4 | 9.6 | 0.0 | LegendTT | 0.0 | LegendTT | NaN | 33 | 57 |
4 | 05 | 10.2 | 1.2 | 5.7 | 12.3 | 0.0 | LegendTT | 0.0 | LegendTT | NaN | 2 | 37 |
5 | Sum | NaN | NaN | NaN | 50.0LegendCarer^ | 0.0LegendCarer^ | 0.4LegendCarer^ | 0.0LegendCarer^ | 0.4LegendCarer^ | NaN | NaN | NaN |
6 | Avg | 13.8LegendCarer^ | 2.2LegendCarer^ | 8.0LegendCarer^ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | Xtrm | 18.0LegendCarer^ | 1.2LegendCarer^ | NaN | NaN | NaN | 0.4LegendCarer^ | 0.0LegendCarer^ | 0.4LegendCarer^ | NaN | 33LegendCarer^ | 57LegendCarer^ |
8 | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... |
Dropping Columns:
Dropping columns in Pandas is accomplished using the
drop
method, specifying the columns to be removed. The primary parameters for this operation are as follows:- `labels`: A list or single label representing the column(s) to be dropped. - `axis`: Specifies the axis along which the drop operation will occur. Use `axis=1` for columns. - `inplace`: An optional parameter, when set to `True`, modifies the original DataFrame; otherwise, it returns a new DataFrame with the specified columns removed. Here is a code example to drop a column named 'ColumnName': ```python df.drop(columns='ColumnName', inplace=True) ```
Please see this link for more details.
# Remove the column 'Snow on Grnd Definition cm' from the DataFrame.
df.drop(columns='Snow on Grnd Definition cm', inplace=True)
# Display the DataFrame after removing the specified column.
display(df)
DAY | Max Temp Definition °C | Min Temp Definition °C | Mean Temp Definition °C | Heat Deg Days Definition | Cool Deg Days Definition | Total Rain Definition mm | Total Snow Definition cm | Total Precip Definition mm | Dir of Max Gust Definition 10's deg | Spd of Max Gust Definition km/h | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01 | 12.5 | 3.3 | 7.9 | 10.1 | 0.0 | 0.4 | 0.0 | 0.4 | 18 | 38 |
1 | 02 | 18.0 | 1.5 | 9.8 | 8.2 | 0.0 | LegendTT | 0.0 | LegendTT | 31 | 33 |
2 | 03 | 15.1 | 1.2 | 8.2 | 9.8 | 0.0 | 0.0 | 0.0 | 0.0 | LegendMM | LegendMM |
3 | 04 | 13.1 | 3.6 | 8.4 | 9.6 | 0.0 | LegendTT | 0.0 | LegendTT | 33 | 57 |
4 | 05 | 10.2 | 1.2 | 5.7 | 12.3 | 0.0 | LegendTT | 0.0 | LegendTT | 2 | 37 |
5 | Sum | NaN | NaN | NaN | 50.0LegendCarer^ | 0.0LegendCarer^ | 0.4LegendCarer^ | 0.0LegendCarer^ | 0.4LegendCarer^ | NaN | NaN |
6 | Avg | 13.8LegendCarer^ | 2.2LegendCarer^ | 8.0LegendCarer^ | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | Xtrm | 18.0LegendCarer^ | 1.2LegendCarer^ | NaN | NaN | NaN | 0.4LegendCarer^ | 0.0LegendCarer^ | 0.4LegendCarer^ | 33LegendCarer^ | 57LegendCarer^ |
8 | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... |
Dropping Rows:
Similarly, dropping rows in Pandas is achieved using the
drop
method, with the primary parameters as follows:labels
: A list or single label representing the row(s) to be dropped.axis
: Useaxis=0
(which is the default) to drop rows. You don’t need to specifyaxis
explicitly for dropping rows.inplace
: As before, setting this toTrue
will modify the original DataFrame.
To drop rows by index, you can use the following code as an example:
df.drop(index=[0, 1, 2], inplace=True)
This code will remove the rows with index 0, 1, and 2 from the DataFrame ‘df’.
Please see this link for more details.
# Remove the row with index 7 from the DataFrame.
df.drop(df.index[-1], inplace=True)
# Display the DataFrame after removing the specified row.
display(df)
DAY | Max Temp Definition °C | Min Temp Definition °C | Mean Temp Definition °C | Heat Deg Days Definition | Cool Deg Days Definition | Total Rain Definition mm | Total Snow Definition cm | Total Precip Definition mm | Dir of Max Gust Definition 10's deg | Spd of Max Gust Definition km/h | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01 | 12.5 | 3.3 | 7.9 | 10.1 | 0.0 | 0.4 | 0.0 | 0.4 | 18 | 38 |
1 | 02 | 18.0 | 1.5 | 9.8 | 8.2 | 0.0 | LegendTT | 0.0 | LegendTT | 31 | 33 |
2 | 03 | 15.1 | 1.2 | 8.2 | 9.8 | 0.0 | 0.0 | 0.0 | 0.0 | LegendMM | LegendMM |
3 | 04 | 13.1 | 3.6 | 8.4 | 9.6 | 0.0 | LegendTT | 0.0 | LegendTT | 33 | 57 |
4 | 05 | 10.2 | 1.2 | 5.7 | 12.3 | 0.0 | LegendTT | 0.0 | LegendTT | 2 | 37 |
5 | Sum | NaN | NaN | NaN | 50.0LegendCarer^ | 0.0LegendCarer^ | 0.4LegendCarer^ | 0.0LegendCarer^ | 0.4LegendCarer^ | NaN | NaN |
6 | Avg | 13.8LegendCarer^ | 2.2LegendCarer^ | 8.0LegendCarer^ | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | Xtrm | 18.0LegendCarer^ | 1.2LegendCarer^ | NaN | NaN | NaN | 0.4LegendCarer^ | 0.0LegendCarer^ | 0.4LegendCarer^ | 33LegendCarer^ | 57LegendCarer^ |
In this code:
df
is your DataFrame.df.index[-1]
retrieves the index label of the last row in the DataFrame.drop
is used to remove the row with the specified index label.inplace=True
is used to modify the DataFrame in place.
Column Names: To access the column names in a Pandas DataFrame, you can use the
.columns
attribute. Here’s how to do it:# Assuming df is your DataFrame column_names = df.columns
The
df.columns
attribute returns an Index object containing the names of all the columns in the DataFrame. You can then convert this Index object to a list if needed:column_names_list = list(column_names)
Please see this link for more details.
# Retrieve and display the column names of the DataFrame.
df.columns
Index(['DAY', 'Max Temp Definition °C', 'Min Temp Definition °C',
'Mean Temp Definition °C', 'Heat Deg Days Definition',
'Cool Deg Days Definition', 'Total Rain Definition mm',
'Total Snow Definition cm', 'Total Precip Definition mm',
'Dir of Max Gust Definition 10's deg',
'Spd of Max Gust Definition km/h'],
dtype='object')
Replace each occurrence of pattern/regex (Optional Content):
The
.str.replace
method in Pandas is used to perform string replacement within each element of a Pandas Series or DataFrame column containing string values. It allows for the search and substitution of specific substrings or patterns in strings.Here are the primary parameters of the
.str.replace
method:pat
: The pattern or substring to search for.repl
: The replacement string or value.regex
: An optional boolean parameter (default isFalse
) that specifies whether the pattern provided is a regular expression.n
: An optional parameter to control the maximum number of replacements. If not specified, it replaces all occurrences by default.
Please see this link for more details.
# Iterate through each column in the DataFrame.
for Col in df.columns:
# Replace 'LegendCarer^' with an empty string in the current column.
df[Col] = df[Col].str.replace('LegendCarer^', '')
# Replace 'LegendT' with an empty string in the current column.
df[Col] = df[Col].str.replace('LegendT', '')
# Replace 'LegendM' with an empty string in the current column.
df[Col] = df[Col].str.replace('LegendM', '')
# Display the DataFrame after the specified replacements.
display(df)
DAY | Max Temp Definition °C | Min Temp Definition °C | Mean Temp Definition °C | Heat Deg Days Definition | Cool Deg Days Definition | Total Rain Definition mm | Total Snow Definition cm | Total Precip Definition mm | Dir of Max Gust Definition 10's deg | Spd of Max Gust Definition km/h | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01 | 12.5 | 3.3 | 7.9 | 10.1 | 0.0 | 0.4 | 0.0 | 0.4 | 18 | 38 |
1 | 02 | 18.0 | 1.5 | 9.8 | 8.2 | 0.0 | T | 0.0 | T | 31 | 33 |
2 | 03 | 15.1 | 1.2 | 8.2 | 9.8 | 0.0 | 0.0 | 0.0 | 0.0 | M | M |
3 | 04 | 13.1 | 3.6 | 8.4 | 9.6 | 0.0 | T | 0.0 | T | 33 | 57 |
4 | 05 | 10.2 | 1.2 | 5.7 | 12.3 | 0.0 | T | 0.0 | T | 2 | 37 |
5 | Sum | NaN | NaN | NaN | 50.0 | 0.0 | 0.4 | 0.0 | 0.4 | NaN | NaN |
6 | Avg | 13.8 | 2.2 | 8.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
7 | Xtrm | 18.0 | 1.2 | NaN | NaN | NaN | 0.4 | 0.0 | 0.4 | 33 | 57 |
Pandas
set_index
Method:The
set_index
method in Pandas is used to set one or more columns as the index of a DataFrame. When you set an index, it means you are designating one or more columns to serve as the row labels, providing a way to uniquely identify and access rows in the DataFrame.Here’s the primary parameter of the
set_index
method:keys
: This parameter specifies the column or columns you want to use as the new index.
df.set_index(keys, inplace=False)
df
: Refers to the Pandas DataFrame.keys
: The column or columns that you want to set as the new index. This can be a single column name or a list of column names.
# Set the 'DAY' column as the index for the DataFrame.
df.set_index('DAY', inplace=True)
# Display the DataFrame with 'DAY' as the new index.
display(df)
Max Temp Definition °C | Min Temp Definition °C | Mean Temp Definition °C | Heat Deg Days Definition | Cool Deg Days Definition | Total Rain Definition mm | Total Snow Definition cm | Total Precip Definition mm | Dir of Max Gust Definition 10's deg | Spd of Max Gust Definition km/h | |
---|---|---|---|---|---|---|---|---|---|---|
DAY | ||||||||||
01 | 12.5 | 3.3 | 7.9 | 10.1 | 0.0 | 0.4 | 0.0 | 0.4 | 18 | 38 |
02 | 18.0 | 1.5 | 9.8 | 8.2 | 0.0 | T | 0.0 | T | 31 | 33 |
03 | 15.1 | 1.2 | 8.2 | 9.8 | 0.0 | 0.0 | 0.0 | 0.0 | M | M |
04 | 13.1 | 3.6 | 8.4 | 9.6 | 0.0 | T | 0.0 | T | 33 | 57 |
05 | 10.2 | 1.2 | 5.7 | 12.3 | 0.0 | T | 0.0 | T | 2 | 37 |
Sum | NaN | NaN | NaN | 50.0 | 0.0 | 0.4 | 0.0 | 0.4 | NaN | NaN |
Avg | 13.8 | 2.2 | 8.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Xtrm | 18.0 | 1.2 | NaN | NaN | NaN | 0.4 | 0.0 | 0.4 | 33 | 57 |
DataFrame.values: To access the underlying data as a NumPy array, you can do the following:
# Access the 'Max Temp Definition °C' column in the DataFrame.
df['Max Temp Definition °C']
DAY
01 12.5
02 18.0
03 15.1
04 13.1
05 10.2
Sum NaN
Avg 13.8
Xtrm 18.0
Name: Max Temp Definition °C, dtype: object
# Retrieve the values from the 'Max Temp Definition °C' column in the DataFrame.
df['Max Temp Definition °C'].values
array(['12.5', '18.0', '15.1', '13.1', '10.2', nan, '13.8', '18.0'],
dtype=object)
Renaming Columns and Indices: To rename columns and indices, you can use the
rename
method. Here’s an example:
# Import the pprint function for pretty printing.
from pprint import pprint
# Create a dictionary for renaming columns by removing 'Definition' from their names.
column_rename_dict = dict(zip(df.columns, [x.replace('Definition', '') for x in df.columns]))
# Print the title for the column renaming dictionary.
print("Column Renaming Dictionary:")
pprint(column_rename_dict)
# Rename the columns in the DataFrame using the created dictionary.
df.rename(columns=column_rename_dict, inplace=True)
# Rename specific index labels for clarity.
df.rename(index={'01': '1st', '02': '2nd', '03': '3rd', '04': '4th', '05': '5th'}, inplace=True)
# Print the title for the DataFrame with renamed columns and index labels.
print("\nDataFrame with Renamed Columns and Index Labels:")
display(df)
Column Renaming Dictionary:
{'Cool Deg Days Definition': 'Cool Deg Days ',
"Dir of Max Gust Definition 10's deg": "Dir of Max Gust 10's deg",
'Heat Deg Days Definition': 'Heat Deg Days ',
'Max Temp Definition °C': 'Max Temp °C',
'Mean Temp Definition °C': 'Mean Temp °C',
'Min Temp Definition °C': 'Min Temp °C',
'Spd of Max Gust Definition km/h': 'Spd of Max Gust km/h',
'Total Precip Definition mm': 'Total Precip mm',
'Total Rain Definition mm': 'Total Rain mm',
'Total Snow Definition cm': 'Total Snow cm'}
DataFrame with Renamed Columns and Index Labels:
Max Temp °C | Min Temp °C | Mean Temp °C | Heat Deg Days | Cool Deg Days | Total Rain mm | Total Snow cm | Total Precip mm | Dir of Max Gust 10's deg | Spd of Max Gust km/h | |
---|---|---|---|---|---|---|---|---|---|---|
DAY | ||||||||||
1st | 12.5 | 3.3 | 7.9 | 10.1 | 0.0 | 0.4 | 0.0 | 0.4 | 18 | 38 |
2nd | 18.0 | 1.5 | 9.8 | 8.2 | 0.0 | T | 0.0 | T | 31 | 33 |
3rd | 15.1 | 1.2 | 8.2 | 9.8 | 0.0 | 0.0 | 0.0 | 0.0 | M | M |
4th | 13.1 | 3.6 | 8.4 | 9.6 | 0.0 | T | 0.0 | T | 33 | 57 |
5th | 10.2 | 1.2 | 5.7 | 12.3 | 0.0 | T | 0.0 | T | 2 | 37 |
Sum | NaN | NaN | NaN | 50.0 | 0.0 | 0.4 | 0.0 | 0.4 | NaN | NaN |
Avg | 13.8 | 2.2 | 8.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Xtrm | 18.0 | 1.2 | NaN | NaN | NaN | 0.4 | 0.0 | 0.4 | 33 | 57 |