5.4. Modifying a DataFrame in Pandas#
In data analysis, modifying a DataFrame is a fundamental operation. This section covers common methods for renaming columns, dropping columns or rows, and other essential modifications.
5.4.1. Renaming Columns#
Renaming columns in a DataFrame can be accomplished using the rename
method. This method allows for specifying a dictionary where keys are the current column names and values are the new column names. This is particularly useful when you need to make your DataFrame more readable or align it with a specific format [Pandas Developers, 2023].
Example: The data in this example is sourced from Alberta Regional Dashboard [Government of Alberta, 2024].
import pandas as pd
# Sample DataFrame with population data from Alberta cities
df = pd.DataFrame({'Municipality': ['Calgary', 'Edmonton', 'Red Deer', 'Strathcona County', 'Lethbridge', 'Airdrie', 'Wood Buffalo'],
'Population 2022': [1413800, 1087803, 105883, 104330, 104254, 80222, 74532]
})
print('Original:')
display(df)
# Renaming columns
df.rename(columns={'Municipality': 'City', 'Population 2022': 'Population'}, inplace=True)
print('Renamed:')
display(df)
Original:
Municipality | Population 2022 | |
---|---|---|
0 | Calgary | 1413800 |
1 | Edmonton | 1087803 |
2 | Red Deer | 105883 |
3 | Strathcona County | 104330 |
4 | Lethbridge | 104254 |
5 | Airdrie | 80222 |
6 | Wood Buffalo | 74532 |
Renamed:
City | Population | |
---|---|---|
0 | Calgary | 1413800 |
1 | Edmonton | 1087803 |
2 | Red Deer | 105883 |
3 | Strathcona County | 104330 |
4 | Lethbridge | 104254 |
5 | Airdrie | 80222 |
6 | Wood Buffalo | 74532 |
The inplace=True
argument modifies the DataFrame directly without creating a copy.
5.4.2. Renaming using Axis-Style Parameters#
The rename
method also supports using axis-style parameters to rename columns and index labels in a DataFrame. Here are examples using the population data from Alberta cities.
Renaming Columns with a Function: You can use a function to rename columns dynamically. For instance, converting column names to lowercase:
# Sample DataFrame with population data from Alberta cities
df = pd.DataFrame({'Municipality': ['Calgary', 'Edmonton', 'Red Deer', 'Strathcona County', 'Lethbridge', 'Airdrie', 'Wood Buffalo'],
'Population 2022': [1413800, 1087803, 105883, 104330, 104254, 80222, 74532]
})
print('Original:')
display(df)
# Renaming columns using a function
df.rename(str.lower, axis='columns', inplace=True)
print('Renamed:')
display(df)
Original:
Municipality | Population 2022 | |
---|---|---|
0 | Calgary | 1413800 |
1 | Edmonton | 1087803 |
2 | Red Deer | 105883 |
3 | Strathcona County | 104330 |
4 | Lethbridge | 104254 |
5 | Airdrie | 80222 |
6 | Wood Buffalo | 74532 |
Renamed:
municipality | population 2022 | |
---|---|---|
0 | Calgary | 1413800 |
1 | Edmonton | 1087803 |
2 | Red Deer | 105883 |
3 | Strathcona County | 104330 |
4 | Lethbridge | 104254 |
5 | Airdrie | 80222 |
6 | Wood Buffalo | 74532 |
Renaming Index Labels: You can also rename index labels by specifying a dictionary where keys are the current index labels and values are the new index labels:
# Sample DataFrame with population data from Alberta cities
df = pd.DataFrame({'Municipality': ['Calgary', 'Edmonton', 'Red Deer', 'Strathcona County', 'Lethbridge', 'Airdrie', 'Wood Buffalo'],
'Population 2022': [1413800, 1087803, 105883, 104330, 104254, 80222, 74532]},
index=[1, 2, 3, 4, 5, 6, 7])
print('Original:')
display(df)
# Renaming index labels
df.rename({1: 101, 2: 102, 3: 103, 4: 104, 5: 105, 6: 106, 7: 107}, axis='index', inplace=True)
print('Renamed:')
display(df)
Original:
Municipality | Population 2022 | |
---|---|---|
1 | Calgary | 1413800 |
2 | Edmonton | 1087803 |
3 | Red Deer | 105883 |
4 | Strathcona County | 104330 |
5 | Lethbridge | 104254 |
6 | Airdrie | 80222 |
7 | Wood Buffalo | 74532 |
Renamed:
Municipality | Population 2022 | |
---|---|---|
101 | Calgary | 1413800 |
102 | Edmonton | 1087803 |
103 | Red Deer | 105883 |
104 | Strathcona County | 104330 |
105 | Lethbridge | 104254 |
106 | Airdrie | 80222 |
107 | Wood Buffalo | 74532 |
These examples demonstrate different ways to rename columns and index labels, providing flexibility depending on the context and requirements of your DataFrame operations. For more details, you can refer to the Pandas documentation on renaming columns.
5.4.3. Dropping Columns or Rows#
The drop
method is used to remove columns or rows from a DataFrame. This can be useful for cleaning up your data by removing unnecessary or irrelevant information.
Example - Dropping Columns: We will use the following data on vehicle registrations in Alberta cities [Government of Alberta, 2024]:
import pandas as pd
# Sample DataFrame with vehicle registrations data
df = pd.DataFrame({'Municipality': ['Calgary', 'Edmonton', 'Strathcona County', 'Red Deer', 'Lethbridge', 'Airdrie', 'Wood Buffalo'],
'Vehicle Registrations 2020': [1007282, 725796, 79271, 75533, 73964, 56838, 55905],
'Population 2020': [1413800, 1087803, 104330, 105883, 104254, 80222, 74532] # Additional column for demonstration
})
print('Original:')
display(df)
# Dropping a single column
df.drop(columns=['Population 2020'], inplace=True)
print('After dropping one column:')
display(df)
# Dropping multiple columns
df.drop(columns=['Municipality', 'Vehicle Registrations 2020'], inplace=True)
print('After dropping multiple columns:')
display(df)
Original:
Municipality | Vehicle Registrations 2020 | Population 2020 | |
---|---|---|---|
0 | Calgary | 1007282 | 1413800 |
1 | Edmonton | 725796 | 1087803 |
2 | Strathcona County | 79271 | 104330 |
3 | Red Deer | 75533 | 105883 |
4 | Lethbridge | 73964 | 104254 |
5 | Airdrie | 56838 | 80222 |
6 | Wood Buffalo | 55905 | 74532 |
After dropping one column:
Municipality | Vehicle Registrations 2020 | |
---|---|---|
0 | Calgary | 1007282 |
1 | Edmonton | 725796 |
2 | Strathcona County | 79271 |
3 | Red Deer | 75533 |
4 | Lethbridge | 73964 |
5 | Airdrie | 56838 |
6 | Wood Buffalo | 55905 |
After dropping multiple columns:
0 |
---|
1 |
2 |
3 |
4 |
5 |
6 |
Example - Dropping Rows:
Rows can be dropped based on their index. This is useful when you need to remove specific entries from your dataset.
# Recreate the DataFrame for row dropping examples
df = pd.DataFrame({'Municipality': ['Calgary', 'Edmonton', 'Strathcona County', 'Red Deer', 'Lethbridge', 'Airdrie', 'Wood Buffalo'],
'Vehicle Registrations 2020': [1007282, 725796, 79271, 75533, 73964, 56838, 55905]},
index=[1, 2, 3, 4, 5, 6, 7])
print('Original:')
display(df)
# Dropping a single row
df.drop(index=1, inplace=True)
print('After dropping one row:')
display(df)
# Dropping multiple rows
df.drop(index=[2, 3], inplace=True)
print('After dropping multiple rows:')
display(df)
Original:
Municipality | Vehicle Registrations 2020 | |
---|---|---|
1 | Calgary | 1007282 |
2 | Edmonton | 725796 |
3 | Strathcona County | 79271 |
4 | Red Deer | 75533 |
5 | Lethbridge | 73964 |
6 | Airdrie | 56838 |
7 | Wood Buffalo | 55905 |
After dropping one row:
Municipality | Vehicle Registrations 2020 | |
---|---|---|
2 | Edmonton | 725796 |
3 | Strathcona County | 79271 |
4 | Red Deer | 75533 |
5 | Lethbridge | 73964 |
6 | Airdrie | 56838 |
7 | Wood Buffalo | 55905 |
After dropping multiple rows:
Municipality | Vehicle Registrations 2020 | |
---|---|---|
4 | Red Deer | 75533 |
5 | Lethbridge | 73964 |
6 | Airdrie | 56838 |
7 | Wood Buffalo | 55905 |
Example - Dropping Rows and Columns Simultaneously:
The drop
method also allows for simultaneously dropping both rows and columns by specifying both index
and columns
parameters.
# Recreate the DataFrame for combined dropping examples
df = pd.DataFrame({'Municipality': ['Calgary', 'Edmonton', 'Strathcona County', 'Red Deer', 'Lethbridge', 'Airdrie', 'Wood Buffalo'],
'Vehicle Registrations 2020': [1007282, 725796, 79271, 75533, 73964, 56838, 55905],
'Population 2020': [1413800, 1087803, 104330, 105883, 104254, 80222, 74532]}, index=[1, 2, 3, 4, 5, 6, 7])
print('Original:')
display(df)
# Dropping rows and columns simultaneously
df.drop(index=[4, 5], columns=['Population 2020'], inplace=True)
print('After dropping specific rows and a column:')
display(df)
Original:
Municipality | Vehicle Registrations 2020 | Population 2020 | |
---|---|---|---|
1 | Calgary | 1007282 | 1413800 |
2 | Edmonton | 725796 | 1087803 |
3 | Strathcona County | 79271 | 104330 |
4 | Red Deer | 75533 | 105883 |
5 | Lethbridge | 73964 | 104254 |
6 | Airdrie | 56838 | 80222 |
7 | Wood Buffalo | 55905 | 74532 |
After dropping specific rows and a column:
Municipality | Vehicle Registrations 2020 | |
---|---|---|
1 | Calgary | 1007282 |
2 | Edmonton | 725796 |
3 | Strathcona County | 79271 |
6 | Airdrie | 56838 |
7 | Wood Buffalo | 55905 |
Example - Dropping Rows Based on Condition:
You can also drop rows based on a condition using boolean indexing.
# Recreate the DataFrame for conditional dropping examples
df = pd.DataFrame({'Municipality': ['Calgary', 'Edmonton', 'Strathcona County', 'Red Deer', 'Lethbridge', 'Airdrie', 'Wood Buffalo'],
'Vehicle Registrations 2020': [1007282, 725796, 79271, 75533, 73964, 56838, 55905],
'Population 2020': [1413800, 1087803, 104330, 105883, 104254, 80222, 74532]},
index=[1, 2, 3, 4, 5, 6, 7])
print('Original:')
display(df)
# Dropping rows where Vehicle Registrations 2020 is less than 100,000
df = df[df['Vehicle Registrations 2020'] >= 100000]
print('After dropping rows based on condition:')
display(df)
Original:
Municipality | Vehicle Registrations 2020 | Population 2020 | |
---|---|---|---|
1 | Calgary | 1007282 | 1413800 |
2 | Edmonton | 725796 | 1087803 |
3 | Strathcona County | 79271 | 104330 |
4 | Red Deer | 75533 | 105883 |
5 | Lethbridge | 73964 | 104254 |
6 | Airdrie | 56838 | 80222 |
7 | Wood Buffalo | 55905 | 74532 |
After dropping rows based on condition:
Municipality | Vehicle Registrations 2020 | Population 2020 | |
---|---|---|---|
1 | Calgary | 1007282 | 1413800 |
2 | Edmonton | 725796 | 1087803 |
For more information, check out the Pandas documentation on dropping rows and columns.
5.4.4. Adding/Modifying Columns#
New columns can be added directly by assigning a new series to a DataFrame. Existing columns can also be modified similarly. This is useful for creating new features or updating existing ones based on your analysis needs
# Setting new values, creating rows if needed
df.loc[new_row_label] = new_data
df['new_column'] = new_data
Advantages:
Convenient for adding new rows or columns without explicitly modifying the DataFrame shape.
Enables quick DataFrame expansion without explicit resizing.
Disadvantages:
Can lead to the unintentional creation of new rows or columns if not used carefully.
Might not be suitable for cases where strict control over data insertion is required.
Example:
import pandas as pd
# This example is from
# https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html
# The Air Quality NO2 dataset if from
# http://dhhagan.github.io/py-openaq/index.html
df = pd.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/main/doc/data/air_quality_no2.csv')
print(r'Air Quality NO2:')
df.head(8)
Air Quality NO2:
datetime | station_antwerp | station_paris | station_london | |
---|---|---|---|---|
0 | 2019-05-07 02:00:00 | NaN | NaN | 23.0 |
1 | 2019-05-07 03:00:00 | 50.5 | 25.0 | 19.0 |
2 | 2019-05-07 04:00:00 | 45.0 | 27.7 | 19.0 |
3 | 2019-05-07 05:00:00 | NaN | 50.4 | 16.0 |
4 | 2019-05-07 06:00:00 | NaN | 61.9 | NaN |
5 | 2019-05-07 07:00:00 | NaN | 72.4 | 26.0 |
6 | 2019-05-07 08:00:00 | NaN | 77.7 | 32.0 |
7 | 2019-05-07 09:00:00 | NaN | 67.9 | 32.0 |
The goal is to express the \(NO_2\) concentration at the London station in milligrams per cubic meter (mg/m³). This conversion is achieved under the conditions of 25 degrees Celsius and 1013 hPa pressure, using the specific conversion factor of 1.882 (Further information can be found here).
df["london_mg_per_cubic"] = df["station_london"] * 1.882
df.head(8)
datetime | station_antwerp | station_paris | station_london | london_mg_per_cubic | |
---|---|---|---|---|---|
0 | 2019-05-07 02:00:00 | NaN | NaN | 23.0 | 43.286 |
1 | 2019-05-07 03:00:00 | 50.5 | 25.0 | 19.0 | 35.758 |
2 | 2019-05-07 04:00:00 | 45.0 | 27.7 | 19.0 | 35.758 |
3 | 2019-05-07 05:00:00 | NaN | 50.4 | 16.0 | 30.112 |
4 | 2019-05-07 06:00:00 | NaN | 61.9 | NaN | NaN |
5 | 2019-05-07 07:00:00 | NaN | 72.4 | 26.0 | 48.932 |
6 | 2019-05-07 08:00:00 | NaN | 77.7 | 32.0 | 60.224 |
7 | 2019-05-07 09:00:00 | NaN | 67.9 | 32.0 | 60.224 |
For more details on modifying DataFrames, you can refer to the Pandas documentation on DataFrame operations.
Example - Daily Weather Data Report for October 2023 (Calgary Int’l Airport, Alberta):
This example is meant to assist in summarizing the above.
import pandas as pd
# Read the HTML data from the specified URL and store it as a list of DataFrames.
df_list = pd.read_html('https://climate.weather.gc.ca/climate_data/daily_data_e.html?StationID=50430')
# Select the first DataFrame from the list, which contains the climate data.
df = df_list[0]
# Display the DataFrame containing the climate data for analysis.
display(df)
DAY | Max Temp Definition °C | Min Temp Definition °C | Mean Temp Definition °C | Heat Deg Days Definition | Cool Deg Days Definition | Total Rain Definition mm | Total Snow Definition cm | Total Precip Definition mm | Snow on Grnd Definition cm | Dir of Max Gust Definition 10's deg | Spd of Max Gust Definition km/h | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01 | 21.7 | 12.3 | 17.0 | 1.0 | 0.0 | 1.6 | 0.0 | 1.6 | NaN | 32 | 64 |
1 | 02 | 20.8 | 10.0 | 15.4 | 2.6 | 0.0 | 19.8 | 0.0 | 19.8 | NaN | 2 | 53 |
2 | 03 | 17.2 | 10.3 | 13.8 | 4.2 | 0.0 | 5.2 | 0.0 | 5.2 | NaN | 32 | 54 |
3 | 04 | 21.4 | 10.2 | 15.8 | 2.2 | 0.0 | 18.4 | 0.0 | 18.4 | NaN | 36 | 46 |
4 | 05 | 23.6 | 9.6 | 16.6 | 1.4 | 0.0 | 0.4 | 0.0 | 0.4 | NaN | 12 | 36 |
5 | 06 | 23.3 | 13.6 | 18.5 | 0.0 | 0.5 | LegendTT | 0.0 | LegendTT | NaN | 1 | 38 |
6 | 07 | 26.9 | 11.6 | 19.3 | 0.0 | 1.3 | 0.0 | 0.0 | 0.0 | NaN | 16 | 40 |
7 | 08 | 29.8 | 14.0 | 21.9 | 0.0 | 3.9 | 0.0 | 0.0 | 0.0 | NaN | 33 | 35 |
8 | 09 | 29.2 | 15.6 | 22.4 | 0.0 | 4.4 | 0.0 | 0.0 | 0.0 | NaN | 14 | 47 |
9 | 10 | 34.2 | 16.8 | 25.5 | 0.0 | 7.5 | 0.0 | 0.0 | 0.0 | NaN | 25 | 33 |
10 | 11 | LegendMM | LegendMM | LegendMM | LegendMM | LegendMM | 0.0 | 0.0 | 0.0 | NaN | 35 | 45 |
11 | 12 | 24.1 | 15.2 | 19.7 | 0.0 | 1.7 | 0.0 | 0.0 | 0.0 | NaN | 9 | 64 |
12 | 13 | 28.0 | 11.5 | 19.8 | 0.0 | 1.8 | 0.0 | 0.0 | 0.0 | NaN | 4 | 42 |
13 | 14 | 22.4 | 13.8 | 18.1 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | NaN | 4 | 40 |
14 | 15 | 28.4 | 13.0 | 20.7 | 0.0 | 2.7 | 0.0 | 0.0 | 0.0 | NaN | 3 | 42 |
15 | 16 | 29.1 | 16.6 | 22.9 | 0.0 | 4.9 | 0.0 | 0.0 | 0.0 | NaN | 17 | 38 |
16 | 17 | 31.6 | 16.3 | 24.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | NaN | 15 | 34 |
17 | 18 | 32.2 | 16.0 | 24.1 | 0.0 | 6.1 | 0.0 | 0.0 | 0.0 | NaN | 23 | 44 |
18 | 19 | 32.7 | 15.8 | 24.3 | 0.0 | 6.3 | 0.0 | 0.0 | 0.0 | NaN | 34 | 49 |
19 | 20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20 | Sum | NaN | NaN | NaN | 11.4LegendCarer^ | 47.2LegendCarer^ | 45.4LegendCarer^ | 0.0LegendCarer^ | 45.4LegendCarer^ | NaN | NaN | NaN |
21 | Avg | 26.5LegendCarer^ | 13.5LegendCarer^ | 20.0LegendCarer^ | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
22 | Xtrm | 34.2LegendCarer^ | 9.6LegendCarer^ | NaN | NaN | NaN | 19.8LegendCarer^ | 0.0LegendCarer^ | 19.8LegendCarer^ | NaN | 9LegendCarer^ | 64LegendCarer^LegendSS |
23 | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... |
Dropping Columns:
Dropping columns in Pandas is accomplished using the
drop
method, specifying the columns to be removed. The primary parameters for this operation are as follows:- `labels`: A list or single label representing the column(s) to be dropped. - `axis`: Specifies the axis along which the drop operation will occur. Use `axis=1` for columns. - `inplace`: An optional parameter, when set to `True`, modifies the original DataFrame; otherwise, it returns a new DataFrame with the specified columns removed. Here is a code example to drop a column named 'ColumnName': ```python df.drop(columns='ColumnName', inplace=True) ```
Please see this link for more details.
# Remove the column 'Snow on Grnd Definition cm' from the DataFrame.
df.drop(columns='Snow on Grnd Definition cm', inplace=True)
# Display the DataFrame after removing the specified column.
display(df)
DAY | Max Temp Definition °C | Min Temp Definition °C | Mean Temp Definition °C | Heat Deg Days Definition | Cool Deg Days Definition | Total Rain Definition mm | Total Snow Definition cm | Total Precip Definition mm | Dir of Max Gust Definition 10's deg | Spd of Max Gust Definition km/h | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01 | 21.7 | 12.3 | 17.0 | 1.0 | 0.0 | 1.6 | 0.0 | 1.6 | 32 | 64 |
1 | 02 | 20.8 | 10.0 | 15.4 | 2.6 | 0.0 | 19.8 | 0.0 | 19.8 | 2 | 53 |
2 | 03 | 17.2 | 10.3 | 13.8 | 4.2 | 0.0 | 5.2 | 0.0 | 5.2 | 32 | 54 |
3 | 04 | 21.4 | 10.2 | 15.8 | 2.2 | 0.0 | 18.4 | 0.0 | 18.4 | 36 | 46 |
4 | 05 | 23.6 | 9.6 | 16.6 | 1.4 | 0.0 | 0.4 | 0.0 | 0.4 | 12 | 36 |
5 | 06 | 23.3 | 13.6 | 18.5 | 0.0 | 0.5 | LegendTT | 0.0 | LegendTT | 1 | 38 |
6 | 07 | 26.9 | 11.6 | 19.3 | 0.0 | 1.3 | 0.0 | 0.0 | 0.0 | 16 | 40 |
7 | 08 | 29.8 | 14.0 | 21.9 | 0.0 | 3.9 | 0.0 | 0.0 | 0.0 | 33 | 35 |
8 | 09 | 29.2 | 15.6 | 22.4 | 0.0 | 4.4 | 0.0 | 0.0 | 0.0 | 14 | 47 |
9 | 10 | 34.2 | 16.8 | 25.5 | 0.0 | 7.5 | 0.0 | 0.0 | 0.0 | 25 | 33 |
10 | 11 | LegendMM | LegendMM | LegendMM | LegendMM | LegendMM | 0.0 | 0.0 | 0.0 | 35 | 45 |
11 | 12 | 24.1 | 15.2 | 19.7 | 0.0 | 1.7 | 0.0 | 0.0 | 0.0 | 9 | 64 |
12 | 13 | 28.0 | 11.5 | 19.8 | 0.0 | 1.8 | 0.0 | 0.0 | 0.0 | 4 | 42 |
13 | 14 | 22.4 | 13.8 | 18.1 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 4 | 40 |
14 | 15 | 28.4 | 13.0 | 20.7 | 0.0 | 2.7 | 0.0 | 0.0 | 0.0 | 3 | 42 |
15 | 16 | 29.1 | 16.6 | 22.9 | 0.0 | 4.9 | 0.0 | 0.0 | 0.0 | 17 | 38 |
16 | 17 | 31.6 | 16.3 | 24.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 15 | 34 |
17 | 18 | 32.2 | 16.0 | 24.1 | 0.0 | 6.1 | 0.0 | 0.0 | 0.0 | 23 | 44 |
18 | 19 | 32.7 | 15.8 | 24.3 | 0.0 | 6.3 | 0.0 | 0.0 | 0.0 | 34 | 49 |
19 | 20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20 | Sum | NaN | NaN | NaN | 11.4LegendCarer^ | 47.2LegendCarer^ | 45.4LegendCarer^ | 0.0LegendCarer^ | 45.4LegendCarer^ | NaN | NaN |
21 | Avg | 26.5LegendCarer^ | 13.5LegendCarer^ | 20.0LegendCarer^ | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
22 | Xtrm | 34.2LegendCarer^ | 9.6LegendCarer^ | NaN | NaN | NaN | 19.8LegendCarer^ | 0.0LegendCarer^ | 19.8LegendCarer^ | 9LegendCarer^ | 64LegendCarer^LegendSS |
23 | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... | Summary, average and extreme values are based ... |
Dropping Rows:
Similarly, dropping rows in Pandas is achieved using the
drop
method, with the primary parameters as follows:labels
: A list or single label representing the row(s) to be dropped.axis
: Useaxis=0
(which is the default) to drop rows. You don’t need to specifyaxis
explicitly for dropping rows.inplace
: As before, setting this toTrue
will modify the original DataFrame.
To drop rows by index, you can use the following code as an example:
df.drop(index=[0, 1, 2], inplace=True)
This code will remove the rows with index 0, 1, and 2 from the DataFrame ‘df’.
Please see this link for more details.
# Remove the row with index 7 from the DataFrame.
df.drop(df.index[-1], inplace=True)
# Display the DataFrame after removing the specified row.
display(df)
DAY | Max Temp Definition °C | Min Temp Definition °C | Mean Temp Definition °C | Heat Deg Days Definition | Cool Deg Days Definition | Total Rain Definition mm | Total Snow Definition cm | Total Precip Definition mm | Dir of Max Gust Definition 10's deg | Spd of Max Gust Definition km/h | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01 | 21.7 | 12.3 | 17.0 | 1.0 | 0.0 | 1.6 | 0.0 | 1.6 | 32 | 64 |
1 | 02 | 20.8 | 10.0 | 15.4 | 2.6 | 0.0 | 19.8 | 0.0 | 19.8 | 2 | 53 |
2 | 03 | 17.2 | 10.3 | 13.8 | 4.2 | 0.0 | 5.2 | 0.0 | 5.2 | 32 | 54 |
3 | 04 | 21.4 | 10.2 | 15.8 | 2.2 | 0.0 | 18.4 | 0.0 | 18.4 | 36 | 46 |
4 | 05 | 23.6 | 9.6 | 16.6 | 1.4 | 0.0 | 0.4 | 0.0 | 0.4 | 12 | 36 |
5 | 06 | 23.3 | 13.6 | 18.5 | 0.0 | 0.5 | LegendTT | 0.0 | LegendTT | 1 | 38 |
6 | 07 | 26.9 | 11.6 | 19.3 | 0.0 | 1.3 | 0.0 | 0.0 | 0.0 | 16 | 40 |
7 | 08 | 29.8 | 14.0 | 21.9 | 0.0 | 3.9 | 0.0 | 0.0 | 0.0 | 33 | 35 |
8 | 09 | 29.2 | 15.6 | 22.4 | 0.0 | 4.4 | 0.0 | 0.0 | 0.0 | 14 | 47 |
9 | 10 | 34.2 | 16.8 | 25.5 | 0.0 | 7.5 | 0.0 | 0.0 | 0.0 | 25 | 33 |
10 | 11 | LegendMM | LegendMM | LegendMM | LegendMM | LegendMM | 0.0 | 0.0 | 0.0 | 35 | 45 |
11 | 12 | 24.1 | 15.2 | 19.7 | 0.0 | 1.7 | 0.0 | 0.0 | 0.0 | 9 | 64 |
12 | 13 | 28.0 | 11.5 | 19.8 | 0.0 | 1.8 | 0.0 | 0.0 | 0.0 | 4 | 42 |
13 | 14 | 22.4 | 13.8 | 18.1 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 4 | 40 |
14 | 15 | 28.4 | 13.0 | 20.7 | 0.0 | 2.7 | 0.0 | 0.0 | 0.0 | 3 | 42 |
15 | 16 | 29.1 | 16.6 | 22.9 | 0.0 | 4.9 | 0.0 | 0.0 | 0.0 | 17 | 38 |
16 | 17 | 31.6 | 16.3 | 24.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 15 | 34 |
17 | 18 | 32.2 | 16.0 | 24.1 | 0.0 | 6.1 | 0.0 | 0.0 | 0.0 | 23 | 44 |
18 | 19 | 32.7 | 15.8 | 24.3 | 0.0 | 6.3 | 0.0 | 0.0 | 0.0 | 34 | 49 |
19 | 20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20 | Sum | NaN | NaN | NaN | 11.4LegendCarer^ | 47.2LegendCarer^ | 45.4LegendCarer^ | 0.0LegendCarer^ | 45.4LegendCarer^ | NaN | NaN |
21 | Avg | 26.5LegendCarer^ | 13.5LegendCarer^ | 20.0LegendCarer^ | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
22 | Xtrm | 34.2LegendCarer^ | 9.6LegendCarer^ | NaN | NaN | NaN | 19.8LegendCarer^ | 0.0LegendCarer^ | 19.8LegendCarer^ | 9LegendCarer^ | 64LegendCarer^LegendSS |
In this code:
df
is your DataFrame.df.index[-1]
retrieves the index label of the last row in the DataFrame.drop
is used to remove the row with the specified index label.inplace=True
is used to modify the DataFrame in place.
Column Names: To access the column names in a Pandas DataFrame, you can use the
.columns
attribute. Here’s how to do it:# Assuming df is your DataFrame column_names = df.columns
The
df.columns
attribute returns an Index object containing the names of all the columns in the DataFrame. You can then convert this Index object to a list if needed:column_names_list = list(column_names)
Please see this link for more details.
# Retrieve and display the column names of the DataFrame.
df.columns
Index(['DAY', 'Max Temp Definition °C', 'Min Temp Definition °C',
'Mean Temp Definition °C', 'Heat Deg Days Definition',
'Cool Deg Days Definition', 'Total Rain Definition mm',
'Total Snow Definition cm', 'Total Precip Definition mm',
'Dir of Max Gust Definition 10's deg',
'Spd of Max Gust Definition km/h'],
dtype='object')
Replace each occurrence of pattern/regex (Optional Content):
The
.str.replace
method in Pandas is used to perform string replacement within each element of a Pandas Series or DataFrame column containing string values. It allows for the search and substitution of specific substrings or patterns in strings.Here are the primary parameters of the
.str.replace
method:pat
: The pattern or substring to search for.repl
: The replacement string or value.regex
: An optional boolean parameter (default isFalse
) that specifies whether the pattern provided is a regular expression.n
: An optional parameter to control the maximum number of replacements. If not specified, it replaces all occurrences by default.
Please see this link for more details.
# Iterate through each column in the DataFrame.
for Col in df.columns:
# Replace 'LegendCarer^' with an empty string in the current column.
df[Col] = df[Col].str.replace('LegendCarer^', '')
# Replace 'LegendT' with an empty string in the current column.
df[Col] = df[Col].str.replace('LegendT', '')
# Replace 'LegendM' with an empty string in the current column.
df[Col] = df[Col].str.replace('LegendM', '')
# Display the DataFrame after the specified replacements.
display(df)
DAY | Max Temp Definition °C | Min Temp Definition °C | Mean Temp Definition °C | Heat Deg Days Definition | Cool Deg Days Definition | Total Rain Definition mm | Total Snow Definition cm | Total Precip Definition mm | Dir of Max Gust Definition 10's deg | Spd of Max Gust Definition km/h | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 01 | 21.7 | 12.3 | 17.0 | 1.0 | 0.0 | 1.6 | 0.0 | 1.6 | 32 | 64 |
1 | 02 | 20.8 | 10.0 | 15.4 | 2.6 | 0.0 | 19.8 | 0.0 | 19.8 | 2 | 53 |
2 | 03 | 17.2 | 10.3 | 13.8 | 4.2 | 0.0 | 5.2 | 0.0 | 5.2 | 32 | 54 |
3 | 04 | 21.4 | 10.2 | 15.8 | 2.2 | 0.0 | 18.4 | 0.0 | 18.4 | 36 | 46 |
4 | 05 | 23.6 | 9.6 | 16.6 | 1.4 | 0.0 | 0.4 | 0.0 | 0.4 | 12 | 36 |
5 | 06 | 23.3 | 13.6 | 18.5 | 0.0 | 0.5 | T | 0.0 | T | 1 | 38 |
6 | 07 | 26.9 | 11.6 | 19.3 | 0.0 | 1.3 | 0.0 | 0.0 | 0.0 | 16 | 40 |
7 | 08 | 29.8 | 14.0 | 21.9 | 0.0 | 3.9 | 0.0 | 0.0 | 0.0 | 33 | 35 |
8 | 09 | 29.2 | 15.6 | 22.4 | 0.0 | 4.4 | 0.0 | 0.0 | 0.0 | 14 | 47 |
9 | 10 | 34.2 | 16.8 | 25.5 | 0.0 | 7.5 | 0.0 | 0.0 | 0.0 | 25 | 33 |
10 | 11 | M | M | M | M | M | 0.0 | 0.0 | 0.0 | 35 | 45 |
11 | 12 | 24.1 | 15.2 | 19.7 | 0.0 | 1.7 | 0.0 | 0.0 | 0.0 | 9 | 64 |
12 | 13 | 28.0 | 11.5 | 19.8 | 0.0 | 1.8 | 0.0 | 0.0 | 0.0 | 4 | 42 |
13 | 14 | 22.4 | 13.8 | 18.1 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 4 | 40 |
14 | 15 | 28.4 | 13.0 | 20.7 | 0.0 | 2.7 | 0.0 | 0.0 | 0.0 | 3 | 42 |
15 | 16 | 29.1 | 16.6 | 22.9 | 0.0 | 4.9 | 0.0 | 0.0 | 0.0 | 17 | 38 |
16 | 17 | 31.6 | 16.3 | 24.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 15 | 34 |
17 | 18 | 32.2 | 16.0 | 24.1 | 0.0 | 6.1 | 0.0 | 0.0 | 0.0 | 23 | 44 |
18 | 19 | 32.7 | 15.8 | 24.3 | 0.0 | 6.3 | 0.0 | 0.0 | 0.0 | 34 | 49 |
19 | 20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
20 | Sum | NaN | NaN | NaN | 11.4 | 47.2 | 45.4 | 0.0 | 45.4 | NaN | NaN |
21 | Avg | 26.5 | 13.5 | 20.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
22 | Xtrm | 34.2 | 9.6 | NaN | NaN | NaN | 19.8 | 0.0 | 19.8 | 9 | 64LegendSS |
Pandas
set_index
Method:The
set_index
method in Pandas is used to set one or more columns as the index of a DataFrame. When you set an index, it means you are designating one or more columns to serve as the row labels, providing a way to uniquely identify and access rows in the DataFrame.Here’s the primary parameter of the
set_index
method:keys
: This parameter specifies the column or columns you want to use as the new index.
df.set_index(keys, inplace=False)
df
: Refers to the Pandas DataFrame.keys
: The column or columns that you want to set as the new index. This can be a single column name or a list of column names.
# Set the 'DAY' column as the index for the DataFrame.
df.set_index('DAY', inplace=True)
# Display the DataFrame with 'DAY' as the new index.
display(df)
Max Temp Definition °C | Min Temp Definition °C | Mean Temp Definition °C | Heat Deg Days Definition | Cool Deg Days Definition | Total Rain Definition mm | Total Snow Definition cm | Total Precip Definition mm | Dir of Max Gust Definition 10's deg | Spd of Max Gust Definition km/h | |
---|---|---|---|---|---|---|---|---|---|---|
DAY | ||||||||||
01 | 21.7 | 12.3 | 17.0 | 1.0 | 0.0 | 1.6 | 0.0 | 1.6 | 32 | 64 |
02 | 20.8 | 10.0 | 15.4 | 2.6 | 0.0 | 19.8 | 0.0 | 19.8 | 2 | 53 |
03 | 17.2 | 10.3 | 13.8 | 4.2 | 0.0 | 5.2 | 0.0 | 5.2 | 32 | 54 |
04 | 21.4 | 10.2 | 15.8 | 2.2 | 0.0 | 18.4 | 0.0 | 18.4 | 36 | 46 |
05 | 23.6 | 9.6 | 16.6 | 1.4 | 0.0 | 0.4 | 0.0 | 0.4 | 12 | 36 |
06 | 23.3 | 13.6 | 18.5 | 0.0 | 0.5 | T | 0.0 | T | 1 | 38 |
07 | 26.9 | 11.6 | 19.3 | 0.0 | 1.3 | 0.0 | 0.0 | 0.0 | 16 | 40 |
08 | 29.8 | 14.0 | 21.9 | 0.0 | 3.9 | 0.0 | 0.0 | 0.0 | 33 | 35 |
09 | 29.2 | 15.6 | 22.4 | 0.0 | 4.4 | 0.0 | 0.0 | 0.0 | 14 | 47 |
10 | 34.2 | 16.8 | 25.5 | 0.0 | 7.5 | 0.0 | 0.0 | 0.0 | 25 | 33 |
11 | M | M | M | M | M | 0.0 | 0.0 | 0.0 | 35 | 45 |
12 | 24.1 | 15.2 | 19.7 | 0.0 | 1.7 | 0.0 | 0.0 | 0.0 | 9 | 64 |
13 | 28.0 | 11.5 | 19.8 | 0.0 | 1.8 | 0.0 | 0.0 | 0.0 | 4 | 42 |
14 | 22.4 | 13.8 | 18.1 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 4 | 40 |
15 | 28.4 | 13.0 | 20.7 | 0.0 | 2.7 | 0.0 | 0.0 | 0.0 | 3 | 42 |
16 | 29.1 | 16.6 | 22.9 | 0.0 | 4.9 | 0.0 | 0.0 | 0.0 | 17 | 38 |
17 | 31.6 | 16.3 | 24.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 15 | 34 |
18 | 32.2 | 16.0 | 24.1 | 0.0 | 6.1 | 0.0 | 0.0 | 0.0 | 23 | 44 |
19 | 32.7 | 15.8 | 24.3 | 0.0 | 6.3 | 0.0 | 0.0 | 0.0 | 34 | 49 |
20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Sum | NaN | NaN | NaN | 11.4 | 47.2 | 45.4 | 0.0 | 45.4 | NaN | NaN |
Avg | 26.5 | 13.5 | 20.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Xtrm | 34.2 | 9.6 | NaN | NaN | NaN | 19.8 | 0.0 | 19.8 | 9 | 64LegendSS |
DataFrame.values: To access the underlying data as a NumPy array, you can do the following:
# Access the 'Max Temp Definition °C' column in the DataFrame.
df['Max Temp Definition °C']
DAY
01 21.7
02 20.8
03 17.2
04 21.4
05 23.6
06 23.3
07 26.9
08 29.8
09 29.2
10 34.2
11 M
12 24.1
13 28.0
14 22.4
15 28.4
16 29.1
17 31.6
18 32.2
19 32.7
20 NaN
Sum NaN
Avg 26.5
Xtrm 34.2
Name: Max Temp Definition °C, dtype: object
# Retrieve the values from the 'Max Temp Definition °C' column in the DataFrame.
df['Max Temp Definition °C'].values
array(['21.7', '20.8', '17.2', '21.4', '23.6', '23.3', '26.9', '29.8',
'29.2', '34.2', 'M', '24.1', '28.0', '22.4', '28.4', '29.1',
'31.6', '32.2', '32.7', nan, nan, '26.5', '34.2'], dtype=object)
Renaming Columns and Indices: To rename columns and indices, you can use the
rename
method. Here’s an example:
# Import the pprint function for pretty printing.
from pprint import pprint
# Create a dictionary for renaming columns by removing 'Definition' from their names.
column_rename_dict = dict(zip(df.columns, [x.replace('Definition', '') for x in df.columns]))
# Print the title for the column renaming dictionary.
print("Column Renaming Dictionary:")
pprint(column_rename_dict)
# Rename the columns in the DataFrame using the created dictionary.
df.rename(columns=column_rename_dict, inplace=True)
# Rename specific index labels for clarity.
df.rename(index={'01': '1st', '02': '2nd', '03': '3rd', '04': '4th', '05': '5th'}, inplace=True)
# Print the title for the DataFrame with renamed columns and index labels.
print("\nDataFrame with Renamed Columns and Index Labels:")
display(df)
Column Renaming Dictionary:
{'Cool Deg Days Definition': 'Cool Deg Days ',
"Dir of Max Gust Definition 10's deg": "Dir of Max Gust 10's deg",
'Heat Deg Days Definition': 'Heat Deg Days ',
'Max Temp Definition °C': 'Max Temp °C',
'Mean Temp Definition °C': 'Mean Temp °C',
'Min Temp Definition °C': 'Min Temp °C',
'Spd of Max Gust Definition km/h': 'Spd of Max Gust km/h',
'Total Precip Definition mm': 'Total Precip mm',
'Total Rain Definition mm': 'Total Rain mm',
'Total Snow Definition cm': 'Total Snow cm'}
DataFrame with Renamed Columns and Index Labels:
Max Temp °C | Min Temp °C | Mean Temp °C | Heat Deg Days | Cool Deg Days | Total Rain mm | Total Snow cm | Total Precip mm | Dir of Max Gust 10's deg | Spd of Max Gust km/h | |
---|---|---|---|---|---|---|---|---|---|---|
DAY | ||||||||||
1st | 21.7 | 12.3 | 17.0 | 1.0 | 0.0 | 1.6 | 0.0 | 1.6 | 32 | 64 |
2nd | 20.8 | 10.0 | 15.4 | 2.6 | 0.0 | 19.8 | 0.0 | 19.8 | 2 | 53 |
3rd | 17.2 | 10.3 | 13.8 | 4.2 | 0.0 | 5.2 | 0.0 | 5.2 | 32 | 54 |
4th | 21.4 | 10.2 | 15.8 | 2.2 | 0.0 | 18.4 | 0.0 | 18.4 | 36 | 46 |
5th | 23.6 | 9.6 | 16.6 | 1.4 | 0.0 | 0.4 | 0.0 | 0.4 | 12 | 36 |
06 | 23.3 | 13.6 | 18.5 | 0.0 | 0.5 | T | 0.0 | T | 1 | 38 |
07 | 26.9 | 11.6 | 19.3 | 0.0 | 1.3 | 0.0 | 0.0 | 0.0 | 16 | 40 |
08 | 29.8 | 14.0 | 21.9 | 0.0 | 3.9 | 0.0 | 0.0 | 0.0 | 33 | 35 |
09 | 29.2 | 15.6 | 22.4 | 0.0 | 4.4 | 0.0 | 0.0 | 0.0 | 14 | 47 |
10 | 34.2 | 16.8 | 25.5 | 0.0 | 7.5 | 0.0 | 0.0 | 0.0 | 25 | 33 |
11 | M | M | M | M | M | 0.0 | 0.0 | 0.0 | 35 | 45 |
12 | 24.1 | 15.2 | 19.7 | 0.0 | 1.7 | 0.0 | 0.0 | 0.0 | 9 | 64 |
13 | 28.0 | 11.5 | 19.8 | 0.0 | 1.8 | 0.0 | 0.0 | 0.0 | 4 | 42 |
14 | 22.4 | 13.8 | 18.1 | 0.0 | 0.1 | 0.0 | 0.0 | 0.0 | 4 | 40 |
15 | 28.4 | 13.0 | 20.7 | 0.0 | 2.7 | 0.0 | 0.0 | 0.0 | 3 | 42 |
16 | 29.1 | 16.6 | 22.9 | 0.0 | 4.9 | 0.0 | 0.0 | 0.0 | 17 | 38 |
17 | 31.6 | 16.3 | 24.0 | 0.0 | 6.0 | 0.0 | 0.0 | 0.0 | 15 | 34 |
18 | 32.2 | 16.0 | 24.1 | 0.0 | 6.1 | 0.0 | 0.0 | 0.0 | 23 | 44 |
19 | 32.7 | 15.8 | 24.3 | 0.0 | 6.3 | 0.0 | 0.0 | 0.0 | 34 | 49 |
20 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Sum | NaN | NaN | NaN | 11.4 | 47.2 | 45.4 | 0.0 | 45.4 | NaN | NaN |
Avg | 26.5 | 13.5 | 20.0 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Xtrm | 34.2 | 9.6 | NaN | NaN | NaN | 19.8 | 0.0 | 19.8 | 9 | 64LegendSS |