Remark
Please be aware that these lecture notes are accessible online in an ‘early access’ format. They are actively being developed, and certain sections will be further enriched to provide a comprehensive understanding of the subject matter.
3.7. Lab: Utilizing GeoPandas Package with Vector Data Model#
Geographic Information Systems (GIS) provide powerful tools for analyzing and visualizing spatial data. In this section, we’ll explore Alberta’s economic regions using Python’s GeoPandas library, demonstrating how to work with vector data and create informative visualizations.
3.7.1. Alberta’s Economic Regions Data#
Alberta is divided into eight economic regions, which are groupings of census divisions defined by Statistics Canada [Government of Alberta, 2024]. These regions are:
Lethbridge–Medicine Hat
Camrose–Drumheller
Calgary
Banff–Jasper–Rocky Mountain House
Red Deer
Edmonton
Athabasca–Grande Prairie–Peace River
Wood Buffalo–Cold Lake
These economic regions cover the entire province and are used for various statistical and analytical purposes [Government of Alberta, 2024].
3.7.2. Obtaining the Shapefile#
To work with Alberta’s economic regions in GeoPandas, we need a shapefile containing the geographical boundaries of these regions. The Alberta government provides geospatial data resources that include these boundaries [Government of Alberta, 2024]. The shapefile for economic regions can be found in the Alberta Census Boundaries package, which is updated every 5 years with the census [Government of Alberta, 2024].
You can download the shapefile from the Alberta government’s open data portal. Boundary files are also published on the Government of Canada’s open data portal, but the Alberta-specific data is more appropriate for this analysis. The Alberta government’s geospatial resources include:
Economic Regions (ER)
Census Divisions (CD)
Census Subdivisions (CSD)
Census Metropolitan Areas (CMA)
Census Agglomerations (CA)
Census Tracts (CT)
Dissemination Areas (DA)
These files are typically available in shapefile (SHP) format, which is compatible with GeoPandas.
3.7.3. Setting Up the Environment#
Before we begin our analysis, we need to import the necessary libraries. These libraries are essential for handling geospatial data and performing GIS operations in Python:
pandas: This powerful data manipulation library provides data structures for efficiently storing and analyzing large datasets. It offers tools for reading and writing data in various formats, data cleaning, merging, and aggregation. In GIS analysis, pandas is often used for handling attribute data associated with spatial features.
shapely: This library is specifically designed for manipulating and analyzing geometric objects in the Cartesian plane. It provides a set of geometric operations like intersections, unions, and buffering. In GIS work, shapely is crucial for working with the geometry of spatial features, such as points, lines, and polygons.
geopandas: This library extends the capabilities of pandas to work with geospatial data. It combines the data analysis tools of pandas with the geometric operations of shapely. GeoPandas introduces the GeoDataFrame, which is like a pandas DataFrame but with a special column for storing geometry objects. This allows for easy integration of spatial operations with attribute-based analysis.
matplotlib: This comprehensive library for creating static, animated, and interactive visualizations includes:
pyplot: The main plotting interface
colors: Tools for color mapping and manipulation
patches: Objects for adding shapes to plots
Package | Description |
---|---|
pandas | Data manipulation and analysis library |
shapely | Library for manipulation and analysis of geometric objects |
geopandas | Extension of pandas for geospatial data |
matplotlib | Comprehensive plotting library |
3.7.4. Loading the Data#
To begin our analysis of Alberta’s economic regions, we’ll import the data from a GeoPackage file using GeoPandas:
import geopandas as gpd
# Load the GeoPackage file
gdf = gpd.read_file('../data/Alberta_Economic_Region_2021.gpkg')
# Display information about the GeoDataFrame
print(f"Type of object: {type(gdf)}")
print(f"Shape of the dataset: {gdf.shape}")
print(f"Coordinate Reference System (CRS): {gdf.crs}")
Type of object: <class 'geopandas.geodataframe.GeoDataFrame'>
Shape of the dataset: (8, 8)
Coordinate Reference System (CRS): EPSG:3400
This code snippet accomplishes several key tasks:
Data Import: The gpd.read_file() function is used to load the GeoPackage file containing Alberta’s economic region data. GeoPackage is a popular open format for geospatial data.
Object Type: The output confirms that we’ve created a GeoDataFrame, which is a GeoPandas data structure that extends pandas DataFrame to handle geospatial data.
Dataset Shape: The shape (8, 8) indicates that our dataset contains 8 rows and 8 columns. Each row likely represents one of Alberta’s economic regions, while the columns contain various attributes and the geometry information.
Coordinate Reference System (CRS): The CRS is EPSG:3400, which is the Alberta 10-TM (Forest) projection. This is a specific projection optimized for Alberta’s geography, ensuring accurate spatial representations and calculations within the province.
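The same three checks (object type, shape, CRS) can be tried without the data file. The following is a minimal sketch that builds a toy GeoDataFrame in memory; the two rectangles and the names "Region A" and "Region B" are hypothetical stand-ins, not Alberta's real boundaries.

```python
import geopandas as gpd
from shapely.geometry import box

# Two hypothetical rectangular "regions" with coordinates in metres.
# They only stand in for real data so the inspection calls can be tried.
toy = gpd.GeoDataFrame(
    {"ERNAME": ["Region A", "Region B"]},
    geometry=[box(0, 0, 1000, 1000), box(1000, 0, 3000, 1000)],
    crs="EPSG:3400",
)

print(type(toy).__name__)   # class name of the object
print(toy.shape)            # (rows, columns); geometry counts as a column
print(toy.crs.to_epsg())    # numeric EPSG code of the CRS
```

Note that the geometry column is included in the column count, which is why a dataset with seven attribute columns reports a shape of (8, 8).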
3.7.5. Exploring the Data#
Now, let’s delve deeper into the contents of our GeoDataFrame:
3.7.5.1. Dataset Columns#
# Display column names
print("Columns in the dataset:")
print(gdf.columns.tolist())
Columns in the dataset:
['ERUID', 'DGUID', 'ERNAME', 'LANDAREA', 'PRUID', 'Shape_Leng', 'Shape_Area', 'geometry']
This code snippet reveals the structure of our GeoDataFrame by listing all column names:
ERUID: Economic Region Unique Identifier, likely a unique code for each economic region.
DGUID: Dissemination Geography Unique Identifier, used in Statistics Canada’s data organization system.
ERNAME: The name of the economic region.
LANDAREA: Represents the land area of each region, typically in square kilometers.
PRUID: Province Unique Identifier, which should be consistent for all rows as they’re all in Alberta.
Shape_Leng: The length of the shape boundary, usually in the units of the CRS (meters for EPSG:3400).
Shape_Area: The area of the shape, usually in square units of the CRS.
geometry: The geometric representation of each region, used for spatial operations and mapping.
Understanding these columns is crucial for our analysis:
Identifier columns (ERUID, DGUID, PRUID) help in joining this data with other datasets.
ERNAME allows us to work with human-readable region names.
LANDAREA provides a quick reference for the size of each region.
Shape_Leng and Shape_Area offer precise measurements of the regions’ boundaries and areas.
The geometry column is essential for all spatial operations and visualizations.
3.7.5.2. Dataset Rows#
# Show the rows of the dataset
print("The rows of the dataset:")
display(gdf)
The rows of the dataset:
| ERUID | DGUID | ERNAME | LANDAREA | PRUID | Shape_Leng | Shape_Area | geometry |
---|---|---|---|---|---|---|---|---|
0 | 4810 | 2021S05004810 | Lethbridge--Medicine Hat | 51458.9291 | 48 | 1.440439e+06 | 5.324693e+10 | MULTIPOLYGON (((703274.537 5666369.084, 703265... |
1 | 4820 | 2021S05004820 | Camrose--Drumheller | 76750.2173 | 48 | 1.852875e+06 | 8.099823e+10 | MULTIPOLYGON (((660774.759 5983545.507, 660866... |
2 | 4830 | 2021S05004830 | Calgary | 12614.1750 | 48 | 6.908214e+05 | 1.286447e+10 | MULTIPOLYGON (((508012.266 5746445.996, 509027... |
3 | 4840 | 2021S05004840 | Banff--Jasper--Rocky Mountain House | 74007.8998 | 48 | 2.558130e+06 | 7.489139e+10 | MULTIPOLYGON (((174529.412 5986604.373, 174904... |
4 | 4850 | 2021S05004850 | Red Deer | 9890.0547 | 48 | 5.468708e+05 | 1.027202e+10 | MULTIPOLYGON (((531544.883 5857499.139, 531665... |
5 | 4860 | 2021S05004860 | Edmonton | 15746.4155 | 48 | 8.824476e+05 | 1.642910e+10 | MULTIPOLYGON (((585588.205 5981102.795, 587177... |
6 | 4870 | 2021S05004870 | Athabasca--Grande Prairie--Peace River | 268301.4042 | 48 | 2.605660e+06 | 2.767169e+11 | MULTIPOLYGON (((557650.763 6522880.284, 557713... |
7 | 4880 | 2021S05004880 | Wood Buffalo--Cold Lake | 125889.1713 | 48 | 2.182276e+06 | 1.372955e+11 | MULTIPOLYGON (((778998.869 6654017.598, 779055... |
This code displays all rows of our GeoDataFrame, providing a comprehensive overview of Alberta’s economic regions:
ERUID (Economic Region Unique Identifier):
Ranges from 4810 to 4880
Unique code for each economic region
DGUID (Dissemination Geography Unique Identifier):
Format: “2021S0500XXXX”
Used in Statistics Canada’s data organization system
ERNAME (Economic Region Name):
Lists all eight economic regions in Alberta
Provides human-readable identifiers for each region
LANDAREA:
Represents land area in square kilometers
Ranges from 9,890 km² (Red Deer) to 268,301 km² (Athabasca–Grande Prairie–Peace River)
Allows for size comparisons between regions
PRUID (Province Unique Identifier):
Consistently “48” for all rows, representing Alberta
Shape_Leng:
Length of each region’s boundary in meters
Useful for perimeter calculations and comparisons
Shape_Area:
Area of each region in square meters
Provides precise area measurements, complementing the LANDAREA column
geometry:
MULTIPOLYGON objects representing the geographical boundaries
Essential for spatial operations and mapping
Key observations:
The dataset covers all of Alberta’s economic regions, providing a complete picture of the province’s economic geography.
There’s significant variation in land area between regions, with Athabasca–Grande Prairie–Peace River being the largest and Red Deer the smallest.
The geometry column allows for advanced spatial analysis and visualization of these regions.
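To relate the two area columns, Shape_Area (square meters) can be converted to square kilometers and compared with LANDAREA. The sketch below uses three pairs of values copied from the table above. The explanation that the gap reflects water bodies excluded from LANDAREA is an assumption, not something the dataset itself states.

```python
# Values copied from the table above: (LANDAREA in km², Shape_Area in m²)
regions = {
    "Calgary": (12614.1750, 1.286447e10),
    "Red Deer": (9890.0547, 1.027202e10),
    "Edmonton": (15746.4155, 1.642910e10),
}

M2_PER_KM2 = 1_000_000  # 1 km² = 1,000 m x 1,000 m

# Convert Shape_Area to km² and compute the gap to LANDAREA
shape_km2 = {name: m2 / M2_PER_KM2 for name, (_, m2) in regions.items()}
difference = {name: shape_km2[name] - regions[name][0] for name in regions}

for name in regions:
    print(f"{name}: Shape_Area = {shape_km2[name]:,.2f} km², "
          f"LANDAREA = {regions[name][0]:,.2f} km², "
          f"difference = {difference[name]:,.2f} km²")
```

In all three cases Shape_Area is slightly larger than LANDAREA, so the two columns should not be treated as interchangeable.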
3.7.5.3. Understanding the Geometry Column#
In our GeoDataFrame (gdf), each entry under the geometry column is represented as a MULTIPOLYGON. Understanding this geometry type is crucial for effectively working with spatial data. To see how MULTIPOLYGONs are represented in our GeoDataFrame, we can display an example entry:
# Display the geometry of the Calgary economic region
display(gdf.geometry[2])
This command shows the geometric representation of the Calgary economic region in our dataset. In a Jupyter notebook, display() renders the MULTIPOLYGON as a small outline of the region; to see the coordinates that define its shape as text, use print(gdf.geometry[2]) instead.
Use Cases:
MULTIPOLYGONs are particularly useful for representing geographical features that consist of multiple parts. For example, they can represent regions that include islands or complex boundaries.
In our dataset (gdf), each economic region in Alberta is represented as a MULTIPOLYGON, which is essential for accurately modeling the spatial characteristics of these regions.
Practical Implications:
When analyzing Alberta’s economic regions using GeoPandas, recognizing that each region is represented as a MULTIPOLYGON helps us understand how to manipulate and visualize these geometries effectively.
Operations such as area calculations, spatial joins, and visualizations will rely on the properties of MULTIPOLYGONs to ensure accurate results.
For instance, when we calculate the area of each economic region or visualize them on a map, the MULTIPOLYGON representation allows us to account for all parts of the region accurately.
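How a MULTIPOLYGON accounts for all of its parts can be seen in a small shapely sketch. The "mainland" and "island" rectangles are hypothetical shapes chosen so the arithmetic is easy to check by hand.

```python
from shapely.geometry import MultiPolygon, box

mainland = box(0, 0, 4, 3)  # 4 x 3 rectangle: area 12, perimeter 14
island = box(6, 1, 7, 2)    # 1 x 1 square: area 1, perimeter 4

# One feature made of two disjoint parts
region = MultiPolygon([mainland, island])

print(region.geom_type)    # MultiPolygon
print(len(region.geoms))   # 2
print(region.area)         # 13.0 (sum of both parts)
print(region.length)       # 18.0 (sum of both perimeters)
```

Area and length are summed over every part, which is why a region stored as a single MULTIPOLYGON row still yields one correct total.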
3.7.5.4. explore method#
# Create an interactive map
gdf.explore("ERNAME")
Explanation:
Purpose: This code generates an interactive map where each economic region is color-coded based on its name (ERNAME).
Color Mapping: The explore() method automatically assigns a unique color to each distinct value in the "ERNAME" column, making it easy to visually distinguish between different economic regions.
Tooltips: When hovering over a region, a tooltip will appear showing the region’s name along with other attributes.
Interactive Features: Users can zoom in/out, pan across the map, and interact with individual regions.
Legend: A legend is automatically included, displaying the color associated with each region name.
Use Case: This visualization is particularly useful for:
Quickly identifying the geographical location of each named economic region
Understanding the spatial distribution and relative sizes of different regions
Facilitating comparisons between regions based on their names and locations
For more advanced customization options and additional parameters of the explore() method, refer to the official GeoPandas documentation:
GeoPandas GeoDataFrame.explore Documentation
3.7.6. Subsetting the Data#
We can create subsets of our data based on specific conditions, which is useful for focusing on particular aspects of Alberta’s economic regions. Let’s explore various ways to subset our GeoDataFrame:
3.7.6.1. Selecting Specific Columns#
To focus on particular attributes, we can select specific columns:
# Select the name, land area, and geometry columns
gdf_subset = gdf[['ERNAME', 'LANDAREA', 'geometry']]
display(gdf_subset)
| ERNAME | LANDAREA | geometry |
---|---|---|---|
0 | Lethbridge--Medicine Hat | 51458.9291 | MULTIPOLYGON (((703274.537 5666369.084, 703265... |
1 | Camrose--Drumheller | 76750.2173 | MULTIPOLYGON (((660774.759 5983545.507, 660866... |
2 | Calgary | 12614.1750 | MULTIPOLYGON (((508012.266 5746445.996, 509027... |
3 | Banff--Jasper--Rocky Mountain House | 74007.8998 | MULTIPOLYGON (((174529.412 5986604.373, 174904... |
4 | Red Deer | 9890.0547 | MULTIPOLYGON (((531544.883 5857499.139, 531665... |
5 | Edmonton | 15746.4155 | MULTIPOLYGON (((585588.205 5981102.795, 587177... |
6 | Athabasca--Grande Prairie--Peace River | 268301.4042 | MULTIPOLYGON (((557650.763 6522880.284, 557713... |
7 | Wood Buffalo--Cold Lake | 125889.1713 | MULTIPOLYGON (((778998.869 6654017.598, 779055... |
This code demonstrates how to create a subset of the original GeoDataFrame by selecting specific columns:
Column Selection:
We’ve chosen ‘ERNAME’ (Economic Region Name), ‘LANDAREA’, and ‘geometry’.
This selection focuses on the essential information for spatial analysis of the regions.
Resulting GeoDataFrame:
The new gdf_subset contains only these three columns for all eight economic regions.
It retains the spatial information (geometry) necessary for mapping and spatial operations.
Benefits of Subsetting:
Simplifies the dataset by removing unnecessary columns.
Focuses the analysis on specific attributes of interest.
Reduces memory usage when working with large datasets.
Data Overview:
ERNAME: Provides the names of all eight economic regions.
LANDAREA: Shows the land area of each region in square kilometers.
geometry: Retains the MULTIPOLYGON objects for spatial analysis and visualization.
Use Cases:
This subset is ideal for analyses that focus on the size and spatial distribution of economic regions.
It can be used for creating maps that show the relative sizes of regions.
The simplified structure makes it easier to perform calculations or comparisons based on land area.
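One detail worth checking when selecting columns is whether the geometry column is kept. The sketch below, using a hypothetical two-row GeoDataFrame, suggests that a selection including geometry stays a GeoDataFrame, while a selection without it falls back to a plain pandas DataFrame and loses spatial methods such as explore().

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Hypothetical two-region dataset, used only to illustrate column selection
toy = gpd.GeoDataFrame(
    {"ERNAME": ["Region A", "Region B"], "LANDAREA": [1.0, 2.0]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 3, 1)],
    crs="EPSG:3400",
)

with_geometry = toy[["ERNAME", "geometry"]]    # geometry kept
attributes_only = toy[["ERNAME", "LANDAREA"]]  # geometry dropped

print(type(with_geometry).__name__)
print(type(attributes_only).__name__)
```

Keeping geometry in the selection, as gdf_subset does, is what allows the subset to be mapped later.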
3.7.6.2. Filtering Rows Based on Conditions#
We can filter rows based on specific criteria:
# Filter for regions with land area greater than 100,000 km²
large_regions = gdf[gdf['LANDAREA'] > 100000]
display(large_regions[['ERNAME', 'LANDAREA']])
large_regions.explore('ERNAME', tooltip = ['ERNAME', 'LANDAREA'])
| ERNAME | LANDAREA |
---|---|---|
6 | Athabasca--Grande Prairie--Peace River | 268301.4042 |
7 | Wood Buffalo--Cold Lake | 125889.1713 |
This code demonstrates how to filter the GeoDataFrame based on a specific condition:
Filtering Condition:
We’re selecting regions where the ‘LANDAREA’ is greater than 100,000 km².
This condition identifies the largest economic regions in Alberta.
Result of Filtering:
Two regions meet this criterion:
Athabasca–Grande Prairie–Peace River (268,301.4042 km²)
Wood Buffalo–Cold Lake (125,889.1713 km²)
These are the two largest economic regions in Alberta by land area.
Significance:
This filtering technique is useful for focusing on specific subsets of data based on quantitative criteria.
In this case, it highlights the two regions that together cover roughly 62% of the land area listed in the dataset (about 394,191 of 634,658 km²).
Potential Applications:
Analyzing resource distribution in large, potentially less densely populated areas.
Studying land management practices across extensive territories.
Comparing economic activities and challenges in large vs. smaller regions.
3.7.6.3. Selecting a Specific Region#
To focus on a single region:
# Filter for a specific economic region
calgary_region = gdf[gdf['ERNAME'] == 'Calgary']
calgary_region.explore('ERNAME', cmap = 'Greens')
This code demonstrates how to isolate and visualize a specific economic region from the GeoDataFrame.
3.7.6.4. Combining Multiple Conditions#
We can use multiple conditions to create more complex subsets:
# Select regions with land area between 10,000 and 50,000 km²
medium_regions = gdf[(gdf['LANDAREA'] > 10000) & (gdf['LANDAREA'] < 50000)]
display(medium_regions[['ERNAME', 'LANDAREA']])
medium_regions.explore('ERNAME', cmap = 'bwr', tooltip = ['ERNAME', 'LANDAREA'])
| ERNAME | LANDAREA |
---|---|---|
2 | Calgary | 12614.1750 |
5 | Edmonton | 15746.4155 |
This code demonstrates how to apply multiple conditions to filter the GeoDataFrame and visualize the results:
Complex Filtering:
The condition (gdf['LANDAREA'] > 10000) & (gdf['LANDAREA'] < 50000) selects regions with land areas between 10,000 and 50,000 km².
The & operator combines two conditions, requiring both to be true.
Significance of This Approach:
Allows for more nuanced data selection based on multiple criteria.
Useful for identifying regions with specific characteristics (in this case, medium-sized areas).
Potential Applications:
Analyzing characteristics of medium-sized economic regions.
Comparing economic indicators or population distribution in regions of similar size.
Identifying patterns or trends specific to regions within a certain size range.
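The combined condition can be checked on the attribute values alone. This sketch uses a plain pandas DataFrame with the ERNAME and LANDAREA values copied from the table earlier in this section; the geometry is left out because the filter does not depend on it.

```python
import pandas as pd

# ERNAME and LANDAREA values copied from the dataset table above
regions = pd.DataFrame({
    "ERNAME": [
        "Lethbridge--Medicine Hat", "Camrose--Drumheller", "Calgary",
        "Banff--Jasper--Rocky Mountain House", "Red Deer", "Edmonton",
        "Athabasca--Grande Prairie--Peace River", "Wood Buffalo--Cold Lake",
    ],
    "LANDAREA": [
        51458.9291, 76750.2173, 12614.1750, 74007.8998,
        9890.0547, 15746.4155, 268301.4042, 125889.1713,
    ],
})

# Each comparison needs its own parentheses because & binds more
# tightly than > and < in Python.
mask = (regions["LANDAREA"] > 10000) & (regions["LANDAREA"] < 50000)
medium = regions[mask]

print(medium)
```

Only Calgary and Edmonton fall strictly between the two thresholds: Red Deer is just under 10,000 km² and Lethbridge--Medicine Hat is just over 50,000 km².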
3.7.6.5. Changing the Coordinate Reference System (CRS)#
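Before performing spatial operations, it’s often necessary to ensure our data is in an appropriate CRS. Let’s convert our GeoDataFrame to EPSG:4326 (WGS84), a widely used geographic coordinate system that stores coordinates as longitude and latitude in degrees rather than projected meters: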
# Check the current CRS
print("Current CRS:", gdf.crs)
# Convert to EPSG:4326 (WGS84, a common geographic coordinate system)
gdf_wgs84 = gdf.to_crs(epsg=4326)
# Verify the new CRS
print("New CRS:", gdf_wgs84.crs)
# Display the region names and geometries to see the change in coordinates
display(gdf_wgs84[['ERNAME', 'geometry']])
Current CRS: EPSG:3400
New CRS: EPSG:4326
| ERNAME | geometry |
---|---|---|
0 | Lethbridge--Medicine Hat | MULTIPOLYGON (((-112.09335 51.13304, -112.0934... |
1 | Camrose--Drumheller | MULTIPOLYGON (((-112.54626 53.99669, -112.5448... |
2 | Calgary | MULTIPOLYGON (((-114.88353 51.88943, -114.8687... |
3 | Banff--Jasper--Rocky Mountain House | MULTIPOLYGON (((-119.96281 53.94668, -119.9569... |
4 | Red Deer | MULTIPOLYGON (((-114.53099 52.88737, -114.5292... |
5 | Edmonton | MULTIPOLYGON (((-113.69397 53.99268, -113.6697... |
6 | Athabasca--Grande Prairie--Peace River | MULTIPOLYGON (((-114.00002 58.86537, -114.0000... |
7 | Wood Buffalo--Cold Lake | MULTIPOLYGON (((-110 59.95258, -110 59.94582, ... |
This code does the following:
Displays the current CRS of our GeoDataFrame.
Converts the GeoDataFrame to EPSG:4326 (WGS84), a widely used geographic coordinate system.
Verifies the new CRS.
Displays the region names and geometries to show how the coordinates have changed.
Converting to EPSG:4326 is useful because:
It’s a standard geographic coordinate system used worldwide.
It’s compatible with many mapping services and GIS software.
It allows for easier integration with other global datasets.
After this conversion, we can proceed with spatial subsetting operations, which will now use latitude and longitude coordinates.
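The effect of to_crs() can be seen on a single point. The following sketch reprojects the Edmonton coordinates used later in this section from EPSG:4326 to EPSG:3400 and back again; the variable names are illustrative only.

```python
import geopandas as gpd
from shapely.geometry import Point

# Longitude first, then latitude (x, y order)
edmonton_lonlat = gpd.GeoSeries([Point(-113.490112, 53.545883)], crs="EPSG:4326")

# Reproject to Alberta 10-TM (metres), then back to degrees
edmonton_10tm = edmonton_lonlat.to_crs(epsg=3400)
round_trip = edmonton_10tm.to_crs(epsg=4326)

print(edmonton_10tm.iloc[0])  # coordinates in metres
print(round_trip.iloc[0])     # coordinates in degrees again
```

Because degrees are not a unit of length, measurements such as area and perimeter are generally better computed in a projected CRS like EPSG:3400, with EPSG:4326 reserved for web mapping and for combining data with other longitude/latitude sources.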
3.7.6.6. Spatial Subsetting#
Now that our data is in a geographic coordinate system, we can perform spatial subsetting:
# Create a point geometry for Edmonton's city center
from shapely.geometry import Point
edmonton_point = Point(-113.490112, 53.545883)
# Find the region that contains Edmonton
edmonton_region = gdf_wgs84[gdf_wgs84.contains(edmonton_point)]
print("The economic region containing Edmonton:")
print(edmonton_region['ERNAME'].values[0])
# Visualize the result using explore
m = gdf_wgs84.explore(color='thistle', tooltip=['ERNAME'])
edmonton_region.explore(color='orangered', m=m, tooltip=['ERNAME'])
The economic region containing Edmonton:
Edmonton
This code does the following:
Create Edmonton point:
Use Point(-113.490112, 53.545883) to represent Edmonton’s location
Spatial subsetting:
gdf_wgs84[gdf_wgs84.contains(edmonton_point)] finds the region containing Edmonton
Display result:
Print the name of the region containing Edmonton
Visualize:
Create base map: gdf_wgs84.explore(color='thistle')
Highlight Edmonton’s region: edmonton_region.explore(color='orangered', m=m)
Add tooltips to show region names on hover
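The behavior of contains() is easiest to verify on shapes whose answer is obvious. This sketch uses two hypothetical adjacent squares named "West" and "East" and also shows an edge case: a point lying exactly on a shared boundary is not contained by either polygon, whereas covers() accepts it.

```python
import geopandas as gpd
from shapely.geometry import Point, box

# Two hypothetical adjacent squares sharing the edge x = 2
toy = gpd.GeoDataFrame(
    {"ERNAME": ["West", "East"]},
    geometry=[box(0, 0, 2, 2), box(2, 0, 4, 2)],
    crs="EPSG:4326",
)

inside = Point(1, 1)   # strictly inside "West"
on_edge = Point(2, 1)  # exactly on the shared boundary

inside_hits = toy[toy.contains(inside)]["ERNAME"].tolist()
edge_contains = toy[toy.contains(on_edge)]["ERNAME"].tolist()
edge_covers = toy[toy.covers(on_edge)]["ERNAME"].tolist()

print(inside_hits)    # ['West']
print(edge_contains)  # []
print(edge_covers)    # ['West', 'East']
```

Since the filter can return no rows, it may be safer to check that the result is not empty before indexing it with .values[0].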
3.7.7. Plotting with GeoDataFrame’s plot() method#
We can create static maps using the plot() method of our GeoDataFrame. This method is based on matplotlib and offers various customization options:
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap
import matplotlib.patheffects as pe
# Create a custom colormap
colors = ['#FFFFCC', '#FFEDA0', '#FED976', '#FEB24C',
          '#FD8D3C', '#FC4E2A', '#E31A1C', '#BD0026', '#800026']
n_bins = len(colors)
cmap = LinearSegmentedColormap.from_list('custom_YlOrRd', colors, N=n_bins)
# Create a basic plot of all regions
fig, ax = plt.subplots(figsize=(8, 8))
gdf.plot(ax=ax, edgecolor='black', linewidth=0.5, column='LANDAREA', cmap=cmap)
# Add a color bar
sm = plt.cm.ScalarMappable(cmap=cmap, norm=plt.Normalize(
    vmin=gdf['LANDAREA'].min(), vmax=gdf['LANDAREA'].max()))
sm.set_array([])  # Empty array; older matplotlib versions need one before drawing a colorbar
cbar = fig.colorbar(sm, ax=ax, fraction=0.03, pad=0.04)
cbar.set_label('Land Area (km²)', rotation=270, labelpad=15)
# Add title and remove axes
ax.set_title('Alberta Economic Regions - Land Area',
             fontsize=14, fontweight='bold', pad=0, y=1)
ax.axis('off')
# Add text labels for each region with white outline
for idx, row in gdf.iterrows():
    ax.annotate(text=row['ERNAME'], xy=(row.geometry.centroid.x, row.geometry.centroid.y),
                xytext=(0, 0), textcoords="offset points", fontsize=10, ha='center', va='center',
                fontweight='bold', color='black',
                path_effects=[pe.withStroke(linewidth=3, foreground='white')])
plt.tight_layout()
This code creates a map of Alberta’s economic regions, color-coded by land area:
Import libraries and set custom colors.
Plot the GeoDataFrame with ‘LANDAREA’ for color-coding.
Add a color bar to represent land area values.
Set title and remove axes.
Add region names as text labels with white outlines.
Optimize layout with tight_layout().
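A shorter variant of the same idea is sketched below. Instead of building the color bar by hand, it relies on the legend=True and legend_kwds options of plot(), and it uses the built-in 'YlOrRd' colormap rather than the custom one. The three rectangles, their names, and their LANDAREA values are hypothetical, and labels are placed with representative_point(), which always falls inside the polygon (a centroid may not for irregular shapes).

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import geopandas as gpd
from shapely.geometry import box

# Hypothetical toy regions with made-up land areas
toy = gpd.GeoDataFrame(
    {"ERNAME": ["A", "B", "C"], "LANDAREA": [10.0, 20.0, 40.0]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(0, 1, 2, 2)],
    crs="EPSG:3400",
)

fig, ax = plt.subplots(figsize=(6, 6))

# legend=True asks GeoPandas to add the color bar for a numeric column
toy.plot(ax=ax, column="LANDAREA", cmap="YlOrRd", edgecolor="black",
         linewidth=0.5, legend=True,
         legend_kwds={"label": "Land Area (km²)"})

# Label each region at a point guaranteed to lie inside it
for name, geom in zip(toy["ERNAME"], toy.geometry):
    pt = geom.representative_point()
    ax.annotate(name, xy=(pt.x, pt.y), ha="center", va="center")

ax.set_title("Toy regions - Land Area")
ax.axis("off")

# Render to an in-memory PNG instead of writing a file
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=100)
```

The manual ScalarMappable approach gives finer control over the color bar's size and placement, while this variant trades that control for less code.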