4.3. Advanced NumPy Concepts

4.3.1. Data types

Understanding NumPy data types is essential for working efficiently with numerical data and for controlling the precision of mathematical operations. This section gives an overview of the available types.

NumPy data types are characterized by their names, which typically consist of a root type name followed by a number representing the number of bits used to store each item of the array. For example, int32 indicates a 32-bit integer data type. Here are the main categories of NumPy data types [Harris et al., 2020, NumPy Developers, 2023]:

  1. Integer Types:

    • int8, int16, int32, int64: Signed integers with varying bit depths.

    • uint8, uint16, uint32, uint64: Unsigned integers with varying bit depths.

    Table 4.11 provides a detailed overview of various data types, including their descriptions, smallest and largest values, and memory storage sizes.

    Table 4.11 Data Types and Their Characteristics

    | Data Type | Description | Smallest Value | Largest Value | Memory Storage Size (bytes) |
    |---|---|---|---|---|
    | int8 | Signed 8-bit integer | -128 | 127 | 1 |
    | int16 | Signed 16-bit integer | -32,768 | 32,767 | 2 |
    | int32 | Signed 32-bit integer | -2,147,483,648 | 2,147,483,647 | 4 |
    | int64 | Signed 64-bit integer | -9,223,372,036,854,775,808 | 9,223,372,036,854,775,807 | 8 |
    | uint8 | Unsigned 8-bit integer | 0 | 255 | 1 |
    | uint16 | Unsigned 16-bit integer | 0 | 65,535 | 2 |
    | uint32 | Unsigned 32-bit integer | 0 | 4,294,967,295 | 4 |
    | uint64 | Unsigned 64-bit integer | 0 | 18,446,744,073,709,551,615 | 8 |

  2. Floating-Point Types:

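    • float16, float32, float64, float128: Floating-point numbers with varying levels of precision. float128 is platform-dependent and is not available on every system (for example, it is missing on Windows builds).

    • complex64, complex128, complex256: Complex numbers stored as a pair of floating-point values (real and imaginary parts). complex256 is platform-dependent in the same way as float128.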

  3. Boolean Type:

    • bool_: Represents the boolean values True and False (written np.bool_ in code).

  4. Strings:

    • bytes_ (also available as string_ before NumPy 2.0): Fixed-size byte string type.

    • str_ (also available as unicode_ before NumPy 2.0): Fixed-size Unicode string type.

  5. Datetime Types:

    • datetime64[D]: Date with day precision.

    • datetime64[M]: Date with month precision.

    • datetime64[Y]: Date with year precision.

    • datetime64[h]: Time with hour precision.

    • datetime64[m]: Time with minute precision.

    • datetime64[s]: Time with second precision.

    • datetime64[ms]: Time with millisecond precision.

    • datetime64[us]: Time with microsecond precision.

    • datetime64[ns]: Time with nanosecond precision.

  6. Timedelta Type:

    • timedelta64[D]: Time interval with day precision.

    • timedelta64[h]: Time interval with hour precision.

    • timedelta64[m]: Time interval with minute precision.

    • timedelta64[s]: Time interval with second precision.

    • timedelta64[ms]: Time interval with millisecond precision.

    • timedelta64[us]: Time interval with microsecond precision.

    • timedelta64[ns]: Time interval with nanosecond precision.

  7. Object Type:

    • object: A generic Python object type that can store any Python object.
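The categories listed above can be checked directly from NumPy. The following is a minimal sketch rather than a complete tour: it queries integer and floating-point limits with np.iinfo and np.finfo (compare the integer results with Table 4.11) and performs a small date calculation with datetime64 and timedelta64. The variable names and sample dates are assumptions chosen for illustration.

```python
import numpy as np

# Query the representable range of a few integer types (compare with Table 4.11)
for int_type in (np.int8, np.int16, np.uint8, np.uint16):
    info = np.iinfo(int_type)
    print(f"{info.dtype}: min={info.min}, max={info.max}, bytes={info.dtype.itemsize}")

# Floating-point types expose their precision and range through np.finfo
for float_type in (np.float16, np.float32, np.float64):
    info = np.finfo(float_type)
    print(f"{info.dtype}: eps={info.eps}, max={info.max}")

# datetime64 values with day precision; subtracting them yields a timedelta64
start = np.datetime64("2024-01-01", "D")
end = np.datetime64("2024-03-01", "D")
elapsed = end - start
print("Elapsed:", elapsed)

# Adding a timedelta64 shifts a date by that interval
deadline = start + np.timedelta64(10, "D")
print("Deadline:", deadline)
```

The unit in square brackets (here D for days) fixes the resolution of both the dates and the intervals between them.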

import numpy as np
from pprint import pprint

def print_bold(txt):
    print("\033[1m" + txt + "\033[0m")
    
# Create NumPy arrays with specific data types
int_array = np.array([1, 2, 3], dtype=np.int32)
float_array = np.array([1.1, 2.2, 3.3], dtype=np.float64)
bool_array = np.array([True, False, True], dtype=np.bool_)

# Print arrays with specified data types
print_bold("Integer Array (dtype=np.int32):")
pprint(int_array, width=20)  # Using pprint to format the output

print_bold("Float Array (dtype=np.float64):")
pprint(float_array, width=20)  # Using pprint to format the output

print_bold("Boolean Array (dtype=np.bool_):")
pprint(bool_array, width=20)  # Using pprint to format the output

Output:

Integer Array (dtype=np.int32):
array([1, 2, 3])
Float Array (dtype=np.float64):
array([1.1, 2.2, 3.3])
Boolean Array (dtype=np.bool_):
array([ True, False,  True])

When creating NumPy arrays, you can specify the data type with the dtype parameter, as the arrays in the example above do. This lets you control memory usage and precision according to your specific needs.
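To make the effect of the dtype parameter concrete, the sketch below stores the same values with three different data types and compares the per-element and total sizes through the itemsize and nbytes attributes. The sample values and array names are illustrative assumptions.

```python
import numpy as np

values = [0, 50, 100, 150, 200]

# Same values, three different storage types chosen via the dtype parameter
as_uint8 = np.array(values, dtype=np.uint8)
as_int64 = np.array(values, dtype=np.int64)
as_float32 = np.array(values, dtype=np.float32)

for arr in (as_uint8, as_int64, as_float32):
    # itemsize: bytes per element; nbytes: bytes used by the whole data buffer
    print(f"{arr.dtype}: itemsize={arr.itemsize}, nbytes={arr.nbytes}")
```

For five elements the difference is negligible, but the same factor of eight between uint8 and int64 applies to arrays with millions of entries.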

4.3.1.1. Array Types and Type Conversions in NumPy

NumPy provides an extensive range of numerical data types that go beyond what Python’s standard types offer. This section outlines the available array data types and how to modify the data type of an array [Harris et al., 2020, NumPy Developers, 2023].

Table 4.12 The supported data types closely align with those in the C programming language.

| NumPy Type | C Type | Description |
|---|---|---|
| numpy.bool_ | bool | Boolean (True or False) stored as a byte |
| numpy.byte | signed char | Platform-defined |
| numpy.ubyte | unsigned char | Platform-defined |
| numpy.short | short | Platform-defined |
| numpy.ushort | unsigned short | Platform-defined |
| numpy.intc | int | Platform-defined |
| numpy.uintc | unsigned int | Platform-defined |
| numpy.int_ | long | Platform-defined |
| numpy.uint | unsigned long | Platform-defined |
| numpy.longlong | long long | Platform-defined |
| numpy.ulonglong | unsigned long long | Platform-defined |
| numpy.half / numpy.float16 | n/a | Half-precision float: sign bit, 5 bits exponent, 10 bits mantissa |
| numpy.single | float | Platform-defined single-precision float (typically sign bit, 8 bits exponent, 23 bits mantissa) |
| numpy.double | double | Platform-defined double-precision float (typically sign bit, 11 bits exponent, 52 bits mantissa) |
| numpy.longdouble | long double | Platform-defined extended-precision float |
| numpy.csingle | float complex | Complex number represented by two single-precision floats (real and imaginary components) |
| numpy.cdouble | double complex | Complex number represented by two double-precision floats (real and imaginary components) |
| numpy.clongdouble | long double complex | Complex number represented by two extended-precision floats (real and imaginary components) |

These data types let you control memory usage and precision in your numerical computations. You can choose a data type when creating an array with the dtype parameter, or convert an existing array to another type with the .astype() method.
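Because most entries in Table 4.12 are platform-defined, it can help to ask NumPy what each C-named alias resolves to on the current machine. The sketch below does this with np.dtype; the selection of aliases and the dictionary name are illustrative assumptions, and the printed sizes will differ between platforms.

```python
import numpy as np

# A few C-named aliases from Table 4.12; sizes depend on the platform's C compiler
c_named_types = {
    "byte": np.byte,
    "short": np.short,
    "intc": np.intc,
    "longlong": np.longlong,
    "single": np.single,
    "double": np.double,
    "longdouble": np.longdouble,
}

for alias, c_type in c_named_types.items():
    dt = np.dtype(c_type)
    print(f"np.{alias:<10} -> {dt.name:<10} ({dt.itemsize} bytes)")

# single and double correspond to the fixed-width names float32 and float64
print(np.dtype(np.single) == np.dtype(np.float32))
print(np.dtype(np.double) == np.dtype(np.float64))
```

When a specific width matters, for example in a file format, the fixed-width names such as int32 or float64 are usually the safer choice.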

4.3.1.2. Using .dtype

You can check the data type of a NumPy array with its .dtype attribute:

# Create a NumPy array
data = np.array([1, 2, 3, 4, 5])

# Check the data type of the array
data_type = data.dtype

# Print the data type
print("Data type:", data_type)
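
Output:

Data type: int32

In this example, the data array has type int32, a 32-bit integer. The default integer type is platform-dependent: on most Linux and macOS systems, and on Windows with NumPy 2.0 or later, the same code reports int64.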

Keep in mind that NumPy’s data types are often specified with a combination of a root type (like int, float, complex, etc.) and the number of bits used to represent each element. For instance, int64 represents a 64-bit integer, and float32 represents a 32-bit floating-point number.

If you want to explicitly convert the data type of a NumPy array, you can use the .astype() method:

# Create a NumPy array
data = np.array([1, 2, 3, 4, 5])

# Convert the data to a different data type (float16)
data_float = data.astype(np.float16)

# Print the data type of the converted array
print("Data type of converted array:", data_float.dtype)

Output:

Data type of converted array: float16

This code converts the data array to a 16-bit (half-precision) floating-point type and assigns the result to the data_float variable. The .astype() method returns a new array, so data itself keeps its original type.
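Conversions with .astype() can change values as well as types, so it is worth checking the result. The following sketch, with sample values chosen as an assumption to expose the effects, shows that the original array is left untouched, that converting floats to integers drops the fractional part, and that float16 cannot represent every integer above 2048.

```python
import numpy as np

original = np.array([1.7, -1.7, 2049.0])

# astype returns a new array; the original keeps its dtype (float64)
as_int = original.astype(np.int32)
as_half = original.astype(np.float16)

print(original.dtype)
print(as_int)    # fractional parts are discarded (truncation toward zero)
print(as_half)   # float16 keeps only about three significant decimal digits
```

Here 1.7 and -1.7 become 1 and -1, and 2049.0 is rounded to 2048.0 in float16 because the spacing between representable values at that magnitude is 2.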

4.3.1.3. Data type conversion

Data type conversion plays a central role in managing memory usage and on-disk storage. Choosing the smallest data type that safely holds your values can substantially reduce both. The main benefits are as follows [Harris et al., 2020, Mohbey and Bakariya, 2021, NumPy Developers, 2023]:

4.3.1.3.1. Memory Usage

  1. Memory Allocation: Different data types occupy different amounts of memory, so choosing an appropriate type minimizes waste. For instance, using int16 instead of int64 cuts memory use to a quarter when every value fits in the int16 range (-32,768 to 32,767).

  2. Array Size: Arrays of larger data types consume more memory. Converting data to smaller data types can reduce the memory footprint of your arrays, which is especially beneficial when dealing with large datasets.

  3. Caching and Performance: Smaller data types can fit more data in cache, leading to better performance due to reduced cache misses. This can significantly improve computational speed.
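The memory savings described above only pay off if the smaller type can hold every value. The sketch below uses a hypothetical helper, downcast_if_safe, which is not part of NumPy, to convert an integer array only when its values fit the target range reported by np.iinfo. The array contents are assumptions for illustration.

```python
import numpy as np

def downcast_if_safe(arr, target_dtype):
    """Return arr converted to target_dtype if every value fits, else arr unchanged.

    Hypothetical helper for illustration; intended for integer dtypes only.
    """
    limits = np.iinfo(target_dtype)
    if arr.size == 0 or (arr.min() >= limits.min and arr.max() <= limits.max):
        return arr.astype(target_dtype)
    return arr

# One million int64 values between 0 and 999
readings = np.arange(1_000_000, dtype=np.int64) % 1000
compact = downcast_if_safe(readings, np.int16)

print(readings.nbytes)   # 8 bytes per element
print(compact.nbytes)    # 2 bytes per element

# 70,000 exceeds the int16 range, so this array is returned unchanged
too_large = np.array([0, 70_000], dtype=np.int64)
print(downcast_if_safe(too_large, np.int16).dtype)
```

Checking the range first matters because .astype() does not raise an error when integer values fall outside the target type.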

4.3.1.3.2. Hard Disk Storage

  1. File Size: Saving data with smaller data types results in smaller file sizes, which leads to more efficient storage on hard disks. This is particularly important when dealing with large datasets that need to be stored and transferred.

  2. I/O Operations: Smaller data types require less time for input/output operations, such as reading from or writing to disk. This can speed up data loading and processing times.

  3. Database Efficiency: When working with databases, smaller data types reduce the space required to store records, improving overall database performance and reducing storage costs.
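The effect on file size can be estimated without touching the disk. The sketch below writes the same values in NumPy's .npy format to an in-memory buffer, once as int64 and once as uint8, and compares the number of bytes produced. The helper saved_size and the use of io.BytesIO in place of a real file are assumptions made to keep the example self-contained.

```python
import io
import numpy as np

def saved_size(arr):
    """Return the number of bytes np.save writes for arr (hypothetical helper)."""
    buffer = io.BytesIO()
    np.save(buffer, arr)          # np.save accepts file-like objects
    return buffer.getbuffer().nbytes

# 100,000 small values (0..99) that fit comfortably in 8 bits
samples = np.arange(100_000) % 100

size_int64 = saved_size(samples.astype(np.int64))
size_uint8 = saved_size(samples.astype(np.uint8))

print("int64 file size:", size_int64)
print("uint8 file size:", size_uint8)
```

Apart from a small header, the .npy payload is the raw array data, so the file size tracks the item size of the chosen data type.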

4.3.1.3.3. Serialization and Network Communication

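  1. Serialization: When data is transmitted or saved in a binary format (such as raw bytes or NumPy’s .npy files), smaller data types produce proportionally smaller payloads. In text formats such as JSON or CSV, the size depends on the printed representation of each value rather than on the in-memory data type, so the savings are less direct.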

  2. Transfer Time: Smaller payloads take less time to send over a network, which matters most on bandwidth-limited connections.
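As a rough illustration of how the data type affects what is actually sent, the sketch below serializes the same 1,000 values as raw bytes and as JSON text, once from an int64 array and once from an int16 array. The variable names and the choice of values are assumptions.

```python
import json
import numpy as np

wide = np.arange(1000, dtype=np.int64)
narrow = wide.astype(np.int16)      # every value fits in int16

# Binary payloads: size is the element count times the item size
payload_wide = wide.tobytes()
payload_narrow = narrow.tobytes()
print(len(payload_wide), len(payload_narrow))

# The receiver must know the dtype to decode the raw bytes
restored = np.frombuffer(payload_narrow, dtype=np.int16)
print(np.array_equal(restored, wide))

# Text payloads: JSON stores the digits of each value, not its dtype
text_wide = json.dumps(wide.tolist())
text_narrow = json.dumps(narrow.tolist())
print(text_wide == text_narrow)
```

In this sketch the binary payload shrinks from 8,000 to 2,000 bytes, while the JSON text is identical for both arrays.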

4.3.2. Introduction to Numba

Numba [Numba Developers, 2023] is a powerful tool in the realm of Python programming, designed specifically to address one of Python’s inherent trade-offs: its ease of use and flexibility versus its execution speed. Python is known for its readability and versatility, making it a preferred language for a wide range of applications. However, its interpreted nature can lead to slower execution speeds, especially for computationally intensive tasks.

Numba bridges this gap by offering a Just-In-Time (JIT) compiler for Python that translates Python functions into optimized machine code at runtime. You can write high-level Python code and, by adding a simple decorator, often achieve performance close to that of languages like C or Fortran for suitable numerical code. Numba is particularly valuable in scientific computing, numerical simulations, data analysis, and any application where computational efficiency is paramount [Numba Developers, 2023].

Relationship to NumPy [Numba Developers, 2023]:

  1. NumPy Integration: Numba and NumPy are often used together to enhance the performance of numerical operations in Python.

  2. NumPy Arrays: NumPy provides efficient data structures for working with arrays and matrices. Numba can be applied to functions that operate on these arrays to further optimize their execution speed.

  3. Example: Developers frequently write high-level code using NumPy to manipulate arrays and matrices. Then, they selectively apply Numba to specific functions within their codebase where performance optimization is essential. This synergy allows them to maintain code readability while achieving superior execution speed.

  4. Parallelization: Numba offers parallelization capabilities, which enable multicore CPU utilization. This can be especially advantageous for tasks involving extensive datasets or complex calculations, where parallel processing can significantly expedite the computation.
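The parallelization point above can be sketched with Numba's parallel=True option and prange. This is a minimal illustration, not a benchmark: the function name and data are assumptions, and the try/except fallback is an addition for portability so that the code still runs, without any speedup, on systems where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit, prange
except ImportError:
    # Fallback (assumption for portability): run as plain Python if Numba is absent
    prange = range

    def njit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator

@njit(parallel=True)
def sum_of_squares(values):
    total = 0.0
    # prange marks the loop as parallelizable; Numba treats total as a reduction
    for i in prange(values.shape[0]):
        total += values[i] * values[i]
    return total

data = np.arange(100_000, dtype=np.float64)
result = sum_of_squares(data)

# Compare against the equivalent NumPy expression
print(result, np.isclose(result, np.sum(data * data)))
```

Whether the parallel version is faster depends on the array size and the number of available cores, since starting threads has a cost of its own.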

Example:

# This example is from 
# https://numba.pydata.org/numba-doc/dev/user/5minguide.html
# with minor modifications

from numba import jit
import numpy as np
import time

# Create a 10,000 x 10,000 NumPy array (1e8 elements)
x = np.arange(int(1e8)).reshape(int(1e4), int(1e4))

# Define a JIT-compiled function
@jit(nopython=True)
def go_fast(a):
    # Accumulate tanh of each diagonal element (not the plain trace)
    trace = 0.0
    for i in range(a.shape[0]):
        trace += np.tanh(a[i, i])
    return a + trace

# Measure the execution time with compilation
start = time.time()
go_fast(x)
end = time.time()
print(f"Elapsed time (with compilation) = {end - start:.6f} seconds")

# Measure the execution time after compilation (from cache)
start = time.time()
go_fast(x)
end = time.time()
print(f"Elapsed time (after compilation) = {end - start:.6f} seconds")

Output:

Elapsed time (with compilation) = 0.795050 seconds
Elapsed time (after compilation) = 0.121297 seconds