Optimization of I/O Workloads by Profiling in Python
Optimizing Python code to run faster.
Optimizing I/O workloads in Python typically involves understanding where the bottlenecks are and then applying strategies to reduce or manage them. Profiling is a crucial step in this process because it identifies the parts of the code that are most resource-intensive. Here is a step-by-step guide to optimizing I/O workloads by profiling in Python:
1. Identify the I/O Workloads
Understanding the nature of your I/O workloads is the essential first step. Do they involve disk I/O, such as file read/write operations; network I/O, which includes data transmission over a network; or database I/O, comprising database interactions? Distinct optimization techniques apply to each category. This article focuses on I/O bottlenecks related to network and file read/write operations.
2. Use Profiling Tools
There are several tools available for profiling Python code:
cProfile:
cProfile is the most commonly used profiler in Python. Because it is a C extension with manageable overhead, it is generally advised for most users and is appropriate for profiling programs that run for extended periods. It is widely used for several reasons.
Built-in and Standard: cProfile is a part of the Python Standard Library, which means it's readily available in any standard Python installation without additional packages.
Low Overhead: As a C extension, cProfile introduces relatively low overhead compared to some pure Python profilers. This feature makes it suitable for profiling longer-running applications where the profiler's impact on performance is a concern.
General-purpose Profiling: cProfile is suitable for most profiling needs, balancing detail and usability. It gives you a function-by-function breakdown of execution time, which is usually what you need to identify performance bottlenecks.
Wide Acceptance and Community Support: Given its status as part of the standard library and ease of use, cProfile has a broad user base and community support.
While cProfile is the most commonly used, it's important to note that the best profiler for a given task can depend on the project's specific needs. For instance, line_profiler is preferred for line-by-line analysis, and memory_profiler is used when memory usage is the primary concern. The choice of a profiler often depends on the specific aspect of the application you want to optimize.
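For illustration, here is a minimal sketch of profiling an I/O-heavy function programmatically with cProfile and pstats (both in the standard library); the function and file name are made up for the example:

import cProfile
import pstats

def write_lines(path, count):
    # A deliberately I/O-heavy toy function
    with open(path, 'w') as f:
        for i in range(count):
            f.write(f"line {i}\n")

profiler = cProfile.Profile()
profiler.enable()
write_lines("toy_output.txt", 100_000)
profiler.disable()

# Sort by cumulative time and show the five most expensive entries
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)

The same result can be obtained without touching the code at all by running python -m cProfile -s cumulative my_script.py, which is the approach used later in this article.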
line_profiler:
line_profiler is a tool that provides line-by-line profiling of your code, allowing you to see the performance of each individual line. This level of granularity is beneficial when you are trying to optimize your code and need to understand exactly where the bottlenecks are.
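As a minimal sketch (assuming the third-party line_profiler package is installed via pip install line_profiler), you decorate the function you want to inspect and run the script through kernprof, which injects the profile decorator at runtime:

# Run with: kernprof -l -v my_script.py
@profile  # provided by kernprof at runtime; no import needed
def sum_line_lengths(path):
    total = 0
    with open(path, 'r') as f:
        for line in f:  # each line of this function gets its own timing row in the report
            total += len(line)
    return total

The -v flag prints the line-by-line report immediately after the script finishes.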
memory_profiler:
This profiler is helpful if you suspect that memory usage is related to I/O inefficiency.
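A similar sketch for memory_profiler (a third-party package, pip install memory_profiler); here the decorator is imported explicitly, and a per-line memory report is printed when the decorated function runs:

from memory_profiler import profile

@profile  # prints a line-by-line memory usage report for this function
def load_all_rows(path):
    with open(path, 'r') as f:
        return f.readlines()  # holds the entire file in memory at once

Reading the file lazily (iterating over it, as the weather example below does) would keep the memory increment flat.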
3. Analyze the Profile Data
After running the profiler, analyze the data to find where most of the time is spent. Generally, the profiling output will point to:
- Long-running I/O operations
- Repeated I/O operations that could be batched
- Unnecessary I/O operations that could be eliminated
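If you save a profile to disk first (for example, python -m cProfile -o profile.out my_script.py), pstats lets you slice the data programmatically; a minimal sketch:

import pstats

stats = pstats.Stats("profile.out")
stats.strip_dirs()                # drop long path prefixes for readability
stats.sort_stats("cumulative")    # or "tottime" to rank the functions themselves
stats.print_stats(10)             # show the ten most expensive entries
stats.print_callers("read")       # who calls functions whose names match "read"?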
4. Apply Optimization Strategies
Based on the findings, you can apply different strategies:
Caching: Store data in memory to avoid repeated I/O operations.
Batching: Combine multiple I/O operations into a single one to reduce overhead.
Asynchronous I/O: Use asyncio or other asynchronous programming techniques to perform I/O operations without blocking the main thread (see the sketch after this list).
Buffering: Use buffers to reduce the number of I/O calls, especially for disk I/O.
Data Compression: Reducing the size of the data being read or written can improve I/O performance, particularly for network and disk I/O.
Parallelism: Use multi-threading or multi-processing to perform I/O operations in parallel, especially when dealing with network I/O.
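As a sketch of the asynchronous I/O strategy (assuming Python 3.9+ for asyncio.to_thread; the URLs are made up for illustration), several blocking downloads can be overlapped instead of performed one after another:

import asyncio
import requests

async def fetch(url):
    # requests is blocking, so run it in a worker thread to keep the event loop free
    return await asyncio.to_thread(requests.get, url)

async def fetch_all(urls):
    # Issue all downloads concurrently and gather the responses
    return await asyncio.gather(*(fetch(url) for url in urls))

urls = [f"https://example.com/data/{i}.csv" for i in range(5)]  # hypothetical URLs
responses = asyncio.run(fetch_all(urls))

With an async-native HTTP client the worker threads would not be needed, but wrapping blocking calls with asyncio.to_thread is often the smallest change to an existing code base.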
5. Test and Iterate
After applying optimizations, profile your code again to see the impact.
Continue this cycle iteratively:
- Optimize
- Profile
- Analyze
- Adjust
6. Other Considerations
Ensure that your hardware is not a limiting factor. For database I/O, look into optimizing your database queries and indices. For file I/O, consider the file system and hardware it's running on.
7. Documentation and Community Resources
Read the documentation of the profiling tools you use for more detailed guidance. Engage with Python communities or forums for specific advice and best practices. Remember, optimization is often about trade-offs, and focusing on the parts of your code that will yield the most significant improvements is essential.
Weather Station Data Analysis and Profiling
As an example, I will analyze weather station data. The weather station records the hourly temperature, and its data files have the columns below.
"STATION","DATE","SOURCE","LATITUDE","LONGITUDE","ELEVATION","NAME","REPORT_TYPE","CALL_SIGN","QUALITY_CONTROL","WND","CIG","VIS","TMP","DEW","SLP","AA1","AA2","AA3","AJ1","KA1","KA2","OC1","OD1","OD2","REM"
Of all these columns, only STATION and TMP are needed for our analysis.
The analysis involves the following steps:
- Create a Python program that accepts two parameters: a comma-separated station list and a year range (start and end year separated by a hyphen)
- Download the weather station data as a CSV
- Parse the CSV and get all the temperatures for the station list and the year range provided in the parameters.
- Find the maximum, minimum, and average temperatures for the stations for the year range.
- Profile the code
- Analyze the I/O bottleneck.
- Implement local caching
- Analyze the output and the runtime.
Code Without Local Cache
This program downloads the weather data for the specified stations and calculates the low, high, and average temperatures for the given year range:
import csv
import sys
import requests
import collections
from statistics import mean

# This function downloads the weather data for a station/year and writes the output as a CSV file
def download_weather_station_data(station, year):
    my_url = generic_url.format(station=station, year=year)
    req = requests.get(my_url)
    if req.status_code != 200:
        return
    with open(generic_file.format(station=station, year=year), 'w') as sf:
        sf.write(req.text)

# This parent function downloads the weather data for the given station list and year range
def download_all_weather_station_data(stations_list, start_year, end_year):
    for station in stations_list:
        for year in range(start_year, end_year + 1):
            download_weather_station_data(station, year)

# This function yields the valid temperature readings from the file
def get_file_temperature(file_name):
    with open(file_name, 'r') as tf:
        reader = csv.reader(tf)
        header = next(reader)
        for row in reader:
            station = row[header.index("STATION")]
            # TMP holds "<temperature in tenths of a degree>,<quality code>"
            temp = row[header.index("TMP")]
            temperature, status = temp.split(",")
            if int(status) != 1:
                continue
            temperature = int(temperature) / 10
            yield temperature

# This parent function gets all the temperatures for the given stations and years
def get_temperatures_all(stations_list, start_year, end_year):
    temperatures = collections.defaultdict(list)
    for station in stations_list:
        for year in range(start_year, end_year + 1):
            for temperature in get_file_temperature(generic_file.format(station=station, year=year)):
                temperatures[station].append(temperature)
    return temperatures

# This function gets the maximum/minimum/average temperature per station over the given years
def get_temperatures(lst_temperatures, calc_mode):
    result = {}
    for mode in calc_mode:
        if mode == 'max':
            result[mode] = {station: max(temperatures) for station, temperatures in lst_temperatures.items()}
        elif mode == 'min':
            result[mode] = {station: min(temperatures) for station, temperatures in lst_temperatures.items()}
        else:
            result[mode] = {station: mean(temperatures) for station, temperatures in lst_temperatures.items()}
    return result

# Main function
if __name__ == "__main__":
    stations = sys.argv[1].split(",")
    years = [int(year) for year in sys.argv[2].split("-")]
    first_year = years[0]
    last_year = years[1]
    generic_url = "https://www.ncei.noaa.gov/data/global-hourly/access/{year}/{station}.csv"
    generic_file = "Weather_station_{station}_{year}.csv"
    download_all_weather_station_data(stations, first_year, last_year)
    temperatures_all = get_temperatures_all(stations, first_year, last_year)
    temperatures_values = get_temperatures(temperatures_all, ['max', 'min', 'avg'])
    print(f"The temperatures are {temperatures_values}")
Executed the code and got the desired output.
python load_weather_data.py "01480099999,02110099999,02243099999" 2018-2023
Output:
The temperatures are {'max': {'01480099999': 33.5, '02110099999': 29.6, '02243099999': 32.0}, 'min': {'01480099999': -20.4, '02110099999': -39.5, '02243099999': -32.1}, 'avg': {'01480099999': 7.145012712693135, '02110099999': 0.23863829994401306, '02243099999': 3.383049058515579}}
Analyze the Code With cProfile:
python -m cProfile -s cumulative load_weather_data.py "01480099999,02110099999,02243099999" 2018-2023 > load_weather_data_profile.txt
The temperatures are {'max': {'01480099999': 33.5, '02110099999': 29.6, '02243099999': 32.0}, 'min': {'01480099999': -20.4, '02110099999': -39.5, '02243099999': -32.1}, 'avg': {'01480099999': 7.1538004828081165, '02110099999': 0.23863829994401306, '02243099999': 3.383049058515579}}
1422783 function calls (1416758 primitive calls) in 17.250 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
181/1 0.002 0.000 17.250 17.250 {built-in method builtins.exec}
1 0.000 0.000 17.250 17.250 load_weather_data.py:1(<module>)
1 0.003 0.003 16.241 16.241 load_weather_data.py:23(download_all_weather_station_data)
18 0.003 0.000 16.221 0.901 load_weather_data.py:12(download_weather_station_data)
The function download_all_weather_station_data takes up nearly all of the runtime (16.2 of 17.25 seconds), making it the obvious target for I/O optimization.
Since the data is static, there is no need to download the CSV files again once they have been generated. The program below is optimized to skip downloading files that already exist locally.
"""This program downloads the weather data for the specified stations
and calculates low and high weather for the given year"""
import os
import csv
import sys
import fnmatch
import requests
import collections
from statistics import mean
# This function downloads the weather data for station/year and write the output as a csv file
def download_weather_station_data(station, year):
my_url = generic_url.format(station=station, year=year)
req = requests.get(my_url)
if req.status_code != 200:
return
with open(generic_file.format(station=station, year=year), 'w') as sf:
sf.write(req.text)
# This parent function downloads the weather data for the given station list and year range
def download_all_weather_station_data(stations_list, start_year, end_year):
for station in stations_list:
for year in range(start_year, end_year + 1):
if not os.path.exists(generic_file.format(station=station, year=year)):
download_weather_station_data(station, year)
# This function gets the temperature details from the file
def get_file_temperature(file_name):
with open(file_name, 'r') as tf:
reader = csv.reader(tf)
header = next(reader)
for row in reader:
station = row[header.index("STATION")]
temp = row[header.index("TMP")]
temperature, status = temp.split(",")
if int(status) != 1:
continue
temperature = int(temperature) / 10
yield temperature
# This parent function gets all the temperatures for the given station and year
def get_temperatures_all(stations_list, start_year, end_year):
temperatures = collections.defaultdict(list)
for station in stations_list:
for year in range(start_year, end_year + 1):
if os.path.exists(generic_file.format(station=station, year=year)):
for temperature in get_file_temperature(generic_file.format(station=station, year=year)):
temperatures[station].append(temperature)
return temperatures
# This function gets the maximum/minimum/average temperature for the station over the given years
def get_temperatures(lst_temperatures, calc_mode):
result = {}
for mode in calc_mode:
if mode == 'max':
result[mode] = {station: max(temperatures) for station, temperatures in lst_temperatures.items()}
elif mode == 'min':
result[mode] = {station: min(temperatures) for station, temperatures in lst_temperatures.items()}
else:
result[mode] = {station: mean(temperatures) for station, temperatures in lst_temperatures.items()}
return result
# Main Function
if __name__ := "__main__":
stations = sys.argv[1].split(",")
years = [int(year) for year in sys.argv[2].split("-")]
first_year = years[0]
last_year = years[1]
generic_url = "https://www.ncei.noaa.gov/data/global-hourly/access/{year}/{station}.csv"
generic_file = "Weather_station_{station}_{year}.csv"
current_directory = os.getcwd()
download_all_weather_station_data(stations, first_year, last_year)
count = len(fnmatch.filter(os.listdir(current_directory), '*.csv'))
if count > 0:
temperatures_all = get_temperatures_all(stations, first_year, last_year)
temperatures_values = get_temperatures(temperatures_all, ['max', 'min', 'avg'])
print(f"The temperatures are {temperatures_values}")
else:
print(f"There are no file(s) available for the given stations {sys.argv[1]} and years {sys.argv[2]}")
Executed the code and got the desired output.
python load_weather_data_cache.py "01480099999,02110099999,02243099999" 2018-2023
Result:
The temperatures are {'max': {'01480099999': 33.5, '02110099999': 29.6, '02243099999': 32.0}, 'min': {'01480099999': -20.4, '02110099999': -39.5, '02243099999': -32.1}, 'avg': {'01480099999': 7.145012712693135, '02110099999': 0.23863829994401306, '02243099999': 3.383049058515579}}
Analyze the Code With cProfile:
python -m cProfile -s cumulative load_weather_data_cache.py "01480099999,02110099999,02243099999" 2018-2023 > load_weather_data_cache_profile.txt
The temperatures are {'max': {'01480099999': 33.5, '02110099999': 29.6, '02243099999': 32.0}, 'min': {'01480099999': -20.4, '02110099999': -39.5, '02243099999': -32.1}, 'avg': {'01480099999': 7.1538004828081165, '02110099999': 0.23863829994401306, '02243099999': 3.383049058515579}}
1142084 function calls (1136170 primitive calls) in 1.005 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
179/1 0.002 0.000 1.005 1.005 {built-in method builtins.exec}
1 0.000 0.000 1.005 1.005 load_weather_data_cache.py:1(<module>)
1 0.040 0.040 0.670 0.670 load_weather_data_cache.py:50(get_temperatures_all)
119125 0.516 0.000 0.619 0.000 load_weather_data_cache.py:33(get_file_temperature)
17 0.000 0.000 0.367 0.022 __init__.py:1(<module>)
The function download_all_weather_station_data no longer dominates the runtime. The overall runtime dropped from about 17.25 seconds to about 1 second, roughly a 17x improvement.
Conclusion
Caches, as demonstrated in this example, can accelerate code by an order of magnitude or more. However, managing caches is challenging and a frequent source of bugs. In this example the files remain static over time, but there are numerous scenarios in which the cached data can change. In such situations, the code responsible for cache management must be able to detect and handle those changes.
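As a hedged illustration of one such safeguard (the one-week cutoff below is an arbitrary choice for the example), the os.path.exists check used above could be extended so that stale cache files are re-downloaded:

import os
import time

CACHE_MAX_AGE_SECONDS = 7 * 24 * 3600  # arbitrary: treat files older than a week as stale

def is_cache_fresh(file_name):
    # A file counts as a valid cache entry only if it exists and is recent enough
    if not os.path.exists(file_name):
        return False
    return time.time() - os.path.getmtime(file_name) < CACHE_MAX_AGE_SECONDS

download_all_weather_station_data would then call download_weather_station_data whenever is_cache_fresh(...) returns False, instead of checking for the file's existence alone.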