Optimization of I/O Workloads by Profiling in Python
Optimizing Python code to run faster.
Optimizing I/O workloads in Python typically involves understanding where the bottlenecks are and then applying strategies to reduce or manage them. Profiling is a crucial step in this process because it identifies the parts of the code that are most resource-intensive. Here is a step-by-step guide to optimizing I/O workloads by profiling in Python:
1. Identify the I/O Workloads
Understanding the nature of your I/O workloads is the essential first step. Do they involve disk I/O, such as file read/write operations; network I/O, which includes data transmission over a network; or database I/O, comprising database interactions? Distinct optimization techniques apply to each category. This article focuses on I/O bottlenecks related to network and file read/write operations.
2. Use Profiling Tools
There are several tools available for profiling Python code:
cProfile:
cProfile is the most commonly used profiler in Python. Because it is a C extension with manageable overhead, it is generally advised for most users and is appropriate for profiling programs that run for extended periods. It is widely used for several reasons.
Built-in and Standard: cProfile is a part of the Python Standard Library, which means it's readily available in any standard Python installation without additional packages.
Low Overhead: As a C extension, cProfile introduces relatively low overhead compared to some pure Python profilers. This feature makes it suitable for profiling longer-running applications where the profiler's impact on performance is a concern.
General-purpose Profiling: cProfile is suitable for most profiling needs, balancing detail and usability. It gives you a function-by-function breakdown of execution time, which is usually what you need to identify performance bottlenecks.
Wide Acceptance and Community Support: Given its status as part of the standard library and ease of use, cProfile has a broad user base and community support.
While cProfile is the most commonly used, it's important to note that the best profiler for a given task can depend on the project's specific needs. For instance, line_profiler is preferred for line-by-line analysis, and memory_profiler is used when memory usage is the primary concern. The choice of a profiler often depends on the specific aspect of the application you want to optimize.
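For illustration, here is a minimal sketch of profiling an I/O-heavy function programmatically with cProfile and pstats (both in the standard library); the function and file name are made up for the example:

import cProfile
import pstats

def write_lines(path, count):
    # A deliberately I/O-heavy toy function
    with open(path, 'w') as f:
        for i in range(count):
            f.write(f"line {i}\n")

profiler = cProfile.Profile()
profiler.enable()
write_lines("toy_output.txt", 100_000)
profiler.disable()

# Sort by cumulative time and show the five most expensive entries
pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)

The same result can be obtained without touching the code at all by running python -m cProfile -s cumulative my_script.py, which is the approach used later in this article.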
line_profiler:
line_profiler is a tool that provides line-by-line profiling of your code, allowing you to see the performance of each individual line. This level of granularity is beneficial when you are trying to optimize your code and need to understand exactly where the bottlenecks are.
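As a minimal sketch (assuming the third-party line_profiler package is installed via pip install line_profiler), you decorate the function you want to inspect and run the script through kernprof, which injects the profile decorator at runtime:

# Run with: kernprof -l -v my_script.py
@profile  # provided by kernprof at runtime; no import needed
def sum_line_lengths(path):
    total = 0
    with open(path, 'r') as f:
        for line in f:  # each line of this function gets its own timing row in the report
            total += len(line)
    return total

The -v flag prints the line-by-line report immediately after the script finishes.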
memory_profiler:
This profiler is helpful if you suspect that memory usage is related to I/O inefficiency.
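A similar sketch for memory_profiler (a third-party package, pip install memory_profiler); here the decorator is imported explicitly, and a per-line memory report is printed when the decorated function runs:

from memory_profiler import profile

@profile  # prints a line-by-line memory usage report for this function
def load_all_rows(path):
    with open(path, 'r') as f:
        return f.readlines()  # holds the entire file in memory at once

Reading the file lazily (iterating over it, as the weather example below does) would keep the memory increment flat.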
3. Analyze the Profile Data
After running the profiler, analyze the data to find where most of the time is spent. Generally, the profiling output will point to:
- Long-running I/O operations
- Repeated I/O operations that could be batched
- Unnecessary I/O operations that could be eliminated
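If you save a profile to disk first (for example, python -m cProfile -o profile.out my_script.py), pstats lets you slice the data programmatically; a minimal sketch:

import pstats

stats = pstats.Stats("profile.out")
stats.strip_dirs()                # drop long path prefixes for readability
stats.sort_stats("cumulative")    # or "tottime" to rank the functions themselves
stats.print_stats(10)             # show the ten most expensive entries
stats.print_callers("read")       # who calls functions whose names match "read"?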
4. Apply Optimization Strategies
Based on the findings, you can apply different strategies:
Caching: Store data in memory to avoid repeated I/O operations.
Batching: Combine multiple I/O operations into a single one to reduce overhead.
Asynchronous I/O: Use asyncio or other asynchronous programming techniques to perform I/O operations without blocking the main thread (see the sketch after this list).
Buffering: Use buffers to reduce the number of I/O calls, especially for disk I/O.
Data Compression: Reducing the size of the data being read or written can improve I/O performance, particularly for network and disk I/O.
Parallelism: Use multi-threading or multi-processing to perform I/O operations in parallel, especially when dealing with network I/O.
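As a sketch of the asynchronous I/O strategy (assuming Python 3.9+ for asyncio.to_thread; the URLs are made up for illustration), several blocking downloads can be overlapped instead of performed one after another:

import asyncio
import requests

async def fetch(url):
    # requests is blocking, so run it in a worker thread to keep the event loop free
    return await asyncio.to_thread(requests.get, url)

async def fetch_all(urls):
    # Issue all downloads concurrently and gather the responses
    return await asyncio.gather(*(fetch(url) for url in urls))

urls = [f"https://example.com/data/{i}.csv" for i in range(5)]  # hypothetical URLs
responses = asyncio.run(fetch_all(urls))

With an async-native HTTP client the worker threads would not be needed, but wrapping blocking calls with asyncio.to_thread is often the smallest change to an existing code base.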
5. Test and Iterate
After applying optimizations, profile your code again to see the impact.
Continue this cycle iteratively:
- Optimize
- Profile
- Analyze
- Adjust
6. Other Considerations
Ensure that your hardware is not a limiting factor. For database I/O, look into optimizing your database queries and indices. For file I/O, consider the file system and hardware it's running on.
7. Documentation and Community Resources
Read the documentation of the profiling tools you use for more detailed guidance. Engage with Python communities or forums for specific advice and best practices. Remember, optimization is often about trade-offs, and focusing on the parts of your code that will yield the most significant improvements is essential.
Weather Station Data Analysis and Profiling
As an example, I will analyze weather station data. The weather station records the hourly temperature, and its data files have the columns below.
"STATION","DATE","SOURCE","LATITUDE","LONGITUDE","ELEVATION","NAME","REPORT_TYPE","CALL_SIGN","QUALITY_CONTROL","WND","CIG","VIS","TMP","DEW","SLP","AA1","AA2","AA3","AJ1","KA1","KA2","OC1","OD1","OD2","REM"
Of all these columns, only STATION and TMP are needed for our analysis.
The analysis involves the following steps:
- Create a Python program that accepts two parameters: a comma-separated station list and a year range (start and end year separated by a hyphen)
- Download the weather station data as a CSV
- Parse the CSV and get all the temperatures for the station list and the year range provided in the parameters.
- Find the maximum, minimum, and average temperatures for the stations for the year range.
- Profile the code
- Analyze the I/O bottleneck.
- Implement local caching
- Analyze the output and the runtime.
Code Without Local Cache
This program downloads the weather data for the specified stations and calculates the low, high, and average temperatures for the given year range:
import csv
import sys
import requests
import collections
from statistics import mean

# This function downloads the weather data for a station/year and writes the output as a CSV file
def download_weather_station_data(station, year):
    my_url = generic_url.format(station=station, year=year)
    req = requests.get(my_url)
    if req.status_code != 200:
        return
    with open(generic_file.format(station=station, year=year), 'w') as sf:
        sf.write(req.text)

# This parent function downloads the weather data for the given station list and year range
def download_all_weather_station_data(stations_list, start_year, end_year):
    for station in stations_list:
        for year in range(start_year, end_year + 1):
            download_weather_station_data(station, year)

# This function yields the valid temperature readings from the file
def get_file_temperature(file_name):
    with open(file_name, 'r') as tf:
        reader = csv.reader(tf)
        header = next(reader)
        for row in reader:
            station = row[header.index("STATION")]
            # TMP holds "<temperature in tenths of a degree>,<quality code>"
            temp = row[header.index("TMP")]
            temperature, status = temp.split(",")
            if int(status) != 1:
                continue
            temperature = int(temperature) / 10
            yield temperature

# This parent function gets all the temperatures for the given stations and years
def get_temperatures_all(stations_list, start_year, end_year):
    temperatures = collections.defaultdict(list)
    for station in stations_list:
        for year in range(start_year, end_year + 1):
            for temperature in get_file_temperature(generic_file.format(station=station, year=year)):
                temperatures[station].append(temperature)
    return temperatures

# This function gets the maximum/minimum/average temperature per station over the given years
def get_temperatures(lst_temperatures, calc_mode):
    result = {}
    for mode in calc_mode:
        if mode == 'max':
            result[mode] = {station: max(temperatures) for station, temperatures in lst_temperatures.items()}
        elif mode == 'min':
            result[mode] = {station: min(temperatures) for station, temperatures in lst_temperatures.items()}
        else:
            result[mode] = {station: mean(temperatures) for station, temperatures in lst_temperatures.items()}
    return result

# Main function
if __name__ == "__main__":
    stations = sys.argv[1].split(",")
    years = [int(year) for year in sys.argv[2].split("-")]
    first_year = years[0]
    last_year = years[1]
    generic_url = "https://www.ncei.noaa.gov/data/global-hourly/access/{year}/{station}.csv"
    generic_file = "Weather_station_{station}_{year}.csv"
    download_all_weather_station_data(stations, first_year, last_year)
    temperatures_all = get_temperatures_all(stations, first_year, last_year)
    temperatures_values = get_temperatures(temperatures_all, ['max', 'min', 'avg'])
    print(f"The temperatures are {temperatures_values}")
Executed the code and got the desired output.
python load_weather_data.py "01480099999,02110099999,02243099999" 2018-2023
Output:
The temperatures are {'max': {'01480099999': 33.5, '02110099999': 29.6, '02243099999': 32.0}, 'min': {'01480099999': -20.4, '02110099999': -39.5, '02243099999': -32.1}, 'avg': {'01480099999': 7.145012712693135, '02110099999': 0.23863829994401306, '02243099999': 3.383049058515579}}
Analyze the Code With cProfile:
python -m cProfile -s cumulative load_weather_data.py "01480099999,02110099999,02243099999" 2018-2023 > load_weather_data_profile.txt
The temperatures are {'max': {'01480099999': 33.5, '02110099999': 29.6, '02243099999': 32.0}, 'min': {'01480099999': -20.4, '02110099999': -39.5, '02243099999': -32.1}, 'avg': {'01480099999': 7.1538004828081165, '02110099999': 0.23863829994401306, '02243099999': 3.383049058515579}}
1422783 function calls (1416758 primitive calls) in 17.250 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
181/1 0.002 0.000 17.250 17.250 {built-in method builtins.exec}
1 0.000 0.000 17.250 17.250 load_weather_data.py:1(<module>)
1 0.003 0.003 16.241 16.241 load_weather_data.py:23(download_all_weather_station_data)
18 0.003 0.000 16.221 0.901 load_weather_data.py:12(download_weather_station_data)
The function download_all_weather_station_data takes up nearly all of the runtime (16.2 of 17.25 seconds), making it the obvious target for I/O optimization.
Since the data is static, there is no need to download the CSV files again once they have been generated. The program below is optimized to skip downloading files that already exist locally.
"""This program downloads the weather data for the specified stations
and calculates low and high weather for the given year"""
import os
import csv
import sys
import fnmatch
import requests
import collections
from statistics import mean
# This function downloads the weather data for station/year and write the output as a csv file
def download_weather_station_data(station, year):
my_url = generic_url.format(station=station, year=year)
req = requests.get(my_url)
if req.status_code != 200:
return
with open(generic_file.format(station=station, year=year), 'w') as sf:
sf.write(req.text)
# This parent function downloads the weather data for the given station list and year range
def download_all_weather_station_data(stations_list, start_year, end_year):
for station in stations_list:
for year in range(start_year, end_year + 1):
if not os.path.exists(generic_file.format(station=station, year=year)):
download_weather_station_data(station, year)
# This function gets the temperature details from the file
def get_file_temperature(file_name):
with open(file_name, 'r') as tf:
reader = csv.reader(tf)
header = next(reader)
for row in reader:
station = row[header.index("STATION")]
temp = row[header.index("TMP")]
temperature, status = temp.split(",")
if int(status) != 1:
continue
temperature = int(temperature) / 10
yield temperature
# This parent function gets all the temperatures for the given station and year
def get_temperatures_all(stations_list, start_year, end_year):
temperatures = collections.defaultdict(list)
for station in stations_list:
for year in range(start_year, end_year + 1):
if os.path.exists(generic_file.format(station=station, year=year)):
for temperature in get_file_temperature(generic_file.format(station=station, year=year)):
temperatures[station].append(temperature)
return temperatures
# This function gets the maximum/minimum/average temperature for the station over the given years
def get_temperatures(lst_temperatures, calc_mode):
result = {}
for mode in calc_mode:
if mode == 'max':
result[mode] = {station: max(temperatures) for station, temperatures in lst_temperatures.items()}
elif mode == 'min':
result[mode] = {station: min(temperatures) for station, temperatures in lst_temperatures.items()}
else:
result[mode] = {station: mean(temperatures) for station, temperatures in lst_temperatures.items()}
return result
# Main Function
if __name__ := "__main__":
stations = sys.argv[1].split(",")
years = [int(year) for year in sys.argv[2].split("-")]
first_year = years[0]
last_year = years[1]
generic_url = "https://www.ncei.noaa.gov/data/global-hourly/access/{year}/{station}.csv"
generic_file = "Weather_station_{station}_{year}.csv"
current_directory = os.getcwd()
download_all_weather_station_data(stations, first_year, last_year)
count = len(fnmatch.filter(os.listdir(current_directory), '*.csv'))
if count > 0:
temperatures_all = get_temperatures_all(stations, first_year, last_year)
temperatures_values = get_temperatures(temperatures_all, ['max', 'min', 'avg'])
print(f"The temperatures are {temperatures_values}")
else:
print(f"There are no file(s) available for the given stations {sys.argv[1]} and years {sys.argv[2]}")
Executed the code and got the desired output.
python load_weather_data_cache.py "01480099999,02110099999,02243099999" 2018-2023
Result:
The temperatures are {'max': {'01480099999': 33.5, '02110099999': 29.6, '02243099999': 32.0}, 'min': {'01480099999': -20.4, '02110099999': -39.5, '02243099999': -32.1}, 'avg': {'01480099999': 7.145012712693135, '02110099999': 0.23863829994401306, '02243099999': 3.383049058515579}}
Analyze the Code With cProfile:
python -m cProfile -s cumulative load_weather_data_cache.py "01480099999,02110099999,02243099999" 2018-2023 > load_weather_data_cache_profile.txt
The temperatures are {'max': {'01480099999': 33.5, '02110099999': 29.6, '02243099999': 32.0}, 'min': {'01480099999': -20.4, '02110099999': -39.5, '02243099999': -32.1}, 'avg': {'01480099999': 7.1538004828081165, '02110099999': 0.23863829994401306, '02243099999': 3.383049058515579}}
1142084 function calls (1136170 primitive calls) in 1.005 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
179/1 0.002 0.000 1.005 1.005 {built-in method builtins.exec}
1 0.000 0.000 1.005 1.005 load_weather_data_cache.py:1(<module>)
1 0.040 0.040 0.670 0.670 load_weather_data_cache.py:50(get_temperatures_all)
119125 0.516 0.000 0.619 0.000 load_weather_data_cache.py:33(get_file_temperature)
17 0.000 0.000 0.367 0.022 __init__.py:1(<module>)
The function download_all_weather_station_data no longer dominates the runtime. The overall runtime dropped from about 17.25 seconds to about 1 second, roughly a 17x improvement.
Conclusion
Caches, as demonstrated in this example, can accelerate code by an order of magnitude or more. However, managing caches is challenging and a frequent source of bugs. In this example the files remain static over time, but there are numerous scenarios in which the cached data can change. In such situations, the code responsible for cache management must be able to detect and handle those changes.
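As a hedged illustration of one such safeguard (the one-week cutoff below is an arbitrary choice for the example), the os.path.exists check used above could be extended so that stale cache files are re-downloaded:

import os
import time

CACHE_MAX_AGE_SECONDS = 7 * 24 * 3600  # arbitrary: treat files older than a week as stale

def is_cache_fresh(file_name):
    # A file counts as a valid cache entry only if it exists and is recent enough
    if not os.path.exists(file_name):
        return False
    return time.time() - os.path.getmtime(file_name) < CACHE_MAX_AGE_SECONDS

download_all_weather_station_data would then call download_weather_station_data whenever is_cache_fresh(...) returns False, instead of checking for the file's existence alone.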