
Optimization of I/O Workloads by Profiling in Python

Optimizing Python code to run faster.

By Anandaganesh Balakrishnan · Dec. 24, 2023 · Tutorial


Optimizing I/O workloads in Python typically involves understanding where the bottlenecks are and then applying strategies to reduce or manage these bottlenecks. Profiling is a crucial step in this process as it helps identify the parts of the code that are most resource-intensive. Here's a step-by-step guide to optimizing I/O workloads by profiling in Python:

1. Identify the I/O Workloads

Understanding the nature of your I/O workloads is an essential first step. Do they involve disk I/O, such as file read/write operations; network I/O, which includes data transmission over a network; or database I/O, comprising database interactions? Each category calls for distinct optimization techniques. This article focuses on I/O bottlenecks related to network and file read/write operations.

2. Use Profiling Tools

There are several tools available for profiling Python code:

cProfile:

cProfile is the most commonly used profiler in Python. Because it is a C extension with manageable overhead, it is generally advised for most users and is appropriate for profiling programs that run for extended periods. It is widely used for several reasons:

Built-in and Standard: cProfile is a part of the Python Standard Library, which means it's readily available in any standard Python installation without additional packages.

Low Overhead: As a C extension, cProfile introduces relatively low overhead compared to some pure Python profilers. This feature makes it suitable for profiling longer-running applications where the profiler's impact on performance is a concern.

General-purpose Profiling: cProfile is suitable for most profiling needs, balancing detail and usability. It gives you a function-by-function breakdown of execution time, which is precisely what is needed to identify performance bottlenecks.

Wide Acceptance and Community Support: Given its status as part of the standard library and ease of use, cProfile has a broad user base and community support.
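
For illustration, here is a minimal, self-contained sketch showing how cProfile can be driven programmatically and summarized with pstats. The function and file name are placeholders, not part of the weather example later in this article:

Python
 
import cProfile
import io
import pstats


def io_heavy_task():
    # Placeholder I/O-bound work: write a file once, then re-read it repeatedly
    with open("example.txt", "w") as f:
        f.write("hello\n" * 1000)
    for _ in range(100):
        with open("example.txt") as f:
            f.read()


profiler = cProfile.Profile()
profiler.enable()
io_heavy_task()
profiler.disable()

# Print the ten most expensive calls, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
print(stream.getvalue())


The same data can be collected non-invasively from the command line with python -m cProfile -s cumulative script.py, which is the form used later in this article.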

While cProfile is the most commonly used, it's important to note that the best profiler for a given task can depend on the project's specific needs. For instance, line_profiler is preferred for line-by-line analysis, and memory_profiler is used when memory usage is the primary concern. The choice of a profiler often depends on the specific aspect of the application you want to optimize.

line_profiler:

line_profiler is a third-party tool that provides line-by-line profiling of your code, allowing you to see the cost of each individual line. This level of granularity can be beneficial when you're trying to optimize your code and need to understand exactly where the bottlenecks are.
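
As a minimal sketch of the usual line_profiler workflow (the file name below is a placeholder):

Python
 
# Requires: pip install line_profiler
# Run with: kernprof -l -v this_script.py
# kernprof injects the `profile` decorator into builtins at runtime, so no
# import is needed; running the script directly would raise a NameError.

@profile
def sum_line_lengths(file_name):
    total = 0
    with open(file_name) as f:
        for line in f:  # the report shows time spent on each of these lines
            total += len(line)
    return total


if __name__ == "__main__":
    sum_line_lengths("example.txt")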

memory_profiler:

This third-party profiler is helpful if you suspect that memory usage is contributing to I/O inefficiency.
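
A minimal sketch of its use, assuming the memory_profiler package is installed and example.txt is a placeholder file:

Python
 
# Requires: pip install memory-profiler
from memory_profiler import profile


@profile
def load_whole_file(file_name):
    # Reading the entire file at once; the line-by-line report shows the
    # memory increment attributable to this read
    with open(file_name) as f:
        data = f.read()
    return len(data)


if __name__ == "__main__":
    load_whole_file("example.txt")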

3. Analyze the Profile Data

After running the profiler, analyze the data to find where most of the time is spent. The profiling output will generally point to long-running I/O operations, repeated I/O operations that could be batched, and unnecessary I/O operations that could be eliminated.
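
For instance, if the profile data has been saved with python -m cProfile -o profile.out script.py (the file names here are placeholders), the standard library's pstats module can slice it after the fact:

Python
 
import pstats

# Load stats saved earlier with: python -m cProfile -o profile.out script.py
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative")
stats.print_stats(10)        # the ten most expensive calls by cumulative time
stats.print_callers("read")  # who calls functions whose names contain "read"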

4. Apply Optimization Strategies

Based on the findings, you can apply different strategies:

Caching: Store data in memory to avoid repeated I/O operations (see the sketch after this list).

Batching: Combine multiple I/O operations into a single one to reduce overhead.

Asynchronous I/O: Use asyncio or other asynchronous programming techniques to perform I/O operations without blocking the main thread (also illustrated in the sketch after this list).

Buffering: Use buffers to reduce the number of I/O calls, especially for disk I/O.

Data Compression: Reducing the size of the data being read or written can improve I/O performance, particularly for network and disk I/O.

Parallelism: Use multi-threading or multi-processing to perform I/O operations in parallel, especially when dealing with network I/O.
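
Below is a brief, hypothetical sketch of two of these strategies, caching and asynchronous I/O; the URLs are placeholders, and asyncio.to_thread requires Python 3.9 or later:

Python
 
import asyncio
import functools
import urllib.request


@functools.lru_cache(maxsize=128)
def fetch_url(url):
    # Caching: repeated calls with the same URL are served from memory
    # instead of re-issuing the network request
    with urllib.request.urlopen(url) as resp:
        return resp.read()


async def fetch_all(urls):
    # Asynchronous I/O: run the blocking fetches in worker threads so that
    # they overlap instead of executing one after another
    return await asyncio.gather(
        *(asyncio.to_thread(fetch_url, url) for url in urls)
    )


if __name__ == "__main__":
    pages = asyncio.run(fetch_all([
        "https://example.com/a",
        "https://example.com/b",
    ]))
    print([len(page) for page in pages])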

5. Test and Iterate

After applying optimizations, profile your code again to measure the impact, and continue the cycle iteratively:

  • Optimize
  • Profile
  • Analyze
  • Change

6. Other Considerations

Ensure that your hardware is not a limiting factor. For database I/O, look into optimizing your database queries and indices. For file I/O, consider the file system and hardware it's running on.
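
As a small, hypothetical illustration of the database point above (using the standard library's sqlite3 with an in-memory table), an index on the filtered column lets the query planner avoid a full table scan:

Python
 
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?)",
    [("01480099999", 7.1), ("02110099999", 0.2)],
)

# Without an index, this query scans the whole table
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM readings WHERE station = ?",
    ("01480099999",),
).fetchall())

# With an index, the planner switches to an index search
conn.execute("CREATE INDEX idx_station ON readings (station)")
print(conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM readings WHERE station = ?",
    ("01480099999",),
).fetchall())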

7. Documentation and Community Resources

Read the documentation of the profiling tools you use for more detailed guidance. Engage with Python communities or forums for specific advice and best practices. Remember, optimization is often about trade-offs, and focusing on the parts of your code that will yield the most significant improvements is essential.

Weather Station Data Analysis and Profiling

I have taken the analysis of weather station data as an example. Each weather station records hourly temperatures, and the data contains the columns below.

Plain Text
 
"STATION","DATE","SOURCE","LATITUDE","LONGITUDE","ELEVATION","NAME","REPORT_TYPE","CALL_SIGN","QUALITY_CONTROL","WND","CIG","VIS","TMP","DEW","SLP","AA1","AA2","AA3","AJ1","KA1","KA2","OC1","OD1","OD2","REM"

 

Of all the columns, only STATION and TMP are needed for our analysis.

I am following the steps below:

  1. Create a Python program that accepts two parameters: a comma-separated station list and a year range (start and end year separated by a hyphen).
  2. Download the weather station data as CSV files.
  3. Parse the CSVs and collect all the temperatures for the stations and year range provided in the parameters.
  4. Find the maximum, minimum, and average temperatures for those stations over the year range.
  5. Profile the code.
  6. Analyze the I/O bottleneck.
  7. Implement local caching.
  8. Analyze the output and the runtime.

Code Without Local Cache

This program downloads the weather data for the specified stations and calculates the low and high temperatures for the given years:

Python
 
import csv
import sys
import requests
import collections
from statistics import mean


# This function downloads the weather data for station/year and writes the output as a csv file
def download_weather_station_data(station, year):
    my_url = generic_url.format(station=station, year=year)
    req = requests.get(my_url)
    if req.status_code != 200:
        return

    with open(generic_file.format(station=station, year=year), 'w') as sf:
        sf.write(req.text)


# This parent function downloads the weather data for the given station list and year range
def download_all_weather_station_data(stations_list, start_year, end_year):
    for station in stations_list:
        for year in range(start_year, end_year + 1):
            download_weather_station_data(station, year)


# This function gets the temperature details from the file
def get_file_temperature(file_name):
    with open(file_name, 'r') as tf:
        reader = csv.reader(tf)
        header = next(reader)

        for row in reader:
            station = row[header.index("STATION")]
            temp = row[header.index("TMP")]
            temperature, status = temp.split(",")
            if int(status) != 1:
                continue
            temperature = int(temperature) / 10

            yield temperature


# This parent function gets all the temperatures for the given station and year
def get_temperatures_all(stations_list, start_year, end_year):
    temperatures = collections.defaultdict(list)
    for station in stations_list:
        for year in range(start_year, end_year + 1):
            for temperature in get_file_temperature(generic_file.format(station=station, year=year)):
                temperatures[station].append(temperature)
    return temperatures


# This function gets the maximum/minimum/average temperature for the station over the given years
def get_temperatures(lst_temperatures, calc_mode):
    result = {}
    for mode in calc_mode:
        if mode == 'max':
            result[mode] = {station: max(temperatures) for station, temperatures in lst_temperatures.items()}
        elif mode == 'min':
            result[mode] = {station: min(temperatures) for station, temperatures in lst_temperatures.items()}
        else:
            result[mode] = {station: mean(temperatures) for station, temperatures in lst_temperatures.items()}
    return result


# Main Function
if __name__ == "__main__":
    stations = sys.argv[1].split(",")
    years = [int(year) for year in sys.argv[2].split("-")]
    first_year = years[0]
    last_year = years[1]

    generic_url = "https://www.ncei.noaa.gov/data/global-hourly/access/{year}/{station}.csv"
    generic_file = "Weather_station_{station}_{year}.csv"

    download_all_weather_station_data(stations, first_year, last_year)
    temperatures_all = get_temperatures_all(stations, first_year, last_year)
    temperatures_values = get_temperatures(temperatures_all, ['max', 'min', 'avg'])

    print(f"The temperatures are {temperatures_values}")


I executed the code and got the desired output:

python load_weather_data.py "01480099999,02110099999,02243099999" 2018-2023

Output:

Plain Text
 
The temperatures are {'max': {'01480099999': 33.5, '02110099999': 29.6, '02243099999': 32.0}, 'min': {'01480099999': -20.4, '02110099999': -39.5, '02243099999': -32.1}, 'avg': {'01480099999': 7.145012712693135, '02110099999': 0.23863829994401306, '02243099999': 3.383049058515579}}


Analyze the Code With cProfile:

python -m cProfile -s cumulative load_weather_data.py "01480099999,02110099999,02243099999" 2018-2023 > load_weather_data_profile.txt

Shell
 
The temperatures are {'max': {'01480099999': 33.5, '02110099999': 29.6, '02243099999': 32.0}, 'min': {'01480099999': -20.4, '02110099999': -39.5, '02243099999': -32.1}, 'avg': {'01480099999': 7.1538004828081165, '02110099999': 0.23863829994401306, '02243099999': 3.383049058515579}}

         1422783 function calls (1416758 primitive calls) in 17.250 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    181/1    0.002    0.000   17.250   17.250 {built-in method builtins.exec}
        1    0.000    0.000   17.250   17.250 load_weather_data.py:1(<module>)
        1    0.003    0.003   16.241   16.241 load_weather_data.py:23(download_all_weather_station_data)
       18    0.003    0.000   16.221    0.901 load_weather_data.py:12(download_weather_station_data)


The function download_all_weather_station_data accounts for most of the runtime, which leaves room to optimize the I/O.

Since the data is static, there is no need to download the CSV files again once they have been generated.

The program below is optimized to skip downloading files that already exist locally.

Python
 
"""This program downloads the weather data for the specified stations
 and calculates low and high weather for the given year"""
 
 import os
 import csv
 import sys
 import fnmatch
 import requests
 import collections
 from statistics import mean
 
 
 # This function downloads the weather data for station/year and write the output as a csv file
 def download_weather_station_data(station, year):
     my_url = generic_url.format(station=station, year=year)
     req = requests.get(my_url)
     if req.status_code != 200:
         return
 
     with open(generic_file.format(station=station, year=year), 'w') as sf:
         sf.write(req.text)
 
 
 # This parent function downloads the weather data for the given station list and year range
 def download_all_weather_station_data(stations_list, start_year, end_year):
     for station in stations_list:
         for year in range(start_year, end_year + 1):
             if not os.path.exists(generic_file.format(station=station, year=year)):
                 download_weather_station_data(station, year)
 
 
 # This function gets the temperature details from the file
 def get_file_temperature(file_name):
     with open(file_name, 'r') as tf:
         reader = csv.reader(tf)
         header = next(reader)
 
         for row in reader:
             station = row[header.index("STATION")]
             temp = row[header.index("TMP")]
             temperature, status = temp.split(",")
             if int(status) != 1:
                 continue
             temperature = int(temperature) / 10
 
             yield temperature
 
 
 # This parent function gets all the temperatures for the given station and year
 def get_temperatures_all(stations_list, start_year, end_year):
     temperatures = collections.defaultdict(list)
     for station in stations_list:
         for year in range(start_year, end_year + 1):
             if os.path.exists(generic_file.format(station=station, year=year)):
                 for temperature in get_file_temperature(generic_file.format(station=station, year=year)):
                     temperatures[station].append(temperature)
     return temperatures
 
 
 # This function gets the maximum/minimum/average temperature for the station over the given years
 def get_temperatures(lst_temperatures, calc_mode):
     result = {}
     for mode in calc_mode:
         if mode == 'max':
             result[mode] = {station: max(temperatures) for station, temperatures in lst_temperatures.items()}
         elif mode == 'min':
             result[mode] = {station: min(temperatures) for station, temperatures in lst_temperatures.items()}
         else:
             result[mode] = {station: mean(temperatures) for station, temperatures in lst_temperatures.items()}
     return result
 
 
 # Main Function
 if __name__ := "__main__":
     stations = sys.argv[1].split(",")
     years = [int(year) for year in sys.argv[2].split("-")]
     first_year = years[0]
     last_year = years[1]
 
     generic_url = "https://www.ncei.noaa.gov/data/global-hourly/access/{year}/{station}.csv"
     generic_file = "Weather_station_{station}_{year}.csv"
     current_directory = os.getcwd()
 
     download_all_weather_station_data(stations, first_year, last_year)
 
     count = len(fnmatch.filter(os.listdir(current_directory), '*.csv'))
 
     if count > 0:
         temperatures_all = get_temperatures_all(stations, first_year, last_year)
         temperatures_values = get_temperatures(temperatures_all, ['max', 'min', 'avg'])
         print(f"The temperatures are {temperatures_values}")
     else:
         print(f"There are no file(s) available for the given stations {sys.argv[1]} and years {sys.argv[2]}")


I executed the code and got the desired output:

python load_weather_data_cache.py "01480099999,02110099999,02243099999" 2018-2023                                                                 

Result:

Shell
 
The temperatures are {'max': {'01480099999': 33.5, '02110099999': 29.6, '02243099999': 32.0}, 'min': {'01480099999': -20.4, '02110099999': -39.5, '02243099999': -32.1}, 'avg': {'01480099999': 7.145012712693135, '02110099999': 0.23863829994401306, '02243099999': 3.383049058515579}}


Analyze the Code With cProfile:

python -m cProfile -s cumulative load_weather_data_cache.py "01480099999,02110099999,02243099999" 2018-2023 > load_weather_data_cache_profile.txt

Shell
 
The temperatures are {'max': {'01480099999': 33.5, '02110099999': 29.6, '02243099999': 32.0}, 'min': {'01480099999': -20.4, '02110099999': -39.5, '02243099999': -32.1}, 'avg': {'01480099999': 7.1538004828081165, '02110099999': 0.23863829994401306, '02243099999': 3.383049058515579}}

         1142084 function calls (1136170 primitive calls) in 1.005 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    179/1    0.002    0.000    1.005    1.005 {built-in method builtins.exec}
        1    0.000    0.000    1.005    1.005 load_weather_data_cache.py:1(<module>)
        1    0.040    0.040    0.670    0.670 load_weather_data_cache.py:50(get_temperatures_all)
   119125    0.516    0.000    0.619    0.000 load_weather_data_cache.py:33(get_file_temperature)
       17    0.000    0.000    0.367    0.022 __init__.py:1(<module>)


The function download_all_weather_station_data no longer appears among the top consumers of runtime. The overall runtime decreased roughly 17-fold, from 17.250 to 1.005 seconds.

Conclusion

Caches, as demonstrated in this example, can accelerate code by an order of magnitude or more. However, managing caches can be challenging and often leads to bugs. In this example, the files remain static over time, but there are numerous scenarios in which the cached data could change. In such situations, the code responsible for cache management must be able to detect and handle those changes.
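
As one hedged sketch of such a check (it assumes the server exposes a Last-Modified header, which may not hold for every endpoint), the cached weather files could be validated before reuse:

Python
 
import email.utils
import os

import requests


def is_cache_stale(local_file, url):
    """Return True if the remote resource appears newer than the local copy."""
    if not os.path.exists(local_file):
        return True
    head = requests.head(url)
    last_modified = head.headers.get("Last-Modified")
    if last_modified is None:
        return True  # no freshness information; re-download to be safe
    remote_time = email.utils.parsedate_to_datetime(last_modified).timestamp()
    return remote_time > os.path.getmtime(local_file)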


Opinions expressed by DZone contributors are their own.
