Making Code Faster: Streamlining the Output

Ayende Rahien continues his series on how to make your code faster.

Oren Eini · Nov. 27, 16 · Tutorial

After looking at the profiler results, I realized that we are actually spending a considerable amount of time just writing the output to a file. That didn't really matter when our code ran in 30+ seconds; spending another 100 to 200 ms to write the results was just noise. But when our code does its work in under a second, that is a considerable cost.

I’m running this code on a different machine, so we can’t compare directly with the earlier numbers. On this machine, the initial version runs in 38,478 ms, allocates 7,612,741 kb, and has a peak working set of 874,660 kb.

The latest version runs in 842 ms, allocates 208,435 kb, and has a peak working set of 375,452 kb.

So, we are 45 times faster than the initial version.

The problem is that doing this in parallel shaves off quite a lot of time and masks some inefficiencies, so I decided to change it back to a single-threaded approach, which gives 1,498 ms, allocates 123,787 kb, and has a peak working set of 319,436 kb.

Merely 25 times faster than the original version.

And now let's focus on the output.

private static void WriteOutput(Dictionary<int, FastRecord> stats)
{
    using (var output = File.CreateText("summary.txt"))
    {
        foreach (var entry in stats)
        {
            output.WriteLine($"{entry.Value.Id:D10} {TimeSpan.FromTicks(entry.Value.DurationInTicks):c}");
        }
    }
}

This is pretty simple code, but it hides a lot of inefficiencies. In particular, it does a lot of allocations as it formats the string. We can do much better.
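For context, on the compilers of that era the interpolated string above is lowered into something close to the following, which is where the allocations come from (this is a sketch of the lowered form, not the exact generated code):

// Roughly what the interpolated WriteLine turns into: both format
// arguments are boxed, and a new string is allocated for every line.
output.WriteLine(string.Format("{0:D10} {1:c}",
    entry.Value.Id,                                    // int boxed to object
    TimeSpan.FromTicks(entry.Value.DurationInTicks))); // TimeSpan boxed, then formatted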

Merely changing the WriteLine to:

output.WriteLine($"{entry.Value.Id} {entry.Value.DurationInTicks}");

Saved us close to 200 ms (!), so there is a lot of room to improve here. Again, this is mostly a matter of writing highly specific code for this exact scenario. Here is what I did:

private static void WriteOutput(Dictionary<int, FastRecord> stats)
{
    // Fixed-width line: 10-digit id, space, hh:mm:ss, then \r\n = 21 bytes.
    // The static separators are written into the buffer once, up front.
    byte[] tempBuffer = new byte[21];
    tempBuffer[10] = (byte)' ';
    tempBuffer[13] = (byte)':';
    tempBuffer[16] = (byte)':';
    tempBuffer[19] = (byte)'\r';
    tempBuffer[20] = (byte)'\n';
    using (var output = File.Create("summary.txt"))
    {
        foreach (var entry in stats)
        {
            WriteFormattedInt(entry.Value.Id, tempBuffer, 0, 10);
            var ts = TimeSpan.FromTicks(entry.Value.DurationInTicks);
            WriteFormattedInt(ts.Hours, tempBuffer, 11, 2);
            WriteFormattedInt(ts.Minutes, tempBuffer, 14, 2);
            WriteFormattedInt(ts.Seconds, tempBuffer, 17, 2);
            output.Write(tempBuffer, 0, tempBuffer.Length);
        }
    }
}

private static void WriteFormattedInt(long id, byte[] temp, int pos, int numberOfDigits)
{
    // Write the digits right to left, then pad the rest with leading zeros.
    int i = pos + numberOfDigits - 1;
    do
    {
        temp[i--] = (byte)((id % 10) + '0');
    } while ((id /= 10) > 0);
    while (i >= pos)
    {
        temp[i--] = (byte)'0';
    }
}

private static void WriteFormattedInt(int id, byte[] temp, int pos, int numberOfDigits)
{
    // Same as above, but stays in 32-bit arithmetic throughout.
    int i = pos + numberOfDigits - 1;
    do
    {
        temp[i--] = (byte)((id % 10) + '0');
    } while ((id /= 10) > 0);
    while (i >= pos)
    {
        temp[i--] = (byte)'0';
    }
}

I wrote a simple function to format the number into a buffer, then changed the summary code to write each record into a prepared buffer (skipping all the static parts, which are set just once up front) and write it to the file in one shot.
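To make the fixed layout easier to follow, here is a quick illustration (the values are made up) of how the 21-byte buffer maps onto one line of summary.txt, using the same helpers as above:

// Offsets: [0..9] id, [10] ' ', [11..12] hh, [13] ':', [14..15] mm,
//          [16] ':', [17..18] ss, [19..20] "\r\n" -- 21 bytes per line.
byte[] line = new byte[21];
line[10] = (byte)' ';
line[13] = (byte)':';
line[16] = (byte)':';
line[19] = (byte)'\r';
line[20] = (byte)'\n';

WriteFormattedInt(42L, line, 0, 10); // the id, zero-padded to 10 digits
var ts = TimeSpan.FromTicks(1 * TimeSpan.TicksPerHour +
                            23 * TimeSpan.TicksPerMinute +
                            45 * TimeSpan.TicksPerSecond);
WriteFormattedInt(ts.Hours, line, 11, 2);
WriteFormattedInt(ts.Minutes, line, 14, 2);
WriteFormattedInt(ts.Seconds, line, 17, 2);

// Prints: 0000000042 01:23:45
Console.Write(System.Text.Encoding.ASCII.GetString(line));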


And the result is 1,191 ms, with 16,942 kb allocated and a peak working set of 311,432 kb.

You might have noticed that I have two copies of WriteFormattedInt; this is to skip the implicit cast to long, and yes, it matters, by about 50 ms in my tests. But this version also reduces our allocations by over 100 MB! So this is great.
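To make that concrete: without the int overload, the three two-digit calls per record would bind to the long version, so each call site effectively becomes a widening conversion followed by 64-bit division and modulus in the hot loop:

// With only the long overload available, this call...
WriteFormattedInt(ts.Hours, tempBuffer, 11, 2);
// ...is really
WriteFormattedInt((long)ts.Hours, tempBuffer, 11, 2); // implicit int -> long widening
// and the % 10 and /= 10 inside then run as 64-bit operations.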

And here are the profiler results from analyzing this method:

[profiler results screenshot]

This function is now almost seven times faster! That is pretty awesome, and even looking at single-threaded performance alone, we are 32 times better than the original version.

Trying the parallel version gives me 731 ms, with 101,565 kb allocated and a peak working set of 381,224 kb.

That is a total improvement of 52 times! But we can do even more… I’ll talk about that in the next post.

Published at DZone with permission of Oren Eini, DZone MVB. See the original article here.
