
Distributed Network Service for Users Activity Limiting (Part 1)

In this article, explore how to implement a stable, robust, and fault-tolerant user activity limiting algorithm for a distributed service.

By Alexey Chashchegorov · Feb. 08, 22 · Tutorial


Scenario

Any service provider tries to reach several metrics in its activity. One group of these metrics is service quality. Quality metrics include:

  • The ratio of successfully processed requests
  • The distribution of processing time between requests
  • Curves showing how these values depend on the number of requests

But what metric shows the monopolization of service hardware by a group of users? Without a fair statement of this problem and the right technical solution, such a metric does not exist, and its absence reduces the quality of the service and user satisfaction.

Let’s consider how hardware resource usage is distributed among several users.

Users fall into two groups: common users and aggressive users. Common users generate requests to random destination hosts with a limited processing time. Aggressive users make the same kind of requests, but in much higher numbers. An aggressive user can also try to speed things up by targeting only the most productive server of the service and ignoring the rest. Under this strategy, service hosts can spend most of their time processing the aggressors' requests.

Limitation Problem Statement

To prevent the monopolization of service resources, the most significant indicators of aggressive activity need to be listed:

  • The number of sessions currently open across all hosts of the distributed service, for each client

  • The number of sessions opened within a sliding time window across all hosts of the distributed service, for each client

  • The total amount of time spent on the currently open sessions across all hosts of the distributed service, for each client

All of these factors can be used to limit users' activity and can serve as parameters of a metric that shows the monopolization factor for a user or a group of users:

monopolization = (sessions / sessions_limit) + (window_max_openings / openings_limit) + (sessions_time / sessions_time_limit)

This formula gives an initial, relative form of the metric that can then be analyzed and minimized.
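As a minimal sketch, assuming per-client counters are already collected, the metric can be computed like this (the structure and limit names mirror the formula above; values are illustrative):

from dataclasses import dataclass

@dataclass
class Limits:
    sessions_limit: int          # max concurrent sessions per client
    openings_limit: int          # max openings per sliding window
    sessions_time_limit: float   # max total session time, in seconds

def monopolization(sessions: int, window_max_openings: int,
                   sessions_time: float, limits: Limits) -> float:
    # Each term is a usage-to-limit ratio; a value near or above 1.0
    # in any term means the client approaches or exceeds that limit.
    return (sessions / limits.sessions_limit
            + window_max_openings / limits.openings_limit
            + sessions_time / limits.sessions_time_limit)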

Prospective Solutions to the Limitation Problem

There are several types of algorithms for measuring users' activity and blocking the users who exceed the limits:

  • Token bucket based algorithms

  • Fixed time window counter algorithms

  • Sliding time window counter algorithms

Token Bucket Based Algorithm

Each time quantum, the number of tokens in the bucket increases by 1. For each established connection, the token count decreases by 1. The number of tokens in the bucket is capped: the count only increases while it is under the limit.

This algorithm limits the number of session openings in a time window well in trivial use cases. However, a case with just two service hosts already shows that it scales poorly. To share a common limit between two instances, the bucket size has to be tunable on each instance, and when the limit is exhausted on one host, the other host's limit has to be reduced to 0. This means a limit redistribution phase has to run on every connection.

Token bucket on distributed network phases

Each redistribution phase includes two network exchanges with a common shared resource, so every session opening requires this long network exchange just to tune the commonly shared limit.
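The local part of the algorithm is simple. A minimal single-instance sketch, using a lazy refill instead of a timer, could look like this; the distributed redistribution phase described above is exactly what this local version lacks:

import time

class TokenBucket:
    # Single-instance token bucket: +1 token per time quantum,
    # -1 token per established connection, count capped at `capacity`.

    def __init__(self, capacity: int, quantum_seconds: float):
        self.capacity = capacity
        self.quantum = quantum_seconds
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def _refill(self) -> None:
        elapsed = time.monotonic() - self.last_refill
        quanta = int(elapsed / self.quantum)
        if quanta > 0:
            # Increase only while the count is under the limit.
            self.tokens = min(self.capacity, self.tokens + quanta)
            self.last_refill += quanta * self.quantum

    def try_open_session(self) -> bool:
        self._refill()
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False   # limit exceeded: block this session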

Fixed Time Window Counter Algorithm

With a fixed window, each instance of the distributed service maintains its own locally computed limit. Time windows have equal durations and follow one after another. At the end of each window, every instance reports its local metric value; the report is stored in a common resource (a database, for example). The instance then receives the accumulated metric data of the other instances in the service, and its local limit is reduced by the accumulated values of the others.

Fixed window algorithm phases

The most important benefit of this algorithm is that it limits the number of exchanges with the common resource: just one exchange per time window of tunable duration is needed. The downside is that decision making cannot reflect changes that take place inside the time window. The limit changes only at the end of the window, although it might be reasonable to change it mid-window if the limit overflowed earlier. A limit overflow in the middle of the window will therefore lead to false positive errors.
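A minimal sketch of the per-instance logic, where `report_and_fetch` stands in for the once-per-window database exchange (the name and signature are illustrative):

class FixedWindowLimiter:
    def __init__(self, shared_limit: int):
        self.shared_limit = shared_limit   # limit shared by all instances
        self.local_count = 0               # openings seen in this window
        self.peers_count = 0               # accumulated value of the peers

    def try_open_session(self) -> bool:
        # The local limit is the shared limit reduced by what the
        # other instances reported for the previous window.
        if self.local_count + self.peers_count >= self.shared_limit:
            return False
        self.local_count += 1
        return True

    def on_window_end(self, report_and_fetch) -> None:
        # The single exchange per window: store our local count in the
        # common resource and get back the sum of the other instances.
        self.peers_count = report_and_fetch(self.local_count)
        self.local_count = 0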

Sliding Time Window Algorithm

This is an improvement of the “fixed time window counter” algorithm. It keeps the same well-scaled number of exchanges, defined by the time window, but the limit at each instance is flexible inside the window. In the trivial case, the limit reduction is the aggregated value from the other instances multiplied by a factor that decreases linearly from the beginning to the end of the time window:

limit_reduction = other_instances_value * (time_to_window_end / time_window_duration)

This approach makes it possible to use an empirical limit reduction suited to the current use case.
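Expressed directly as code, with an `effective_limit` helper added for illustration:

def limit_reduction(other_instances_value: float,
                    time_to_window_end: float,
                    time_window_duration: float) -> float:
    # The peers' last report weighs fully at the start of the window
    # and decays linearly to zero by the end of the window.
    return other_instances_value * (time_to_window_end / time_window_duration)

def effective_limit(shared_limit: float, other_instances_value: float,
                    time_to_window_end: float,
                    time_window_duration: float) -> float:
    return shared_limit - limit_reduction(other_instances_value,
                                          time_to_window_end,
                                          time_window_duration)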

Use Case Details

The sliding time window algorithm is the best candidate for a soft blocking strategy. This approach follows the principle that if decision making is not clear, it is better to pass sessions than to block them.

Edge Use Cases

Decision making becomes unclear when the structure of the decentralized limiting system changes:

  • The database becomes overloaded or unavailable

  • One or several endpoints become unavailable

  • The database restarts

  • One or several endpoints restart

Database Timeouts

A real database will not be accessible at all times. Under high load, or when the database service is turned off, instances need to behave in an expected way. This behavior can be based on two polar approaches:

  • Database is not accessible: no limitation from service peer instances should be applied

  • Database is not accessible: it might come back soon, so hold the peers' limitation factor

Both approaches have downsides in specific use cases. If the database connection is lost only briefly, the first approach raises the user limit significantly, which is not reasonable. If the connection is lost for a long time, the second approach holds the peers' limiting factors stable for too long to remain adequate.

It is better to use a combination of the approaches: act with approach two for some reasonable time (the duration of several time windows), and then switch to approach one and ignore the peers' limiting factor (see the sketch after the figure below).

Complex operating on database timeouts
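A minimal sketch of this combined reaction, assuming the number of hold windows is a tuning parameter (the default value below is illustrative):

def peers_factor_on_db_timeout(last_known_factor: float,
                               windows_since_db_loss: int,
                               hold_windows: int = 3) -> float:
    # Approach two for a while: assume the database comes back soon
    # and keep the last known peers' limiting factor.
    if windows_since_db_loss < hold_windows:
        return last_known_factor
    # Approach one afterwards: the outage is long, so ignore the
    # limitation from service peer instances entirely.
    return 0.0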

Endpoints Connections Lost

When the connection to one or several endpoints is lost, the limiting factor needs to be rebalanced toward the endpoints that are still available. At the same time, the endpoint's absence may turn out to be short. To rebalance the limits, data about previous metric values for each endpoint should be stored for some reasonable time window and covered by a retention policy: once a metric measurement becomes too old, it is deleted. This makes it possible to use the last available metric value, or a combination of values, for each peer instance when calculating the peers' limitation factor (see the sketch after the figure below).

Complex operating on endpoints timeouts
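A sketch of such a store, assuming wall-clock timestamps and an in-memory history per endpoint (the structure and names are illustrative):

import time

class PeerMetricsStore:
    # Per-endpoint metric history with a simple retention policy.

    def __init__(self, retention_seconds: float):
        self.retention = retention_seconds
        self.samples = {}   # endpoint -> list of (timestamp, value)

    def record(self, endpoint: str, value: float) -> None:
        self.samples.setdefault(endpoint, []).append((time.time(), value))

    def last_known(self, endpoint: str) -> float:
        # Drop measurements that aged out, then return the latest
        # remaining value, or 0.0 if nothing fresh is left.
        cutoff = time.time() - self.retention
        fresh = [(ts, v) for ts, v in self.samples.get(endpoint, [])
                 if ts >= cutoff]
        self.samples[endpoint] = fresh
        return fresh[-1][1] if fresh else 0.0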

Database Restart

A database restart loses all the metrics collected from the peer instances. This data can be saved elsewhere, and a hot standby database can help in emergency cases like this. If only one database is used, it is reasonable to start with high peers' limiting factors when the database comes back, and then reduce the peers' limiting factor linearly with each new incoming time window after the restart. This logic can be implemented inside the database itself.

Complex database restarting reaction

Endpoints Restarted

When endpoints restart, they have had no exchanges with the database yet when new incoming connections arrive. Connections can either be passed by the restarted endpoint instance with no peer limiting factor at all, or the restart-related peers' limiting factors can be decreased linearly with each new incoming time window.

Endpoints restarting complex reaction

Implementation Notes

Algorithms can be implemented and controlled only if they can be tested well during development. This algorithm is quite complex: it is based on the states of endpoints and databases, which change over time, and it is challenging to write testable code bound to specific points in time.

Another question is which database can be used for this specific use case. It needs to be fast, to allow the service to scale. It needs to support complex behavior, to compute the aggregated peers' limiting factors for each connection. It needs to be transaction-oriented, to keep the data consistent across many parallel instance connections. Redis is a candidate that supports Lua scripts, which can implement all the mentioned computation. It is able to take the data from the instance side, compute the peers' limiting factor, and return it to the instance in one transaction. The aggregation script can be stored in Redis itself, so only a few parameters are transmitted per exchange. This forms a stable basis for parallel instance connections, with transaction orientation keeping the hosted data consistent.
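As a minimal sketch using the redis-py client: the key layout, TTL, and the pattern-based scan are illustrative simplifications (a production script would enumerate the instance keys explicitly rather than use KEYS):

import redis   # redis-py client

r = redis.Redis()

# Store this instance's window value and return the sum reported by
# the other instances, all in one atomic server-side script call.
peers_sum_script = r.register_script("""
local instance_key = KEYS[1]    -- e.g. "win:42:inst:7"
local peers_pattern = ARGV[1]   -- e.g. "win:42:inst:*"
local my_value = ARGV[2]
local ttl = tonumber(ARGV[3])

redis.call('SET', instance_key, my_value, 'EX', ttl)

local total = 0
for _, key in ipairs(redis.call('KEYS', peers_pattern)) do
    if key ~= instance_key then
        total = total + tonumber(redis.call('GET', key) or '0')
    end
end
return total
""")

# One round trip per window: report our value, get the peers' sum back.
peers_value = peers_sum_script(keys=["win:42:inst:7"],
                               args=["win:42:inst:*", 15, 120])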

The last question is how to support this service and check that it works with good quality. Quality measurement can be reduced to well-known algorithmic tasks on logs of users' connections and disconnections. From the log analysis, it is possible to compute the number of parallel connections at each moment with linear complexity. Next, it is possible to calculate all of the monopolization metric factors in a sliding window, also with linear complexity. Finally, this metric can be compared with the required parameters to check the quality of the algorithm's behavior.
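For instance, the parallel-connections part can be a simple linear sweep over a time-ordered log (the event format below is an assumption):

def max_parallel_connections(events):
    # `events` is an iterable of (timestamp, delta) pairs sorted by
    # time, where delta is +1 for a connection and -1 for a
    # disconnection; one pass gives the peak in linear time.
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak

# Example: two overlapping sessions peak at 2 parallel connections.
log = [(1, +1), (2, +1), (3, -1), (5, -1)]
assert max_parallel_connections(log) == 2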

Conclusion

Even though the overall design is quite complex, it is simple in its details, and it is possible to implement a stable, robust, and fault-tolerant user activity limiting algorithm for a distributed service. In the next part, we will look at practical aspects of supporting and scaling this solution.

