The Testing, Tools, and Frameworks Zone encapsulates one of the final stages of the SDLC as it ensures that your application and/or environment is ready for deployment. From walking you through the tools and frameworks tailored to your specific development needs to leveraging testing practices to evaluate and verify that your product or application does what it is required to do, this Zone covers everything you need to set yourself up for success.
Real-Object Detection at the Edge: AWS IoT Greengrass and YOLOv5
How to Test Multi-Threaded and Concurrent Java
Hi there! Occasionally, there arises a need for swift load testing, whether it be in a local environment or on a testing platform. Typically, such tasks are tackled using specialized tools that demand thorough prior comprehension. However, within enterprises and startups where rapid time-to-market and prompt hypothesis validation are paramount, excessive tool familiarization becomes a luxury. This article aims to spotlight developer-centric solutions that obviate the necessity for profound engagement, allowing for rudimentary testing without delving into pages of documentation.

Local Running
You should install:
Docker - all of the services and tools below run in containers.
Java 19+ - for the Kotlin service. You can also try Java 8; it should work, but you will have to change the Gradle settings.
Golang - for the gRPC test service.
Python 3+ - for Yandex.Tank.

Tech Requirements
Prior to embarking on our journey, it is advisable to generate a couple of services that can serve as illustrative examples for testing purposes.
Stack: Kotlin + WebFlux + R2DBC + Postgres
Our service has:
get all stocks (limit 10): GET /api/v1/stocks
get stock by name: GET /api/v1/stock?name=apple
save stock: POST /api/v1/stock
It is deliberately a simple service, because we have to focus on load testing.

Kotlin and the HTTP Service
Let's start by creating a small service with some basic logic inside. We'll prepare a model for this purpose:

Kotlin
@Table("stocks")
data class Stock(
    @field:Id
    val id: Long?,
    val name: String,
    val price: BigDecimal,
    val description: String
)

Simple router:

Kotlin
@Configuration
@EnableConfigurationProperties(ServerProperties::class)
class StockRouter(
    private val properties: ServerProperties,
    private val stockHandler: StockHandler
) {

    @Bean
    fun router() = coRouter {
        with(properties) {
            main.nest {
                contentType(APPLICATION_JSON).nest {
                    POST(save, stockHandler::save)
                }
                GET(find, stockHandler::find)
                GET(findAll, stockHandler::findAll)
            }
        }
    }
}

Handler:

Kotlin
@Service
class StockHandlerImpl(
    private val stockService: StockService
) : StockHandler {

    private val logger = KotlinLogging.logger {}

    private companion object {
        const val DEFAULT_SIZE = 10
        const val NAME_PARAM = "name"
    }

    override suspend fun findAll(req: ServerRequest): ServerResponse {
        logger.debug { "Processing find all request: $req" }
        val stocks = stockService.getAll(DEFAULT_SIZE)
        return ServerResponse.ok()
            .contentType(MediaType.APPLICATION_JSON)
            .body(stocks, StockDto::class.java)
            .awaitSingle()
    }

    override suspend fun find(req: ServerRequest): ServerResponse {
        logger.debug { "Processing find request: $req" }
        val name = req.queryParam(NAME_PARAM)
        return if (name.isEmpty) {
            ServerResponse.badRequest().buildAndAwait()
        } else {
            val stocks = stockService.find(name.get())
            ServerResponse.ok()
                .contentType(MediaType.APPLICATION_JSON)
                .body(stocks, StockDto::class.java)
                .awaitSingle()
        }
    }

    override suspend fun save(req: ServerRequest): ServerResponse {
        logger.debug { "Processing save request: $req" }
        val stockDto = req.awaitBodyOrNull(StockDto::class)
        return stockDto?.let { dto ->
            stockService.save(dto)
            ServerResponse
                .ok()
                .contentType(MediaType.APPLICATION_JSON)
                .bodyValue(dto)
                .awaitSingle()
        } ?: ServerResponse.badRequest().buildAndAwait()
    }
}

Full code here: GitHub

Create a Dockerfile:

Shell
FROM openjdk:20-jdk-slim
VOLUME /tmp
COPY build/libs/*.jar app.jar
ENTRYPOINT ["java", "-Dspring.profiles.active=stg", "-jar", "/app.jar"]

Then build a Docker image and run it:

Shell
docker build -t ere/stock-service .
docker run -p 8085:8085 ere/stock-service

But for now, it's better to stick with the idea of running everything through Docker containers and migrate our service into a Docker Compose setup.

YAML
version: '3.1'

services:
  db:
    image: postgres
    container_name: postgres-stocks
    ports:
      - "5432:5432"
    environment:
      POSTGRES_PASSWORD: postgres

  adminer:
    image: adminer
    ports:
      - "8080:8080"

  stock-service:
    image: ere/stock-service
    container_name: stock-service
    ports:
      - "8085:8085"
    depends_on:
      - db

Moving Forward
How can we proceed with testing? Specifically, how can we initiate a modest load test for our recently developed service? It's imperative that the testing solution is both straightforward to install and user-friendly. Given our time constraints, delving into extensive documentation and articles isn't a viable option. Fortunately, there's a viable alternative—enter Yandex.Tank. Yandex.Tank is a powerful instrument for testing and has important integrations with JMeter, but in this article, we will use it as a simple tool.
source: https://github.com/yandex/yandex-tank
docs: https://yandextank.readthedocs.org/en/latest/
Let's kick off by creating a folder for our tests. Once we've placed the configs and other essential files—fortunately, just a couple of them—we'll be all set. For our service, we need to test the "get-all" and "save" methods. The first config is for the find method:

YAML
phantom:
  address: localhost
  port: "8085"
  load_profile:
    load_type: rps
    schedule: line(100, 250, 30s)
  writelog: all
  ssl: false
  connection_test: true
  uris:
    - /api/v1/stocks
overload:
  enabled: false
telegraf:
  enabled: false
autostop:
  autostop:
    - time(1s,10s)      # if request average > 1s
    - http(5xx,100%,1s) # if 500 errors > 1s
    - http(4xx,25%,10s) # if 400 > 25%
    - net(xx,25,10)     # if the amount of non-zero net-codes in every second of the last 10s period is more than 25

Key settings for the configuration:
Address and port: same as our application.
Load test profile (load_profile): we'll use the 'line' type, ranging from 100 requests per second to 250 with a 30-second limit.
URIs: a list of URLs to be tested.
Autostop pattern: no need to stress-test if our service has already gone down!

Copy and paste the bash script (tank.sh):

Shell
docker run \
    -v $(pwd):/var/loadtest \
    --net="host" \
    -it yandex/yandex-tank

And run!
What will we see as a result? Yandex.Tank will log everything it deems worthy during the test. We can observe metrics such as the 99th percentile and requests per second (rps).

So, are we stuck with the terminal now? I want a GUI! Don't worry, Yandex.Tank has a solution for that too. We can utilize one of the overload plugins. Here's an example of how to add it:

YAML
overload:
  enabled: true
  package: yandextank.plugins.DataUploader
  job_name: "save docs"
  token_file: "env/token.txt"

We should add our token; just go to https://overload.yandex.net and log in with GitHub.

Okay, dealing with a GET request is straightforward, but what about POST? How do we structure the request? The thing is, you can't just throw the request into the tank; you need to create patterns (ammo) for it! What are these patterns? It's simple—you need to write a small script, which you can again fetch from the documentation and tweak a bit to suit our needs.
And we should add our own body and headers:

Python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys
import json

# http request with entity body template
req_template_w_entity_body = (
    "%s %s HTTP/1.1\r\n"
    "%s\r\n"
    "Content-Length: %d\r\n"
    "\r\n"
    "%s\r\n"
)

# phantom ammo template
ammo_template = (
    "%d %s\n"
    "%s"
)

method = "POST"
case = ""
headers = "Host: test.com\r\n" + \
          "User-Agent: tank\r\n" + \
          "Accept: */*\r\n" + \
          "Connection: Close\r\n"


def make_ammo(method, url, headers, case, body):
    """ makes phantom ammo """
    req = req_template_w_entity_body % (method, url, headers, len(body), body)
    return ammo_template % (len(req), case, req)


def generate_json():
    body = {
        "name": "content",
        "price": 1,
        "description": "description"
    }
    url = "/api/v1/stock"
    h = headers + "Content-type: application/json"
    s1 = json.dumps(body)
    ammo = make_ammo(method, url, h, case, s1)
    sys.stdout.write(ammo)
    f2 = open("ammo/ammo-json.txt", "w")
    f2.write(ammo)


if __name__ == "__main__":
    generate_json()

Result:

Plain Text
212 POST /api/v1/stock HTTP/1.1
Host: test.com
User-Agent: tank
Accept: */*
Connection: Close
Content-type: application/json
Content-Length: 61

{"name": "content", "price": 1, "description": "description"}

That's it! Just run the script, and we will have ammo-json.txt. Just set the new params in the config, and delete the uris:

YAML
phantom:
  address: localhost:9001
  ammo_type: phantom
  ammofile: ammo-json.txt

And run it one more time!

It's Time to Test gRPC!
Having acquainted ourselves with loading HTTP methods, it's natural to consider the scenario for gRPC. Are we fortunate enough to have an equally accessible tool for gRPC, akin to the simplicity of a tank? The answer is affirmative. Allow me to introduce you to "ghz." Just take a look: https://ghz.sh/
But before we do that, we should create a small service with Go and gRPC as a good test service. Prepare a small proto file:

ProtoBuf
syntax = "proto3";

option go_package = "stock-grpc-service/stocks";

package stocks;

service StocksService {
  rpc Save(SaveRequest) returns (SaveResponse) {}
  rpc Find(FindRequest) returns (FindResponse) {}
}

message SaveRequest {
  Stock stock = 1;
}

message SaveResponse {
  string code = 1;
}

message Stock {
  string name = 1;
  float price = 2;
  string description = 3;
}

message FindRequest {
  enum Type {
    INVALID = 0;
    BY_NAME = 1;
  }
  message ByName {
    string name = 1;
  }
  Type type = 1;
  oneof body {
    ByName by_name = 2;
  }
}

message FindResponse {
  Stock stock = 1;
}

And generate it! (Also, we should install protoc.)

Shell
protoc --go_out=. --go_opt=paths=source_relative --go-grpc_out=. --go-grpc_opt=paths=source_relative stocks.proto

The result: the generated stocks.pb.go and stocks_grpc.pb.go stubs.

Coding Time!
Next steps: create the service as fast as we can.
Create the dto (stock entity for the DB layer):

Go
package models

// Stock – base dto
type Stock struct {
	ID          *int64  `json:"Id"`
	Price       float32 `json:"Price"`
	Name        string  `json:"Name"`
	Description string  `json:"Description"`
}

Implement the server:

Go
// Server is used to implement stocks.UnimplementedStocksServiceServer.
type Server struct {
	pb.UnimplementedStocksServiceServer
	stockUC stock.UseCase
}

// NewStockGRPCService stock gRPC service constructor
func NewStockGRPCService(emailUC stock.UseCase) *Server {
	return &Server{stockUC: emailUC}
}

func (e *Server) Save(ctx context.Context, request *stocks.SaveRequest) (*stocks.SaveResponse, error) {
	model := request.Stock
	stockDto := &models.Stock{
		ID:          nil,
		Price:       model.Price,
		Name:        model.Name,
		Description: model.Description,
	}
	err := e.stockUC.Create(ctx, stockDto)
	if err != nil {
		return nil, err
	}
	return &stocks.SaveResponse{Code: "ok"}, nil
}

func (e *Server) Find(ctx context.Context, request *stocks.FindRequest) (*stocks.FindResponse, error) {
	code := request.GetByName().GetName()
	model, err := e.stockUC.GetByID(ctx, code)
	if err != nil {
		return nil, err
	}
	response := &stocks.FindResponse{Stock: &stocks.Stock{
		Name:        model.Name,
		Price:       model.Price,
		Description: model.Description,
	}}
	return response, nil
}

Full code here.

Test It!
Install ghz with brew (as usual). Let's check a simple example here. Now, we should change it a little bit:
Move to the folder with the proto files.
Add the method: stocks.StocksService.Save.
Add a simple body: {"stock": { "name": "APPL", "price": "1.3", "description": "apple stocks"} }
10 connections will be shared among 20 goroutine workers. Each pair of 2 goroutines will share a single connection.
Set the service's port.
Putting it all together:

Shell
cd .. && cd stock-grpc-service/proto
ghz --insecure \
  --proto ./stocks.proto \
  --call stocks.StocksService.Save \
  -d '{"stock": { "name":"APPL", "price": "1.3", "description": "apple stocks"} }' \
  -n 2000 \
  -c 20 \
  --connections=10 \
  0.0.0.0:5007

Run it!

Plain Text
Summary:
  Count:        2000
  Total:        995.93 ms
  Slowest:      30.27 ms
  Fastest:      3.11 ms
  Average:      9.19 ms
  Requests/sec: 2008.16

Response time histogram:
  3.111  [1]   |
  5.827  [229] |∎∎∎∎∎∎∎∎∎∎∎
  8.542  [840] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  11.258 [548] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  13.973 [190] |∎∎∎∎∎∎∎∎∎
  16.689 [93]  |∎∎∎∎
  19.405 [33]  |∎∎
  22.120 [29]  |∎
  24.836 [26]  |∎
  27.551 [6]   |
  30.267 [5]   |

Latency distribution:
  10 % in 5.68 ms
  25 % in 6.67 ms
  50 % in 8.27 ms
  75 % in 10.49 ms
  90 % in 13.88 ms
  95 % in 16.64 ms
  99 % in 24.54 ms

Status code distribution:
  [OK] 2000 responses

And what, stare at everything in the terminal again? No, with ghz, you can also generate a report, but unlike Yandex, it will be generated locally and can be opened in the browser. Just set it:

Shell
ghz --insecure -O html -o reports_find.html

-O html → output format; -o → output file name.

Conclusion
In summary, when you need a swift assessment of your service's ability to handle a load of 100+ requests per second or identify potential weaknesses, there's no need to initiate intricate processes involving teams, seeking assistance from AQA, or relying on the infrastructure team. More often than not, developers have capable laptops and computers that can execute a small load test. So, go ahead and give it a shot—save yourself some time!
I trust you found this brief article beneficial.
Valuable documentation I recommend reading, just in case you need more:
Yandex.Tank docs
Yandex.Tank GitHub
Yandex.Tank settings
ghz official page
ghz config: link
May the Force Be With You! Thanks once again, and best of luck!
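A small addendum: the HTML report above was written to reports_find.html, and the Find RPC can be exercised with ghz in exactly the same way as Save. Below is a hedged sketch of what that call might look like; the JSON shape for the oneof body (type plus by_name) is my reading of the proto above, not a command taken from the original write-up, so adjust it to your generated stubs if needed.

Shell
cd stock-grpc-service/proto
ghz --insecure \
  --proto ./stocks.proto \
  --call stocks.StocksService.Find \
  -d '{"type": "BY_NAME", "by_name": {"name": "APPL"}}' \
  -n 2000 \
  -c 20 \
  --connections=10 \
  -O html -o reports_find.html \
  0.0.0.0:5007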
This article is a follow-up to the article that lays the theoretical foundation for software requirement qualities. Here, I provide an example for how to craft requirements for a User Authentication Login Endpoint. A practical illustration of how essential software requirement qualities can be interwoven when designing specifications for AI-generated code. I demonstrate the crucial interplay between explicitness (to achieve completeness), unambiguity (for machine-first understandability), constraint definition (to guide implementation and ensure viability), and testability (through explicit acceptance criteria). We'll explore how these qualities can practically be achieved through structured documentation. Our goal is that our AI assistant has a clear, actionable blueprint for generating a secure and functional login service. For explanatory purposes and to make clear how things work, I will provide a detailed requirements document. A blueprint that is by no means exhaustive, but it can serve as the basis for understanding and expanding. Documentation can be lightweight in practice, but this article must focus on details to avoid confusion. The document starts by stating the requirement ID and title. A feature's description follows, along with its functional and non-functional requirements. Data definitions, implementation constraints, acceptance criteria, and error handling fundamentals are also documented. Requirement Document: User Authentication - Login Endpoint 1. Requirement ID and Title Unique IDs are crucial for traceability, allowing you to link this specific requirement to design documents, generated code blocks, and test cases. This helps in maintenance and debugging. ID: REQ-AUTH-001Title: User Login Endpoint 2. Feature Description The feature description provides a high-level overview and context. For AI, this helps establish the overall goal before diving into specifics. It answers the "what" at a broad level. This feature provides an API endpoint for registered users to authenticate themselves using their email address and password. Successful authentication will grant access by providing a session token. 3. Functional Requirements (FR) Functional requirements are broken down into atomic, specific statements. Keywords like MUST, SHOULD (though only MUST is used here for strictness) can follow RFC 2119 style, which AI-assistants can be trained to recognize. "Case-insensitive search," "structurally valid email format," and specific counter actions (increment, reset) leave little room for AI misinterpretation. This promotes unambiguity and precision. Details like checking if an account is disabled and the account lockout mechanism (FR11) cover crucial edge cases and security aspects, aiming for explicitness and completeness. 
FR1: The system MUST expose an HTTPS POST endpoint at /api/v1/auth/login.
FR2: The endpoint MUST accept a JSON payload containing email (string) and password (string).
FR3: The system MUST validate the provided email:
FR3.1: It MUST be a non-empty string.
FR3.2: It MUST be a structurally valid email format (e.g., user@example.com).
FR4: The system MUST validate the provided password:
FR4.1: It MUST comply with a strong password policy.
FR5: If input validation (FR3, FR4) fails, the system MUST return an error (see Error Handling EH1).
FR6: The system MUST retrieve the user record from the Users database table based on the provided email.
FR7: If no user record is found for the email, or if the user account is marked as disabled, the system MUST return an authentication failure error (see Error Handling EH2).
FR8: If a user record is found and the account is active, the system MUST verify the provided password against the stored hashed password for the user using the defined password hashing algorithm (see IC3: Security).
FR9: If password verification fails, the system MUST increment a failed_login_attempts counter for the user and return an authentication failure error (see Error Handling EH2).
FR10: If password verification is successful:
FR10.1: The system MUST reset the failed_login_attempts counter for the user to 0.
FR10.2: The system MUST generate a JSON Web Token (JWT) (see IC3: Security for JWT specifications).
FR10.3: The system MUST return a success response containing the JWT (see Data Definitions - Output).
FR11: Account lockout: If failed_login_attempts for a user reaches 5, their account MUST be temporarily locked for 15 minutes. Attempts to log in to a locked account MUST return an account locked error (see Error Handling EH3), even with correct credentials.

4. Data Definitions
Clearly defining data definitions (schemas) for inputs and outputs is critical for AI to generate correct data validation, serialization, and deserialization logic. Using terms like "string, required, format: email" helps the AI map to data types and validation rules (e.g., when using Pydantic models). This contributes to Structured Input.
Input Payload (JSON):
email (string, required, format: email)
password (string, required, minLength: 1)
Success Output (JSON, HTTPS 200):
access_token (string, JWT format)
token_type (string, fixed value: "Bearer")
expires_in (integer, seconds, representing token validity duration)
Error Output (JSON, specific HTTPS status codes - see Error Handling):
error_code (string, e.g., "INVALID_INPUT", "AUTHENTICATION_FAILED", "ACCOUNT_LOCKED")
message (string, human-readable error description)

5. Non-Functional Requirements (NFRs)
NFRs reduce ambiguity, guide code generation toward aligned behaviors, and make the resulting software easier to verify against clearly defined benchmarks. They make qualities like performance and security testable and unambiguous. Specific millisecond targets and load conditions are set. Also, as an example, specific actions (no password logging, input sanitization) and references to further constraints (IC3) are provided.
NFR1 (Performance): The average response time for the login endpoint MUST be less than 300ms under a load of 100 concurrent requests. P99 response time MUST be less than 800ms.
NFR2 (Security): All password handling must adhere to security constraints specified in IC3. No sensitive information (passwords) should be logged.
Input sanitization must be performed to prevent common injection attacks.NFR3 (Auditability): Successful and failed login attempts MUST be logged to the audit trail with timestamp, user email (for failed attempts, if identifiable), source IP address, and success/failure status. Failed attempts should include the specific failure reason (e.g., "user_not_found," "incorrect_password," "account_locked"). 6. Implementation Constraints and Guidance (IC) This section guides the AI's choices (Python/FastAPI, SQLAlchemy, Pydantic, bcrypt, JWT structure) without dictating the exact low-level code. For the purposes of this article, these specific choices are random and are not considered to be optimal in any sense. You are free to choose your own tech stack, architectural patterns, etc. Implementation constraints can guide towards Viability within the project's ecosystem and to meet specific security and architectural requirements. Also, it should be mentioned that the constraints shown are indicative and are by no means exhaustive. Currently, it depends on the specific AI assistant and the project under development, which constraints are more appropriate. Will there be AI assistants that develop code perfectly without constraints and guidance from humans? It remains to be seen. IC1 (Technology Stack): Backend Language/Framework: Python 3.11+ / FastAPI.Data Validation: Pydantic models derived from Data Definitions.Database Interaction: Use SQLAlchemy ORM with the existing project database session configuration. Target table: Users.IC2 (Architectural Pattern): Logic should be primarily contained within a dedicated AuthenticationService class. The API endpoint controller should delegate to this service.IC3 (Security - Password and Token): Password Hashing: Stored passwords MUST be hashed using bcrypt with a work factor of 12.JWT Specifications: Algorithm: HS256.Secret Key: Retrieved from environment variable JWT_SECRET_KEY.Payload Claims: MUST include sub (user_id), email, exp (expiration time), iat (issued at).Expiration: Tokens MUST expire 1 hour after issuance.IC4 (Environment): The service will be deployed as a Docker container. Configuration values (like JWT_SECRET_KEY, database connection string) MUST be configurable via environment variables.IC5 (Coding Standards): Adhere to PEP 8 style guide.All functions and methods MUST include type hints.All public functions/methods MUST have docstrings explaining purpose, arguments, and return values. 7. Acceptance Criteria (AC - Gherkin Format) Acceptance criteria make the requirements Testable. Gherkin is an example format that is human-readable and structured. A behaviour-driven development tool that can also be used for AI assistants to derive specific test cases. It can cover happy paths and key error/edge cases, providing concrete examples of expected behavior. This gives clear verification targets for the AI-generated code. 
Plain Text
Feature: User Login API Endpoint

  Background:
    Given a user "user@example.com" exists with a bcrypt hashed password for "ValidPassword123"
    And the user account "user@example.com" is not disabled
    And the user "user@example.com" has 0 failed_login_attempts
    And the JWT_SECRET_KEY environment variable is set

  Scenario: Successful Login with Valid Credentials
    When a POST request is made to "/api/v1/auth/login" with JSON body:
      """
      { "email": "user@example.com", "password": "ValidPassword123" }
      """
    Then the response status code should be 200
    And the response JSON should contain an "access_token" (string)
    And the response JSON should contain "token_type" with value "Bearer"
    And the response JSON should contain "expires_in" with value 3600
    And the "access_token" should be a valid JWT signed with HS256 containing "sub", "email", "exp", "iat" claims
    And the failed_login_attempts for "user@example.com" should remain 0

  Scenario: Login with Invalid Password
    When a POST request is made to "/api/v1/auth/login" with JSON body:
      """
      { "email": "user@example.com", "password": "InvalidPassword" }
      """
    Then the response status code should be 401
    And the response JSON should contain "error_code" with value "AUTHENTICATION_FAILED"
    And the response JSON should contain "message" with value "Invalid email or password."
    And the failed_login_attempts for "user@example.com" should be 1

  Scenario: Login with Non-Existent Email
    When a POST request is made to "/api/v1/auth/login" with JSON body:
      """
      { "email": "unknown@example.com", "password": "AnyPassword" }
      """
    Then the response status code should be 401
    And the response JSON should contain "error_code" with value "AUTHENTICATION_FAILED"
    And the response JSON should contain "message" with value "Invalid email or password."

  Scenario: Account Lockout after 5 Failed Attempts
    Given the user "user@example.com" has 4 failed_login_attempts
    When a POST request is made to "/api/v1/auth/login" with JSON body: # This is the 5th failed attempt
      """
      { "email": "user@example.com", "password": "InvalidPasswordAgain" }
      """
    Then the response status code should be 403
    And the response JSON should contain "error_code" with value "ACCOUNT_LOCKED"
    And the response JSON should contain "message" with value "Account is temporarily locked due to too many failed login attempts."
    And the failed_login_attempts for "user@example.com" should be 5

8. Error Handling (EH)
This dedicated error handling section ensures Completeness by explicitly defining how different failure scenarios are communicated. To improve completeness, we need to extensively cover edge cases and error handling, and specify exactly how different errors (validation errors, system errors, network errors) should be caught, logged, and communicated to the user (specific error messages, codes).
EH1 (Invalid Input):
Trigger: FR3 or FR4 fails.
HTTPS Status: 400 Bad Request.
Response Body: { "error_code": "INVALID_INPUT", "message": "Invalid input. Email must be valid and password must not be empty." } (Example message, could be more specific based on which field failed.)
EH2 (Authentication Failure):
Trigger: FR7 or FR9 occurs.
HTTPS Status: 401 Unauthorized.
Response Body: { "error_code": "AUTHENTICATION_FAILED", "message": "Invalid email or password."
} (Generic message to prevent user enumeration).EH3 (Account Locked): Trigger: Attempt to log in to an account that is locked per FR11.HTTP Status: 403 Forbidden.Response Body: { "error_code": "ACCOUNT_LOCKED", "message": "Account is temporarily locked due to too many failed login attempts." } Final Remarks The dual purpose. The example User Authentication Login Endpoint requirement is carefully chosen so that it can be used for two purposes. The first is to explain the basic qualities of software requirements, irrespective of who writes the code (a human or AI). The second purpose is to focus on AI-assisted code and how to use requirements to our advantage.Examples used are not exhaustive. All data and examples presented in the eight paragraphs, from requirement ID and title to error handling, are indicative. Many more functional/non-functional requirements can be crafted, as well as data definitions. Acceptance criteria and error handling cases are a minimal sample of what is usually needed in practice. Negative constraints (don't use Z, avoid pattern A), for example, are not provided here but can be very beneficial as well. And of course, you may find that there are other paragraphs, beyond the scope of this article, that are tailored to your documentation needs.Documentation is not static. For clarity and completeness in this article, the documentation for the User Authentication Login Endpoint seems to be static. All is well specified upfront and then fed to the AI-assistant that does the job for us. Although a detailed document can be a good starting point, factors like implementation constraints and guidance can be fully interactive. For AI assistants, for example, with sophisticated chat interfaces, a "dialogue" with AI can be an important part of the process. While initial implementation constraints can be vital, some constraints might be refined or even discovered through interaction with the AI. Wrapping Up I provided a requirements document for a User Authentication Login Endpoint requirement. This example document attempts to be explicit, precise, and constrained. Software requirements must necessarily be viable whilst eminently testable. It's structured to provide an AI code generator with sufficient detail to minimize guesswork and the chances of the AI producing undesirable output. While AI code assistants will probably be more capable and context-aware, the fundamental need for human-defined guidance appears to remain. Guiding an AI assistant for software development could be embedded in project templates. It could be via custom AI assistant configurations (if available), or even as part of a "system prompt" that always precedes specific task prompts. A dynamic set of principles that inform an ongoing interaction with AI can be based on the following. Initial scaffolding: We provide the critical initial direction, ensuring the AI starts on the right path aligned with project standards, architecture, and non-negotiable requirements (especially security).Basis for interaction: Our documentation becomes the foundation for interactive refinement. 
When the AI produces output, it can be evaluated against our documented requirements.Evolving knowledge base: As the project progresses, parts of our documentation can be updated, or new ones added, reflecting new decisions or learnings.Guardrails for AI autonomy: As AIs gain more autonomy in suggesting larger code blocks or even architectural components, such documents can act as essential guardrails, ensuring their "creativity" stays within acceptable project boundaries.
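To ground the blueprint in something executable, here is a minimal illustrative sketch, not part of the original REQ-AUTH-001 document, showing how the Data Definitions and the IC1/IC3 constraints might surface as Pydantic models and a FastAPI route skeleton. The AuthenticationService call is deliberately left as a hypothetical stub, and names such as auth_service are assumptions for the example.

Python
# Hypothetical sketch of the REQ-AUTH-001 data contracts and endpoint shape.
# Assumes FastAPI + Pydantic per IC1 (EmailStr requires the email-validator package);
# the AuthenticationService of IC2 is only stubbed here.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, EmailStr, Field

app = FastAPI()

class LoginRequest(BaseModel):
    email: EmailStr                           # FR3: non-empty, structurally valid email
    password: str = Field(..., min_length=1)  # FR4: non-empty; strength policy checked in the service

class LoginResponse(BaseModel):
    access_token: str          # JWT per IC3 (HS256, sub/email/exp/iat claims)
    token_type: str = "Bearer"
    expires_in: int = 3600     # IC3: tokens expire 1 hour after issuance

class ErrorResponse(BaseModel):
    error_code: str            # e.g. "INVALID_INPUT", "AUTHENTICATION_FAILED", "ACCOUNT_LOCKED"
    message: str

@app.post("/api/v1/auth/login", response_model=LoginResponse)
def login(payload: LoginRequest) -> LoginResponse:
    """FR1/FR2: accept the JSON payload; delegate the real work to an AuthenticationService (IC2)."""
    # token = auth_service.authenticate(payload.email, payload.password)  # hypothetical call
    raise HTTPException(status_code=501, detail="Not implemented in this sketch")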
Software development teams and professionals are increasingly adopting vibe coding as their preferred approach. Vibe coding involves creating software through instinctual coding methods and minimal planning to achieve quick prototyping or making solutions work immediately. While vibe coding can spark creativity and speed up early development, it usually comes at the cost of security, maintainability, and reliability. This article analyzes the security vulnerabilities of vibe coding and provides essential guidance for developers and organizations to minimize these risks while preserving innovative processes. What Is “Vibe Coding,” Exactly? Vibe coding lacks formal status as a methodology but serves as cultural shorthand. Coding without specifications or architectural planning.Developers sometimes bypass code reviews and testing procedures to expedite product delivery by shipping their applications.Developers depend on Stack Overflow, GitHub Copilot, or ChatGPT excessively while lacking comprehension.Developers often choose to release operational code instead of ensuring it meets the standards of security and scalability or providing documentation. Vibe coding relies heavily on AI tools to generate, refine, and debug code, enabling rapid application iteration and deployment with minimal manual effort. For non-coders dreaming of creating their apps, AI offers an enticing gateway to turn ideas into reality, even profitable ones. However, without professional developer review, AI-generated code can introduce dangerous security vulnerabilities, performance bottlenecks, and critical errors that undermine your entire project. It’s fun, fast, and chaotic, but it’s also a minefield of security vulnerabilities. The Hidden Security Risks of Vibe Coding and Mitigation Strategies Here's the catch-22 with AI: it won't alert you to security vulnerabilities you don't know exist. Think about it — how can you secure systems you don't fully grasp? And if AI built it, chances are you don't truly understand its inner workings. 1. Hardcoded Secrets and Credentials For convenience, vibe coders habitually insert API keys, database passwords, or tokens as plain text within their source code. GitGuardian's alarming report reveals a critical security crisis: 24 million secrets leaked on GitHub last year alone. More troubling still, repositories using AI coding tools, which many developers now rely on, experienced a 40% higher exposure rate. This dangerous trend demands immediate attention from development teams everywhere. Risks Exposure in public repos or error logs.Easy targets for attackers scanning GitHub. Mitigation Use environment variables or a secure secrets management system (e.g., AWS Secrets Manager, Vault).Implement secret scanning tools like GitGuardian or truffleHog. 2. Lack of Input Validation and Sanitization When developers use improvised coding methods, they tend to ignore basic hygiene practices such as user input validation, which results in SQL Injection and other serious vulnerabilities like XSS and Command Injection. Risks Code or commands can be executed from user-supplied input.Data leaks, defacement, or remote access. Mitigation Always validate and sanitize inputs using frameworks (e.g., Joi for Node.js, Marshmallow for Python).Use ORM libraries to prevent SQL injection. 3. Insecure Use of Third-party Libraries Developers often quickly implement solutions by installing an NPM package or Python module from a blog without verifying its security credentials. 
Risks
Supply chain attacks.
Malware hidden in typosquatting libraries (e.g., requests vs requestr).
Mitigation
Use tools like OWASP Dependency-Check or npm audit.
Lock versions using package-lock.json, poetry.lock, or pip-tools.

4. Improper Authentication and Authorization
Developers frequently throw authentication logic together too quickly, leading to weak token handling, missed session expiration, and skipped role verification.
Risks
Privilege escalation.
Account takeover or horizontal access control issues.
Mitigation
Use industry-tested authentication libraries (e.g., OAuth2.0 via Auth0, Firebase Auth).
Implement RBAC or ABAC strategies and avoid custom auth logic.

5. Missing or Insecure Logging
The urgency to "just make it work" leads developers to either neglect logging or to log sensitive data.
Risks
Logs may leak PII, passwords, or tokens.
Lack of traceability during incident response.
Mitigation
Use centralized log systems (e.g., ELK stack, Datadog).
Mask sensitive data and ensure proper log rotation.

6. No Security Testing or Code Reviews
Code written on vibes is rarely peer-reviewed or subjected to security testing, leaving glaring vulnerabilities undetected.
Risks
Vulnerabilities stay hidden until exploited.
One developer's mistake could compromise the whole application.
Mitigation
Automate security testing (SAST, DAST).
Enforce code reviews and integrate Git hooks or CI pipelines with tools like SonarQube or GitHub Actions.

How to Keep the Vibe and Still Ship Secure Code
Security Champions: Appoint team members to advocate for secure practices even during fast-paced dev.
Secure Defaults: Use templates and boilerplates that include basic security setup.
Automated Linting and Testing: Add ESLint, Bandit, or security linters to your CI/CD.
Threat Modeling Lite: Do a 10-minute risk brainstorm before coding a feature.
Security as Culture: Teach developers how security adds value, not just overhead.

Conclusion
Vibe coding delivers a fast-paced and enjoyable development experience that unlocks creative freedom. Excessive vibe coding, however, generates critical security vulnerabilities that can damage your team's reputation and lead to lost customer trust. By implementing lightweight security measures and automated checks within your development process, you keep the creative advantages of vibe coding while safeguarding both your application and its users. Even when coding with good vibes, developers must maintain responsible coding practices.

Tools You Should Bookmark
OWASP Top 10 – https://owasp.org/www-project-top-ten/
Semgrep – https://semgrep.dev/
GitGuardian – https://www.gitguardian.com/
Snyk – https://snyk.io/
OWASP Dependency-Check – https://owasp.org/www-project-dependency-check/

Got questions or want to share your worst vibe-coding disaster? Drop a comment below.
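To make the first two mitigations above concrete, here is a small illustrative Python sketch (the schema and variable names are invented for the example) that reads a credential from the environment instead of hardcoding it, and validates untrusted input with Marshmallow before it goes anywhere near a database:

Python
# Illustrative sketch only: applying mitigations 1 and 2 from the sections above.
import os

from marshmallow import Schema, ValidationError, fields, validate

# Mitigation 1: read credentials from the environment (or a secrets manager),
# never from a hardcoded string in the source tree.
DB_PASSWORD = os.environ.get("DB_PASSWORD")
if not DB_PASSWORD:
    raise RuntimeError("DB_PASSWORD is not set; refusing to start without a credential")

# Mitigation 2: validate and constrain untrusted input before using it.
class CommentSchema(Schema):
    email = fields.Email(required=True)
    comment = fields.String(required=True, validate=validate.Length(min=1, max=500))

def parse_comment(payload: dict) -> dict:
    """Return cleaned data or raise ValidationError with per-field messages."""
    return CommentSchema().load(payload)

if __name__ == "__main__":
    try:
        print(parse_comment({"email": "not-an-email", "comment": ""}))
    except ValidationError as err:
        print("Rejected input:", err.messages)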
I keep finding myself in conversations with family and friends asking, “Is AI coming for our jobs?” Which roles are getting Thanos-snapped first? And will there still be space for junior individual contributors in organizations? And many more. With so many conflicting opinions, I felt overwhelmed and anxious, so I decided to take action instead of staying stuck in uncertainty. So, I began collecting historical data and relevant facts to gain a clearer understanding of the direction and impact of the current AI surge. So, Here’s What We Know Microsoft reports that over 30% of the code on GitHub Copilot is now AI-generated, highlighting a shift in how software is being developed. Major tech companies — including Google, Meta, Amazon, and Microsoft — have implemented widespread layoffs over the past 18–24 months. Current generative AI models, like GPT-4 and CodeWhisperer, can reliably write functional code, particularly for standard, well-defined tasks.Productivity gains: Occupations in which many tasks can be performed by AI are experiencing nearly five times higher growth in productivity than the sectors with the least AI adoption.AI systems still require a human “prompt” or input to initiate the thinking process. They do not ideate independently or possess genuine creativity — they follow patterns and statistical reasoning based on training data.Despite rapid progress, today’s AI is still far from achieving human-level general intelligence (AGI). It lacks contextual awareness, emotional understanding, and the ability to reason abstractly across domains without guidance or structured input.Job displacement and creation: The World Economic Forum's Future of Jobs Report 2025 reveals that 40% of employers expect to reduce their workforce where AI can automate tasks.And many more. There’s a lot of conflicting information out there, making it difficult to form a clear picture. With so many differing opinions, it's important to ground the discussion in facts. So, let’s break it down from a data engineer’s point of view — by examining the available data, identifying patterns, and drawing insights that can help us make sense of it all. Navigating the Noise Let’s start with the topic that’s on everyone’s mind — layoffs. It’s the most talked-about and often the most concerning aspect of the current tech landscape. Below is a trend analysis based on layoff data collected across the tech industry. Figure 1: Layoffs (in thousands) over time in tech industries Although the first AI research boom began in the 1980s, the current AI surge started in the late 2010s and gained significant momentum in late 2022 with the public release of OpenAI's ChatGPT. The COVID-19 pandemic further complicated the technological landscape. Initially, there was a hiring surge to meet the demands of a rapidly digitizing world. However, by 2023, the tech industry experienced significant layoffs, with over 200,000 jobs eliminated in the first quarter alone. This shift was attributed to factors such as economic downturns, reduced consumer demand, and the integration of AI technologies. Since then, as shown in Figure 1, layoffs have continued intermittently, driven by various factors including performance evaluations, budget constraints, and strategic restructuring. For instance, in 2025, companies like Microsoft announced plans to lay off up to 6,800 employees, accounting for less than 3% of its global workforce, as part of an initiative to streamline operations and reduce managerial layers. 
Between 2024 and early 2025, the tech industry experienced significant workforce reductions. In 2024 alone, approximately 150,000 tech employees were laid off across more than 525 companies, according to data from the US Bureau of Labor Statistics. The trend has continued into 2025, with over 22,000 layoffs reported so far this year, including a striking 16,084 job cuts in February alone, highlighting the ongoing volatility in the sector. It really makes me think — have all these layoffs contributed to the rise in the US unemployment rate? And has the number of job openings dropped too? I think it’s worth taking a closer look at these trends. Figure 2: Employment and unemployment counts in the US from JOLTS DB Figure 2 illustrates employment and unemployment trends across all industries in the United States. Interestingly, the data appear relatively stable over the past few years, which raises some important questions. If layoffs are increasing, where are those workers going? And what about recent graduates who are still struggling to land their first jobs? We’ve talked about the layoffs — now let’s explore where those affected are actually going. While this may not reflect every individual experience, here’s what the available online data reveals. After the Cuts Well, I wondered if the tech job openings have decreased as well? Figure 3: Job openings over the years in the US Even with all the news about layoffs, the tech job market isn’t exactly drying up. As of May 2025, there are still around 238,000 open tech positions across startups, unicorns, and big-name public companies. Just back in December 2024, more than 165,000 new tech roles were posted, bringing the total to over 434,000 active listings that month alone. And if we look at the bigger picture, the US Bureau of Labor Statistics expects an average of about 356,700 tech job openings each year from now through 2033. A lot of that is due to growth in the industry and the need to replace people leaving the workforce. So yes — while things are shifting, there’s still a strong demand for tech talent, especially for those keeping up with evolving skills. With so many open positions still out there, what’s causing the disconnect when it comes to actually finding a job? New Wardrobe for Tech Companies If those jobs are still out there, then it’s worth digging into the specific skills companies are actually hiring for. Recent data from LinkedIn reveals that job skill requirements have shifted by approximately 25% since 2015, and this pace of change is accelerating, with that number expected to double by 2027. In other words, companies are now looking for a broader and more updated set of skills than what may have worked for us over the past decade. Figure 4: Skill bucket The graph indicates that technical skills remain a top priority, with 59% of job postings emphasizing their importance. In contrast, soft skills appear to be a lower priority, mentioned in only 46% of listings, suggesting that companies are still placing greater value on technical expertise in their hiring criteria. Figure 5: AI skill requirement in the US Focusing specifically on the comparison between all tech jobs and those requiring AI skills, a clear trend emerges. As of 2025, around 19% to 25% of tech job postings now explicitly call for AI-related expertise — a noticeable jump from just a few years ago. This sharp rise reflects how deeply AI is becoming embedded across industries. 
In fact, nearly one in four new tech roles now list AI skills as a core requirement, more than doubling since 2022. Figure 6: Skill distribution in open jobs Python remains the most sought-after programming language in AI job postings, maintaining its top position from previous years. Additionally, skills in computer science, data analysis, and cloud platforms like Amazon Web Services have seen significant increases in demand. For instance, mentions of Amazon Web Services in job postings have surged by over 1,778% compared to data from 2012 to 2014. While the overall percentage of AI-specific job postings is still a small fraction of the total, the upward trend underscores the growing importance of AI proficiency in the modern workforce.
Final Thought I recognize that this analysis is largely centered on the tech industry, and the impact of AI can look very different across other sectors. That said, I'd like to leave you with one final thought: technology will always evolve, and the real challenge is how quickly we can evolve with it before it starts to leave us behind. We've seen this play out before. In the early 2000s, when data volumes were manageable, we relied on database developers. But with the rise of IoT, the scale and complexity of data exploded, and we shifted toward data warehouse developers, skilled in tools like Hadoop and Spark. Fast forward to the 2010s and beyond, and we've entered the era of AI and data engineers — those who can manage the scale, variety, and velocity of data that modern systems demand. We've adapted before — and we've done it well. But what makes this AI wave different is the pace. This time, we need to adapt faster than we ever have in the past.
Recently, my team encountered a critical production issue in which Apache Airflow tasks were getting stuck in the "queued" state indefinitely. As someone who has worked extensively with the Airflow scheduler, I've handled my share of DAG failures, retries, and scheduler quirks, but this particular incident stood out both for its technical complexity and the organizational coordination it demanded.

The Symptom: Tasks Stuck in Queued
It began when one of our business-critical Directed Acyclic Graphs (DAGs) failed to complete. Upon investigation, we discovered several tasks were stuck in the "queued" state — not running, failing, or retrying, just permanently queued.

First Steps: Isolating the Problem
A teammate and I immediately began our investigation with the fundamental checks:
Examined Airflow UI logs: Nothing unusual beyond standard task submission entries
Reviewed scheduler and worker logs: The scheduler was detecting the DAGs, but nothing was reaching the workers
Confirmed worker health: All Celery workers showed as active and running
Restarted both scheduler and workers: Despite this intervention, tasks remained stubbornly queued

Deep Dive: Uncovering a Scheduler Bottleneck
We soon suspected a scheduler issue. We observed that the scheduler was queuing tasks but not dispatching them. This led us to investigate:
Slot availability across workers
Message queue health (RabbitMQ in our environment)
Heartbeat communication logs
We initially hypothesized that the scheduler machine might be overloaded by its dual responsibility of scheduling tasks and parsing DAG files, so we increased the min_file_process_interval to 2 mins. While this reduced CPU utilization by limiting how frequently the scheduler parsed DAG files, it didn't resolve our core issue — tasks remained stuck in the queued state.
After further research, we discovered that our Airflow version (2.2.2) contained a known issue causing tasks to become trapped in the queued state under specific scheduler conditions. This bug was fixed in Airflow 2.6.0, with the solution documented in PR #30375. However, upgrading wasn't feasible in the short term. The migration from 2.2.2 to 2.6.0 would require extensive testing, custom plugin adjustments, and deployment pipeline modifications — none of which could be implemented quickly without disrupting other priorities.

Interim Mitigations and Configuration Optimizations
While working on the backported fix, we implemented several tactical measures to stabilize the system:
Increased parsing_processes to 8 to parallelize and improve the DAG parsing time
Increased scheduler_heartbeat_sec to 30s and increased min_file_process_interval to 120s (up from the default setting of 30s) to reduce scheduler load
Implemented continuous monitoring to ensure tasks were being processed appropriately
We also deployed a temporary workaround using a script referenced in this GitHub comment. This script forcibly transitions tasks from queued to running state. We scheduled it via a cron job with an additional filter targeting only task instances that had been queued for more than 10 minutes. This approach provided temporary relief while we finalized our long-term solution.
However, we soon discovered limitations with the cron job. While effective for standard tasks that could eventually reach completion once moved from queued to running, it was less reliable for sensor-related tasks. After being pushed to running state, sensor tasks would often transition to up_for_reschedule and then back to queued, becoming stuck again.
This required the cron job to repeatedly advance these tasks, essentially functioning as an auxiliary scheduler. We suspect this behavior stems from inconsistencies between the scheduler's in-memory state and the actual task states in the database. This unintentionally made our cron job responsible for orchestrating part of the sensor lifecycle — clearly not a sustainable solution. The Fix: Strategic Backporting After evaluating our options, we decided to backport the specific fix from Airflow 2.6.0 to our existing 2.2.2 environment. This approach allowed us to implement the necessary correction without undertaking a full upgrade cycle. We created a targeted patch by cherry-picking the fix from the upstream PR and applying it to our forked version of Airflow. The patch can be viewed here: GitHub Patch. How to Apply the Patch Important disclaimer: The patch referenced in this article is specifically designed for Airflow deployments using the Celery executor. If you're using a different executor (such as Kubernetes, Local, or Sequential), you'll need to backport the appropriate changes for your specific executor from the original PR (#30375). The file paths and specific code changes may differ based on your executor configuration. If you're facing similar issues, here's how to apply this patch to your Airflow 2.2.2 installation: Download the Patch File First, download the patch from the GitHub link provided above. You can use wget or directly download the patch file: Shell wget -O airflow-queued-fix.patch https://github.com/gurmeetsaran/airflow/pull/1.patch Navigate to Your Airflow Installation Directory This is typically where your Airflow Python package is installed. Shell cd /path/to/your/airflow/installation Apply the Patch Using git Use the git apply command to apply the patch: Shell git apply --check airflow-queued-fix.patch # Test if the patch can be applied cleanly git apply airflow-queued-fix.patch # Actually apply the patch Restart your Airflow scheduler to apply the changes.Monitor task states to verify that newly queued tasks are being properly processed by the scheduler. Note that this approach should be considered a temporary solution until you can properly upgrade to a newer Airflow version that contains the official fix. Organizational Lessons Resolving the technical challenge was only part of the equation. Equally important was our approach to cross-team communication and coordination: We engaged our platform engineering team early to validate our understanding of Airflow's architecture.We maintained transparent communication with stakeholders so they could manage downstream impacts.We meticulously documented our findings and remediation steps to facilitate future troubleshooting.We learned the value of designating a dedicated communicator — someone not involved in the core debugging but responsible for tracking progress, taking notes, and providing regular updates to leadership, preventing interruptions to the engineering team. We also recognized the importance of assembling the right team — collaborative problem-solvers focused on solutions rather than just identifying issues. Establishing a safe, solution-oriented environment significantly accelerated our progress. I was grateful to have the support of a thoughtful and effective manager who helped create the space for our team to stay focused on diagnosing and resolving the issue, minimizing external distractions. 
Key Takeaways This experience reinforced several valuable lessons: Airflow is powerful but sensitive to scale and configuration parametersComprehensive monitoring and detailed logging are indispensable diagnostic toolsSometimes the issue isn't a failing task but a bottleneck in the orchestration layerVersion-specific bugs can have widespread impact — staying current helps, even when upgrades require planningBackporting targeted patches can be a pragmatic intermediate solution when complete upgrades aren't immediately feasibleEffective cross-team collaboration can dramatically influence incident response outcomes This incident reminded me that while technical expertise is fundamental, the ability to coordinate and communicate effectively across teams is equally crucial. I hope this proves helpful to others who find themselves confronting a mysteriously stuck Airflow task and wondering, "Now what?"
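If you find yourself staring at the same symptom, one cheap first step before patching anything is to measure how long tasks have actually been sitting in "queued." The snippet below is a hedged sketch written against Airflow 2.2-era internals (the queued_dttm column and session helper may differ in other versions); it applies the same ten-minute filter we used for our cron-based workaround, but only reports the stuck task instances instead of mutating their state.

Python
# Hypothetical helper: list task instances stuck in QUEUED for more than 10 minutes.
# Assumes Airflow 2.2.x with direct access to the metadata database; adjust for your version.
from datetime import timedelta

from airflow import settings
from airflow.models import TaskInstance
from airflow.utils import timezone
from airflow.utils.state import State

STUCK_AFTER = timedelta(minutes=10)

def find_stuck_queued_tasks():
    """Return task instances that have been in the QUEUED state longer than STUCK_AFTER."""
    cutoff = timezone.utcnow() - STUCK_AFTER
    session = settings.Session()
    try:
        return (
            session.query(TaskInstance)
            .filter(
                TaskInstance.state == State.QUEUED,
                TaskInstance.queued_dttm < cutoff,
            )
            .all()
        )
    finally:
        session.close()

if __name__ == "__main__":
    for ti in find_stuck_queued_tasks():
        print(f"{ti.dag_id}.{ti.task_id} queued since {ti.queued_dttm}")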
Imagine you're building a skyscraper—not just quickly, but with precision. You rely on blueprints to make sure every beam and every bolt is exactly where it should be. That’s what Infrastructure as Code (IaC) is for today’s cloud-native organizations—a blueprint for the cloud. As businesses race to innovate faster, IaC helps them automate and standardize how cloud resources are built. But here’s the catch: speed without security is like skipping the safety checks on that skyscraper. One misconfigured setting, an exposed secret, or a non-compliant resource can bring the whole thing down—or at least cause serious trouble in production. That’s why the shift-left approach to secure IaC matters more than ever. What Does “Shift-Left” Mean in IaC? Shifting left refers to moving security and compliance checks earlier in the development process. Rather than waiting until deployment or runtime to detect issues, teams validate security policies, compliance rules, and access controls as code is written—enabling faster feedback, reduced rework, and stronger cloud governance. For IaC, this means, Scanning Terraform templates and other configuration files for vulnerabilities and misconfigurations before they are deployed.Validating against cloud-specific best practices.Integrating policy-as-code and security tools into CI/CD pipelines. Why Secure IaC Matters? IaC has completely changed the game when it comes to managing cloud environments. It’s like having a fast-forward button for provisioning—making it quicker, more consistent, and easier to repeat across teams and projects. But while IaC helps solve a lot of the troubles around manual operations, it’s not without its own set of risks. The truth is, one small mistake—just a single misconfigured line in a Terraform script—can have massive consequences. It could unintentionally expose sensitive data, leave the door open for unauthorized access, or cause your setup to drift away from compliance standards. And because everything’s automated, those risks scale just as fast as your infrastructure. In cloud environments like IBM Cloud, where IaC tools like Terraform and Schematics automate the creation of virtual servers, networks, storage, and IAM policies, a security oversight can result in- Publicly exposed resources (e.g., Cloud Object Storage buckets or VPC subnets).Over-permissive IAM roles granting broader access than intended.Missing encryption for data at rest or in transit.Hard-coded secrets and keys within configuration files.Non-compliance with regulatory standards like GDPR, HIPAA, or ISO 27001. These risks can lead to data breaches, service disruptions, and audit failures—especially if they go unnoticed until after deployment. Secure IaC ensures that security and compliance are not afterthoughts but are baked into the development process. It enables: Early detection of mis-configurations and policy violations.Automated remediation before deployment.Audit-ready infrastructure, with traceable and versioned security policies.Shift-left security, empowering developers to code safely without slowing down innovation. When done right, Secure IaC acts as a first line of defense, helping teams deploy confidently while reducing the cost and impact of security fixes later in the lifecycle. Components of Secure IaC Framework The Secure IaC Framework is structured into layered components that guide organizations in embedding security throughout the IaC lifecycle. 
Building Blocks of IaC (core foundation for all other layers): These are the fundamental practices required to enable any Infrastructure as Code approach.
Use declarative configuration (e.g. Terraform, YAML, JSON).
Embrace version control (e.g. Git) for all infrastructure code.
Define idempotent and modular code for reusable infrastructure.
Enable automation pipelines (CI/CD) for repeatable deployments.
Follow consistent naming conventions, tagging policies, and code linting.
Build Secure Infrastructure: Focuses on embedding secure design and architectural patterns into the infrastructure baseline.
Use secure-by-default modules (e.g. encryption, private subnets).
Establish network segmentation, IAM boundaries, and resource isolation.
Configure monitoring, logging, and default denial policies.
Choose secure providers and verified module sources.
Automate Controls: Empowers shift-left security by embedding controls into the development and delivery pipelines.
Run static code analysis (e.g. Trivy, Checkov) pre-commit and in CI.
Enforce policy-as-code using OPA or Sentinel for approvals and denials.
Integrate configuration management and IaC test frameworks (e.g. Terratest).
Detect & Respond: Supports runtime security through visibility, alerting, and remediation.
Enable drift detection tools to track deviations from IaC definitions.
Use runtime compliance monitoring (e.g., IBM Cloud SCC).
Integrate with SOAR platforms or incident playbooks.
Generate security alerts for real-time remediation and Root Cause Analysis (RCA).
Design Governance: Establishes repeatable, scalable security practices across the enterprise.
Promote immutable infrastructure for consistent and tamper-proof environments.
Use golden modules or signed templates with organizational guardrails.
Implement change management via GitOps, PR workflows, and approval gates.
Align with compliance standards (e.g., CIS, NIST, ISO 27001) and produce audit reports.

Anatomy of Secure IaC
Creating a secure IaC environment involves incorporating several best practices and tools to ensure that the infrastructure is resilient, compliant, and protected against potential threats. These practices are implemented and tracked at various phases of the IaC environment lifecycle.
The design phase of IaC involves not just deciding on the IaC script design and tooling, but also designing how organizational policies are incorporated into the IaC scripts.
The development phase of IaC covers coding best practices, implementing the IaC scripts and the policies involved, and the pre-commit checks that the developer can run before committing. These checks help keep check-ins clean and detect code smells upfront.
The build phase of IaC involves all the code security checks and policy verification. This is a quality gate in the pipeline that stops the deployment on any failures.
The deployment phase of IaC supports deployment to various environments along with their respective configurations.
The maintenance phase of IaC is also crucial, as threat detection, vulnerability detection, and monitoring play a key role.

Key Pillars of Secure IaC
Below is a list of key pillars of Secure IaC, incorporating all the essential tools and services.
These pillars align with cloud-native capabilities to enforce a secure-by-design, shift-left approach for Infrastructure as Code: Reference templates like Deployable Architectures or AWS Terraform Modules. Reusable, templatized infrastructure blueprints designed for security, compliance, and scalability.Promotes consistency across environments (dev/test/prod).Often include pre-approved Terraform templates.Managed IaC platforms like IBM Cloud Schematics or AWS CloudFormation. Enables secure execution of Terraform code in isolated workspaces.Supports: Role-Based Access Control (RBAC)Encrypted variablesApproval workflows (via GitOps or manual)Versioned infrastructure plansLifecycle resource management using IBM Cloud Projects or Azure Blueprints Logical grouping of cloud resources tied to governance and compliance requirements.Simplifies multi-environment deployments (e.g. dev, QA, prod).Integrates with IaC deployment and CI/CD for isolated, secure automation pipelines.Secrets Management Centralized secrets vault to manage: API keysCertificatesIAM credentialsProvides dynamic secrets, automatic rotation, access logging, and fine-grained access policies.Key Management Solutions (KMS/HSM) Protects sensitive data at rest or in transit.Manages encryption keys with full customer control and auditability.KMS-backed encryption is critical for storage, databases, and secrets.Compliance Posture Management Provides posture management and continuous compliance monitoring.Enables: Policy-as-Code checks on IaC deploymentsCustom rules enforcementCompliance posture dashboards (CIS, NIST, GDPR)Introduce Continuous Compliance (CC) pipelines as part of the CI/CD pipelines for shift-left enforcement.CI/CD Pipelines (DevSecOps) Integrate security scans and controls into delivery pipelines using GitHub Actions, Tekton, Jenkins, or IBM Cloud Continuous DeliveryPipeline stages include: Terraform lintingStatic analysis (Checkov, tfsec)Secrets scanningCompliance policy validationChange approval gates before Schematics applyPolicy-as-Code Use tools like OPA (Open Policy Agent) policies to: Block insecure resource configurationsRequire tagging, encryption, and access policiesAutomate compliance enforcement during plan and applyIAM & Resource Access Governance Apply least-privilege IAM roles for projects and API keys.Use resource groups to scope access boundaries.Enforce fine-grained access to Secrets Manager, KMS, and Logs.Audit and Logging Integrate with Cloud Logs to: Monitor infrastructure changesAudit access to secrets, projects, and deploymentsDetect anomalies in provisioning behaviorMonitoring and Drift Detection Use monitoring tools like IBM Instana, Drift Detection, or custom Terraform state validation to: Continuously monitor deployed infrastructureCompare live state to defined IaCRemediate unauthorized changes Checklist: Secure IaC 1. Code Validation and Static Analysis Integrate static analysis tools (e.g., Checkov, tfsec) into your development workflow. Scan Terraform templates for misconfigurations and security vulnerabilities. Ensure compliance with best practices and CIS benchmarks. 2. Policy-as-Code Enforcement Define security policies using Open Policy Agent (OPA) or other equivalent tools. Enforce policies during the CI/CD pipeline to prevent non-compliant deployments. Regularly update and audit policies to adapt to evolving security requirements. 3. Secrets and Credential Management Store sensitive information in Secrets Manager. Avoid hardcoding secrets in IaC templates. 
Implement automated secret rotation and access controls. 4. Immutable Infrastructure and Version Control Maintain all IaC templates in a version-controlled repository (e.g., Git). Implement pull request workflows with mandatory code reviews. Tag and document releases for traceability and rollback capabilities. 5. CI/CD Integration with Security Gates Incorporate security scans and compliance checks into the CI/CD pipeline. Set up approval gates to halt deployments on policy violations. Automate testing and validation of IaC changes before deployment. 6. Secure Execution Environment Utilize IBM Cloud Schematics, AWS CloudFormation, or any equivalent tool for executing Terraform templates in isolated environments. Restrict access to execution environments using IAM roles and policies. Monitor and log all execution activities for auditing purposes. 7. Drift Detection and Continuous Monitoring Implement tools to detect configuration drift between deployed resources and IaC templates. Regularly scan deployed resources for compliance. Set up alerts for unauthorized changes or policy violations. Benefits of Shift-Left Secure IaC Here are the key benefits of adopting Shift-Left Secure IaC, tailored for cloud-native teams focused on automation, compliance, and developer enablement: Early Risk Detection and RemediationFaster, More Secure DeploymentsAutomated Compliance EnforcementReduced Human Error and Configuration DriftImproved Developer ExperienceEnhanced Auditability and TraceabilityReduced Cost of Security FixesStronger Governance with IAM and RBACContinuous Posture Assurance Conclusion Adopting a shift-left approach to secure IaC in cloud platforms isn’t just about preventing misconfigurations—it’s about building smarter from the start. When security is treated as a core part of the development process rather than an afterthought, teams can move faster with fewer surprises down the line. With cloud services like Schematics, Projects, Secrets Manager, Key Management, CloudFormation, and Azure Blueprints, organizations have all the tools they need to catch issues early, stay compliant, and automate guardrails. However, the true benefit extends beyond security—it establishes the foundation for platform engineering. By baking secure, reusable infrastructure patterns into internal developer platforms, teams create a frictionless, self-service experience that helps developers ship faster without compromising governance.
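To make the static analysis and policy-as-code checks from the checklist above a little more concrete, here is a minimal, illustrative sketch of a custom policy gate that inspects a Terraform plan exported as JSON (terraform show -json plan.out > plan.json). It is not a substitute for mature scanners such as Checkov, tfsec, or OPA, and the resource type and attribute names it checks are assumptions chosen for illustration; adapt them to your own providers and policies.
Python
import json
import sys

# Minimal, illustrative policy-as-code gate over a Terraform plan exported with:
#   terraform show -json plan.out > plan.json
# The resource type and attribute names below are examples only.

DENY_PUBLIC_ACLS = {"public-read", "public-read-write"}

def load_planned_resources(path):
    """Return the list of planned root-module resources from the plan JSON."""
    with open(path) as f:
        plan = json.load(f)
    return plan.get("planned_values", {}).get("root_module", {}).get("resources", [])

def check_resource(resource):
    """Yield human-readable violations for a single planned resource."""
    values = resource.get("values") or {}
    # Example rule 1: object storage buckets must not be publicly readable.
    if resource.get("type") == "aws_s3_bucket" and values.get("acl") in DENY_PUBLIC_ACLS:
        yield f"{resource['address']}: bucket ACL '{values.get('acl')}' is public"
    # Example rule 2: no literal secrets in configuration values.
    for key, value in values.items():
        if "password" in key.lower() and isinstance(value, str) and value:
            yield f"{resource['address']}: attribute '{key}' holds a hard-coded secret"

def main(path):
    violations = [v for r in load_planned_resources(path) for v in check_resource(r)]
    for violation in violations:
        print("POLICY VIOLATION:", violation)
    # A non-zero exit code lets the CI stage act as a quality gate.
    sys.exit(1 if violations else 0)

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "plan.json")
In a CI pipeline, a step like this would run right after terraform plan and before any apply or Schematics job, failing the build when a violation is found.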
When it comes to auditing and monitoring database activity, Amazon Aurora's Database Activity Stream (DAS) provides a secure and near real-time stream of database activity. By default, DAS encrypts all data in transit using AWS Key Management Service (KMS) with a customer-managed key (CMK) and streams this encrypted data into a Serverless Streaming Data Service - Amazon Kinesis. While this is great for compliance and security, reading and interpreting the encrypted data stream requires additional effort — particularly if you're building custom analytics, alerting, or logging solutions. This article walks you through how to read the encrypted Aurora DAS records from Kinesis using the AWS Encryption SDK. Security and compliance are top priorities when working with sensitive data in the cloud — especially in regulated industries such as finance, healthcare, and government. Amazon Aurora's DAS is designed to help customers monitor database activity in real time, providing deep visibility into queries, connections, and data access patterns. However, this stream of data is encrypted in transit by default using a customer-managed AWS KMS (Key Management Service) key and routed through Amazon Kinesis Data Streams for consumption. While this encryption model enhances data security, it introduces a technical challenge: how do you access and process the encrypted DAS data? The payload cannot be directly interpreted, as it's wrapped in envelope encryption and protected by your KMS CMK. Understanding the Challenge Before discussing the solution, it's important to understand how Aurora DAS encryption works: Envelope Encryption Model: Aurora DAS uses envelope encryption, where the data is encrypted with a data key, and that data key is itself encrypted using your KMS key. Two Encrypted Components: Each record in the Kinesis stream contains: The database activity events encrypted with a data key The data key encrypted with your KMS CMK Kinesis Data Stream Format: The records follow this structure: JSON { "type": "DatabaseActivityMonitoringRecords", "version": "1.1", "databaseActivityEvents": "[encrypted audit records]", "key": "[encrypted data key]" } Solution Overview: AWS Encryption SDK Approach Aurora DAS encrypts data in multiple layers, and the AWS Encryption SDK helps you easily unwrap all that encryption so you can see what’s going on. Here's why this specific approach is required: Handles Envelope Encryption: The SDK is designed to work with the envelope encryption pattern used by Aurora DAS. Integrates with KMS: It seamlessly integrates with your KMS keys for the initial decryption of the data key. Manages Cryptographic Operations: The SDK handles the complex cryptographic operations required for secure decryption. The decryption process follows these key steps: First, decrypt the encrypted data key using your KMS CMK. Then, use that decrypted key to decrypt the database activity events.Finally, decompress the decrypted data to get the readable JSON output Implementation Step 1: Set Up Aurora With Database Activity Streams Before implementing the decryption solution, ensure you have: An Aurora PostgreSQL or MySQL cluster with sufficient permissions A customer-managed KMS key for encryption Database Activity Streams enabled on your Aurora cluster When you turn on DAS, AWS sets up a Kinesis stream called aws-rds-das-[cluster-resource-id] that receives the encrypted data. 
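If you prefer to script this setup, the sketch below shows how enabling the activity stream might look with boto3. The ARNs, identifiers, and region are placeholders, and the call assumes the standard RDS StartActivityStream API plus suitable IAM permissions on both RDS and the KMS key.
Python
import boto3

# Placeholders - replace with your own values.
REGION = "us-east-1"
CLUSTER_ARN = "arn:aws:rds:us-east-1:123456789012:cluster:my-aurora-cluster"
CLUSTER_ID = "my-aurora-cluster"
KMS_KEY_ID = "arn:aws:kms:us-east-1:123456789012:key/your-cmk-id"

rds = boto3.client("rds", region_name=REGION)

# Enable the Database Activity Stream on the Aurora cluster using the CMK.
response = rds.start_activity_stream(
    ResourceArn=CLUSTER_ARN,
    Mode="async",              # 'sync' favors completeness, 'async' favors performance
    KmsKeyId=KMS_KEY_ID,
    ApplyImmediately=True,
)
print("Activity stream status:", response["Status"])
print("Kinesis stream name:", response["KinesisStreamName"])

# The same details can be read back later from the cluster description; the
# stream name follows the aws-rds-das-[cluster-resource-id] pattern noted above.
cluster = rds.describe_db_clusters(DBClusterIdentifier=CLUSTER_ID)["DBClusters"][0]
print("Stream (from cluster):", cluster.get("ActivityStreamKinesisStreamName"))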
Step 2: Prepare the AWS Encryption SDK Environment For decrypting DAS events, your processing application (typically a Lambda function) needs the AWS Encryption SDK. This SDK is not included in standard AWS runtimes and must be added separately. Why this matters: The AWS Encryption SDK provides specialized cryptographic algorithms and protocols designed specifically for envelope encryption patterns used by AWS services like DAS. The most efficient approach is to create a Lambda Layer containing: aws_encryption_sdk: Required for the envelope decryption process boto3: Needed for AWS service interactions, particularly with KMS Step 3: Implement the Decryption Logic Here’s a Lambda function example that handles decrypting DAS events. Each part of the decryption process is thoroughly documented with comments in the code: Python import base64 import json import zlib import boto3 import aws_encryption_sdk from aws_encryption_sdk import CommitmentPolicy from aws_encryption_sdk.internal.crypto import WrappingKey from aws_encryption_sdk.key_providers.raw import RawMasterKeyProvider from aws_encryption_sdk.identifiers import WrappingAlgorithm, EncryptionKeyType # Configuration - update these values REGION_NAME = 'your-region' # Change to your region RESOURCE_ID = 'your cluster resource ID' # Change to your RDS resource ID # Initialize encryption client with appropriate commitment policy # This is required for proper operation with the AWS Encryption SDK enc_client = aws_encryption_sdk.EncryptionSDKClient(commitment_policy=CommitmentPolicy.FORBID_ENCRYPT_ALLOW_DECRYPT) # Custom key provider class for decryption # This class is necessary to use the raw data key from KMS with the Encryption SDK class MyRawMasterKeyProvider(RawMasterKeyProvider): provider_id = "BC" def __new__(cls, *args, **kwargs): obj = super(RawMasterKeyProvider, cls).__new__(cls) return obj def __init__(self, plain_key): RawMasterKeyProvider.__init__(self) # Configure the wrapping key with proper algorithm for DAS decryption self.wrapping_key = WrappingKey( wrapping_algorithm=WrappingAlgorithm.AES_256_GCM_IV12_TAG16_NO_PADDING, wrapping_key=plain_key, wrapping_key_type=EncryptionKeyType.SYMMETRIC ) def _get_raw_key(self, key_id): # Return the wrapping key when the Encryption SDK requests it return self.wrapping_key # First decryption step: use the data key to decrypt the payload def decrypt_payload(payload, data_key): # Create a key provider using our decrypted data key my_key_provider = MyRawMasterKeyProvider(data_key) my_key_provider.add_master_key("DataKey") # Decrypt the payload using the AWS Encryption SDK decrypted_plaintext, header = enc_client.decrypt( source=payload, materials_manager=aws_encryption_sdk.materials_managers.default.DefaultCryptoMaterialsManager( master_key_provider=my_key_provider) ) return decrypted_plaintext # Second step: decompress the decrypted data # DAS events are compressed before encryption to save bandwidth def decrypt_decompress(payload, key): decrypted = decrypt_payload(payload, key) # Use zlib with specific window bits for proper decompression return zlib.decompress(decrypted, zlib.MAX_WBITS + 16) # Main Lambda handler function that processes events from Kinesis def lambda_handler(event, context): session = boto3.session.Session() kms = session.client('kms', region_name=REGION_NAME) for record in event['Records']: # Step 1: Get the base64-encoded data from Kinesis payload = base64.b64decode(record['kinesis']['data']) record_data = json.loads(payload) # Step 2: Extract the two encrypted components 
payload_decoded = base64.b64decode(record_data['databaseActivityEvents']) data_key_decoded = base64.b64decode(record_data['key']) # Step 3: Decrypt the data key using KMS # This is the first level of decryption in the envelope model data_key_decrypt_result = kms.decrypt( CiphertextBlob=data_key_decoded, EncryptionContext={'aws:rds:dbc-id': RESOURCE_ID} ) decrypted_data_key = data_key_decrypt_result['Plaintext'] # Step 4: Use the decrypted data key to decrypt and decompress the events # This is the second level of decryption in the envelope model decrypted_event = decrypt_decompress(payload_decoded, decrypted_data_key) # Step 5: Process the decrypted event # At this point, decrypted_event contains the plaintext JSON of database activity print(decrypted_event) # Additional processing logic would go here # For example, you might: # - Parse the JSON and extract specific fields # - Store events in a database for analysis # - Trigger alerts based on suspicious activities return { 'statusCode': 200, 'body': json.dumps('Processing Complete') } Step 4: Error Handling and Performance Considerations As you implement this solution in production, keep these key factors in mind: Error Handling: KMS permissions: Ensure your Lambda function has the necessary KMS permissions so it can decrypt the data successfully.Encryption context: The context must match exactly (aws:rds:dbc-id) Resource ID: Make sure you're using the correct Aurora cluster resource ID—if it's off, the KMS decryption step will fail. Performance Considerations: Batch size: Configure appropriate Kinesis batch sizes for your Lambda Timeout settings: Decryption operations may require longer timeouts Memory allocation: Processing encrypted streams requires more memory Conclusion Aurora's Database Activity Streams provide powerful auditing capabilities, but the default encryption presents a technical challenge for utilizing this data. By leveraging the AWS Encryption SDK and understanding the envelope encryption model, you can successfully decrypt and process these encrypted streams. The key takeaways from this article are: Aurora DAS uses a two-layer envelope encryption model that requires specialized decryption The AWS Encryption SDK is essential for properly handling this encryption pattern The decryption process involves first decrypting the data key with KMS, then using that key to decrypt the actual events Proper implementation enables you to unlock valuable database activity data for security monitoring and compliance By following this approach, you can build robust solutions that leverage the security benefits of encrypted Database Activity Streams while still gaining access to the valuable insights they contain.
Scaling microservices for holiday peak traffic is crucial to prevent downtime and ensure a seamless user experience. This guide explores Azure DevOps automation, CI/CD pipelines, and cost-optimization strategies to handle high-demand traffic seamlessly. Manual scaling quickly becomes a bottleneck as organizations deploy dozens, sometimes hundreds, of microservices powered by distinct backend services like Cosmos DB, Event Hubs, App Configuration, and Traffic Manager. Multiple teams juggling these components risk costly delays and errors at the worst possible moments. This is where automation comes in: a game-changing solution that transforms complex, error-prone processes into streamlined, efficient operations. In this article, you’ll explore how automated pipelines can not only safeguard your systems during peak traffic but also optimize costs and boost overall performance in this Microservice world. The Challenge in a Microservices World Imagine a project with over 100 microservices, each maintained by different engineering teams. Every service may have its backend components, for example, as shown below: Cosmos DB: Used for storing data with low-latency access and high throughput.Event Hubs: Ingests telemetry and log data from distributed services.App Configuration: Centrally manages application settings and feature flags.Traffic Manager: Routes user traffic to healthy endpoints during failures. Manual Scaling Is Inefficient Coordinating these tasks manually is cumbersome, especially when production issues arise. With multiple teams, interacting and collaborating on each microservice’s scaling and configuration can be overwhelming. This is where CI/CD pipelines and Infrastructure-as-Code (IaC) automation become crucial. Automation not only reduces human error but also provides a unified approach for rapid, reliable scaling and updates. Figure 1: A system overview showing how the Web App (Presentation Layer) interacts with microservices (Business Logic Layer), which use Cosmos DB, Event Hubs, and App Configuration (Data Layer). The Integration & Traffic Management layer, including Traffic Manager and Azure DevOps CI/CD, handles traffic routing, deployments, and Slack notifications. Understanding Each Component AKS (Azure Kubernetes Service) AKS is a managed Kubernetes service that simplifies deploying, scaling, and managing containerized applications. In a microservices environment, each service can be deployed as a container within AKS, with independent scaling rules and resource allocation. This flexibility enables you to adjust the number of pods based on real-time demand, ensuring that each service has the computing resources it needs. Cosmos DB Azure Cosmos DB is a globally distributed, multi-model NoSQL database service that delivers low latency and high throughput. In a microservices architecture, each service may have its own Cosmos DB instance to handle specific data workloads. Automation scripts can dynamically adjust throughput to meet changing demand, ensuring your service remains responsive even during peak loads. Event Hubs Azure Event Hubs is a high-throughput data streaming service designed to ingest millions of events per second. It’s particularly useful in microservices for collecting logs, telemetry, and real-time analytics data. By automating the scaling of Event Hubs, you ensure that your data ingestion pipeline never becomes a bottleneck, even when the number of events spikes during high-traffic periods. 
App Configuration Azure App Configuration is a centralized service that stores configuration settings and feature flags for your applications. In a microservices ecosystem, different services often need unique settings or dynamic feature toggles. Instead of hard-coding these values or updating configurations manually, App Configuration provides a single source of truth that can be updated on the fly. During peak traffic, a microservice can instantly disable resource-heavy features without redeployment. Traffic Manager Azure Traffic Manager is a DNS-based load-balancing solution that directs user traffic based on endpoint health and performance. For microservices, it ensures that requests are automatically rerouted from failing or overloaded endpoints to healthy ones, minimizing downtime and ensuring a seamless user experience, especially during high-stress scenarios like holiday peak traffic. The Traffic Manager ensures disaster recovery by rerouting traffic from a failed region (e.g., East US) to a healthy backup (e.g., West US) in under 30 seconds, thereby minimizing downtime. Figure 2: High-level view of user traffic flowing through Azure Traffic Manager to an AKS cluster with containerized microservices, which interact with Cosmos DB, Event Hubs, and App Configuration for data, logging, and real-time updates. Automating the Process With CI/CD Pipelines Azure DevOps CI/CD pipelines are the backbone of this automation. Here’s how each part fits into the overall process: Continuous integration (CI): Every code commit triggers a CI pipeline that builds and tests your application. This immediate feedback loop ensures that only validated changes move forward.Continuous delivery (CD): Once the CI pipeline produces an artifact, the release pipeline deploys it to production. This deployment stage automatically scales resources (like Cosmos DB and Event Hubs), updates configurations, and manages traffic routing. Dynamic variables, secure service connections, and agent configurations are all set up to interact seamlessly with AKS, Cosmos DB, and other services.Service connections and Slack notifications: Secure service connections (using a service account or App Registration) enable your pipeline to interact with AKS and other resources. Integration with Slack provides real-time notifications on pipeline runs, scaling updates, and configuration changes, keeping your teams informed. Figure 3: Component Diagram — A high-level architectural overview showing Azure DevOps, AKS, Cosmos DB, Event Hubs, App Configuration, Traffic Manager, and Slack interconnected. Core Automation Commands and Validation Below are the essential commands or code for each component, along with validation commands that confirm each update was successful. 1. Kubernetes Pod Autoscaling (HPA) Core Commands Shell # Update HPA settings: kubectl patch hpa <deploymentName> -n <namespace> --patch '{"spec": {"minReplicas": <min>, "maxReplicas": <max>}}' # Validate update: kubectl get hpa <deploymentName> -n <namespace> -o=jsonpath='{.spec.minReplicas}{"-"}{.spec.maxReplicas}{"\n"}' #Expected Output: 3-10 Bash Script for AKS Autoscaling Here’s a shell script for the CI/CD pipeline. This is an example that can be adapted for other automation tasks using technologies such as Terraform, Python, Java, and others. 
Shell #!/bin/bash # File: scaling-pipeline-details.sh # Input file format: namespace:deploymentname:min:max echo "Logging all application HPA pod count before update" kubectl get hpa --all-namespaces -o=jsonpath='{range .items[*]}{.metadata.namespace}{":"}{.metadata.name}{":"}{.spec.minReplicas}{":"}{.spec.maxReplicas}{"\n"}{end}' cd $(System.DefaultWorkingDirectory)$(working_dir) INPUT=$(inputfile) OLDIFS=$IFS IFS=':' [ ! -f $INPUT ] && { echo "$INPUT file not found"; exit 99; } while read namespace deploymentname min max do echo "Namespace: $namespace - Deployment: $deploymentname - min: $min - max: $max" cp $(template) "patch-template-hpa-sample-temp.json" sed -i "s/<<min>>/$min/g" "patch-template-hpa-sample-temp.json" sed -i "s/<<max>>/$max/g" "patch-template-hpa-sample-temp.json" echo "kubectl patch hpa $deploymentname --patch $(cat patch-template-hpa-sample-temp.json) -n $namespace" kubectl get hpa $deploymentname -n $namespace -o=jsonpath='{.metadata.namespace}{":"}{.metadata.name}{":"}{.spec.minReplicas}{":"}{.spec.maxReplicas}{"%0D%0A"}' >> /app/pipeline/log/hpa_before_update_$(datetime).properties #Main command to patch the scaling configuration kubectl patch hpa $deploymentname --patch "$(cat patch-template-hpa-sample-temp.json)" -n $namespace #Main command to validate the scaling configuration kubectl get hpa $deploymentname -n $namespace -o=jsonpath='{.metadata.namespace}{":"}{.metadata.name}{":"}{.spec.minReplicas}{":"}{.spec.maxReplicas}{"%0D%0A"}' >> /app/pipeline/log/hpa_after_update_$(datetime).properties rm -f "patch-template-hpa-sample-temp.json" "patch-template-hpa-sample-temp.json".bak done < $INPUT IFS=$OLDIFS tempVar=$(cat /app/pipeline/log/hpa_before_update_$(datetime).properties) curl -k --location --request GET "https://slack.com/api/chat.postMessage?token=$(slack_token)&channel=$(slack_channel)&text=------HPA+POD+Count+Before+update%3A------%0D%0ANamespace%3AHPA-Name%3AMinReplicas%3AMaxReplicas%0D%0A${tempVar}&username=<username>&icon_emoji=<emoji>" tempVar=$(cat /app/pipeline/log/hpa_after_update_$(datetime).properties) #below line is optional for slack notification. curl -k --location --request GET "https://slack.com/api/chat.postMessage?token=$(slack_token)&channel=$(slack_channel)&text=------HPA+POD+Count+After+update%3A------%0D%0ANamespace%3AHPA-Name%3AMinReplicas%3AMaxReplicas%0D%0A${tempVar}&username=<username>&icon_emoji=<emoji>" Create file: patch-template-hpa-sample.json JSON {"spec": {"maxReplicas": <<max>>, "minReplicas": <<min>>}} 2. Cosmos DB Scaling Core Commands This can be enhanced further in the CI/CD pipeline with different technologies like a shell, Python, Java, etc. Shell # For SQL Database: az cosmosdb sql database throughput update -g <resourceGroup> -a <accountName> -n <databaseName> --max-throughput <newValue> # Validate update: az cosmosdb sql database throughput show -g <resourceGroup> -a <accountName> -n <databaseName> --query resource.autoscaleSettings.maxThroughput -o tsv #Expected Output: 4000 #Input file format: resourceGroup:accountName:databaseName:maxThroughput:dbType:containerName Terraform Code for Cosmos DB Scaling Shell # Terraform configuration for Cosmos DB account with autoscale settings. 
resource "azurerm_cosmosdb_account" "example" { name = "example-cosmosdb-account" location = azurerm_resource_group.example.location resource_group_name = azurerm_resource_group.example.name offer_type = "Standard" kind = "GlobalDocumentDB" enable_automatic_failover = true consistency_policy { consistency_level = "Session" } } resource "azurerm_cosmosdb_sql_database" "example" { name = "example-database" resource_group_name = azurerm_resource_group.example.name account_name = azurerm_cosmosdb_account.example.name } resource "azurerm_cosmosdb_sql_container" "example" { name = "example-container" resource_group_name = azurerm_resource_group.example.name account_name = azurerm_cosmosdb_account.example.name database_name = azurerm_cosmosdb_sql_database.example.name partition_key_path = "/partitionKey" autoscale_settings { max_throughput = 4000 } } 3. Event Hubs Scaling Core Commands This can be enhanced further in the CI/CD pipeline with different technologies like a shell, Python, Java, etc. Shell # Update capacity: az eventhubs namespace update -g <resourceGroup> -n <namespace> --capacity <newCapacity> --query sku.capacity -o tsv # Validate update: az eventhubs namespace show -g <resourceGroup> -n <namespace> --query sku.capacity -o tsv #Expected Output: 6 4. Dynamic App Configuration Updates Core Commands This can be enhanced further in the CI/CD pipeline with different technologies like a shell, Python, Java, etc. Shell # Export current configuration: az appconfig kv export -n <appconfig_name> --label <label> -d file --path backup.properties --format properties -y # Import new configuration: az appconfig kv import -n <appconfig_name> --label <label> -s file --path <input_file> --format properties -y # Validate update: az appconfig kv export -n <appconfig_name> --label <label> -d file --path afterupdate.properties --format properties -y #Input file format: Key-value pairs in standard properties format (e.g., key=value). 5. Traffic Management and Disaster Recovery (Traffic Switch) Core Commands This can be enhanced further in the CI/CD pipeline with different technologies like a shell, Python, Java, etc. 
Shell # Update endpoint status: az network traffic-manager endpoint update --endpoint-status <newStatus> --name <endpointName> --profile-name <profileName> --resource-group <resourceGroup> --type <type> --query endpointStatus -o tsv # Validate update: az network traffic-manager endpoint show --name <endpointName> --profile-name <profileName> --resource-group <resourceGroup> --type <type> --query endpointStatus -o tsv #Expected Output: Enabled #Input file format: profileName:resourceGroup:type:status:endPointName Terraform Code for Traffic Manager (Traffic Switch) JSON resource "azurerm_traffic_manager_profile" "example" { name = "example-tm-profile" resource_group_name = azurerm_resource_group.example.name location = azurerm_resource_group.example.location profile_status = "Enabled" traffic_routing_method = "Priority" dns_config { relative_name = "exampletm" ttl = 30 } monitor_config { protocol = "HTTP" port = 80 path = "/" } } resource "azurerm_traffic_manager_endpoint" "primary" { name = "primaryEndpoint" profile_name = azurerm_traffic_manager_profile.example.name resource_group_name = azurerm_resource_group.example.name type = "externalEndpoints" target = "primary.example.com" priority = 1 } resource "azurerm_traffic_manager_endpoint" "secondary" { name = "secondaryEndpoint" profile_name = azurerm_traffic_manager_profile.example.name resource_group_name = azurerm_resource_group.example.name type = "externalEndpoints" target = "secondary.example.com" priority = 2 } Explanation: These Terraform configurations enable autoscaling and efficient resource allocation for Cosmos DB and Traffic Manager. By leveraging IaC, you ensure consistency and optimize costs by provisioning resources dynamically based on demand. How to Reduce Azure Costs With Auto-Scaling Automation improves operational efficiency and plays a key role in cost optimization. In a microservices ecosystem with hundreds of services, even a small reduction in over-provisioned resources can lead to substantial savings over time. By dynamically scaling resources based on demand, you pay only for what you need. By dynamically adjusting resource usage, businesses can significantly reduce cloud costs. Here are concrete examples: Cosmos DB Autoscaling: For instance, if running 4000 RU/s costs $1,000 per month, reducing it to 1000 RU/s during off-peak hours could lower the bill to $400 monthly, leading to $7,200 in annual savings.AKS Autoscaler: Automatically removing unused nodes ensures you only pay for active compute resources, cutting infrastructure costs by 30%. Visualizing the Process: Sequence Diagram To further clarify the workflow, consider including a Sequence Diagram. This diagram outlines the step-by-step process, from code commit to scaling, configuration updates, and notifications, illustrating how automation interconnects these components. For example, the diagram shows: Developer: Commits code, triggering the CI pipeline.CI pipeline: Builds, tests, and publishes the artifact.CD pipeline: Deploys the artifact to AKS, adjusts Cosmos DB throughput, scales Event Hubs, updates App Configuration, and manages Traffic Manager endpoints.Slack: Sends real-time notifications on each step. Such a diagram visually reinforces the process and helps teams quickly understand the overall workflow. Figure 4: Sequence Diagram — A step-by-step flow illustrating the process from code commit through CI/CD pipelines to resource scaling and Slack notifications. 
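As a rough back-of-the-envelope check on the Cosmos DB savings example above, the arithmetic can be sketched as follows; the prices are the hypothetical figures quoted earlier, not actual Azure rates.
Python
# Back-of-the-envelope model of the Cosmos DB autoscaling example above.
# The costs are the hypothetical figures from the article, not real Azure pricing.

PEAK_RU = 4000          # RU/s provisioned during business hours
OFF_PEAK_RU = 1000      # RU/s after scaling down during off-peak hours
COST_AT_PEAK = 1000.0   # assumed monthly cost if 4000 RU/s ran around the clock (USD)
COST_SCALED = 400.0     # assumed monthly cost with the off-peak scale-down (USD)

monthly_savings = COST_AT_PEAK - COST_SCALED
annual_savings = monthly_savings * 12
print(f"Monthly savings: ${monthly_savings:,.0f}")   # $600
print(f"Annual savings:  ${annual_savings:,.0f}")    # $7,200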
Conclusion Automation is no longer a luxury — it’s the cornerstone of resilient and scalable cloud architectures. In this article, I demonstrated how Azure resources such as Cosmos DB, Event Hubs, App Configuration, Traffic Manager, and AKS can be orchestrated with automation using bash shell scripts, Terraform configurations, Azure CLI commands, and Azure DevOps CI/CD pipelines. These examples illustrate one powerful approach to automating microservices operations during peak traffic. While I showcased the Azure ecosystem, the underlying principles of automation are universal. Similar techniques can be applied to other cloud platforms. Whether you’re using AWS with CloudFormation and CodePipeline or Google Cloud with Deployment Manager and Cloud Build, you can design CI/CD workflows that meet your unique needs. Embrace automation to unlock your infrastructure’s full potential, ensuring your applications not only survive high-demand periods but also thrive under pressure. If you found this guide helpful, subscribe to my Medium blog for more insights on cloud automation. Comment below on your experience with scaling applications or share this with colleagues who might benefit! Your feedback is invaluable and helps shape future content, so let’s keep the conversation going. Happy scaling, and may your holiday traffic be ever in your favor! Further Reading and References Azure Kubernetes Service (AKS) Documentation: Guidance on deploying, managing, and scaling containerized applications using Kubernetes.Azure Cosmos DB Documentation: Dive deep into configuring and scaling your Cosmos DB instances.Azure Event Hubs Documentation: Explore high-throughput data streaming, event ingestion, and telemetry.Azure App Configuration Documentation: Best practices for managing application settings and feature flags in a centralized service.Azure Traffic Manager Documentation: Techniques for DNS-based load balancing and proactive endpoint monitoring.Terraform for Azure: Learn how to leverage Infrastructure as Code (IaC) with Terraform to automate resource provisioning and scaling.Azure DevOps Documentation: Understand CI/CD pipelines, automated deployments, and integrations with Azure services.
Distributed consensus is a fundamental concept in distributed computing that refers to the process by which multiple nodes (servers or computers) in a distributed system agree on a single data value or a sequence of actions, ensuring consistency despite the presence of failures or network partitions. In simpler terms, it's the mechanism that allows independent computers to reach agreement on critical data or operations even when some nodes fail or communication is unreliable. The importance of distributed consensus in today's technology landscape cannot be overstated. It serves as the foundation for: Reliability and Fault tolerance: By requiring agreement among nodes, a consensus algorithm allows the system to keep working correctly even if some servers crash or become unreachable. This ensures there’s no single point of failure and the system can survive node outages. Consistency: Consensus guarantees that all non-faulty nodes have the same view of data or the same sequence of events. This is vital for correctness—for example, in a distributed database, every replica should agree on committed transactions. Coordination: Many coordination tasks in a cluster (such as electing a primary leader or agreeing on a config change) are essentially consensus problems. A robust consensus protocol prevents "split-brain" scenarios by ensuring only one leader is chosen and all nodes agree on who it is. This avoids conflicting decisions and keeps the cluster synchronized. Distributed consensus has found applications across numerous domains: Leader election in fault-tolerant environmentsBlockchain technology for decentralized agreement without central authoritiesDistributed databases to maintain consistency across replicasLoad balancing to efficiently distribute workloads across multiple nodesState machine replication for building reliable distributed services Paxos vs Raft: The Battle for Consensus Dominance When it comes to implementing distributed consensus, two algorithms dominate production systems: Paxos and Raft. Let's examine these algorithms and how they compare. Paxos: The Traditional Consensus Algorithm Paxos, developed by Leslie Lamport in 1998, is foundational to distributed systems research and implementation. It enables a group of computers to reach consensus despite unreliable networks, failure-prone computers, and inaccurate clocks. Paxos has become synonymous with distributed consensus but has been criticized for its complexity and difficulty to understand. In Paxos, the consensus process involves several roles: Proposers: Suggest values to be chosenAcceptors: Vote on proposed valuesLearners: Learn about chosen values The algorithm operates in two main phases: a prepare phase and an accept phase, ensuring safety even when multiple leaders attempt to establish consensus simultaneously. Raft: The Understandable Alternative Raft, introduced by Diego Ongaro and John Ousterhout in 2014, was explicitly designed to solve the same problems as Paxos but with a focus on understandability. The creators titled their paper "In Search of an Understandable Consensus Algorithm," highlighting their primary goal. 
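To give a flavor of why Raft is considered approachable, here is a small, deliberately simplified sketch of the follower-side vote check used during leader election. It is a teaching illustration only, not code from the Raft paper or any production library, and it ignores persistence, RPC plumbing, and term bookkeeping beyond the bare minimum.
Python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LogState:
    """The only log information a node needs to judge how up to date a log is."""
    last_term: int   # term of the last entry in the log
    last_index: int  # index of the last entry in the log

def log_is_up_to_date(candidate: LogState, voter: LogState) -> bool:
    """Raft's rule: compare the terms of the last entries first, then log length."""
    if candidate.last_term != voter.last_term:
        return candidate.last_term > voter.last_term
    return candidate.last_index >= voter.last_index

def grant_vote(candidate_term: int, voter_term: int, voted_for: Optional[str],
               candidate_id: str, candidate_log: LogState, voter_log: LogState) -> bool:
    """A follower grants its vote only if the candidate's term is not stale,
    it has not already voted for a different candidate in this term, and the
    candidate's log is at least as up to date as its own."""
    if candidate_term < voter_term:
        return False
    if voted_for not in (None, candidate_id):
        return False
    return log_is_up_to_date(candidate_log, voter_log)

# A candidate with a shorter log can still win the vote if its last entry
# comes from a newer term.
print(grant_vote(candidate_term=5, voter_term=5, voted_for=None, candidate_id="s2",
                 candidate_log=LogState(last_term=5, last_index=7),
                 voter_log=LogState(last_term=4, last_index=10)))  # True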
Raft simplifies the consensus process by: Dividing the problem into leader election, log replication, and safetyUsing a more straightforward approach to leader electionEmploying a strong leader model where all changes flow through the leader Key Differences Between Paxos and Raft Despite serving the same purpose, Paxos and Raft differ in several important ways: Leader Election: Raft only allows servers with up-to-date logs to become leaders, whereas Paxos allows any server to be leader provided it then updates its log to ensure it is up-to-date.Voting Behavior: Paxos followers will vote for any candidate, while Raft followers will only vote for a candidate if the candidate's log is at least as up-to-date as their own.Log Replication: If a leader has uncommitted log entries from a previous term, Paxos will replicate them in the current term, whereas Raft will replicate them in their original term.Complexity vs. Efficiency: While Raft is generally considered more understandable, Paxos can be more efficient in certain scenarios. However, Raft's leader election is surprisingly lightweight compared to Paxos since it doesn't require log entries to be exchanged during the election process. Interestingly, research suggests that much of Raft's purported understandability comes from its clear presentation rather than fundamental differences in the underlying algorithm. Recent Distributed Consensus Protocols: Kafka Raft (KRaft) One of the most significant recent developments in distributed consensus is Apache Kafka Raft (KRaft), which represents a fundamental evolution in Apache Kafka's architecture. What is Kafka Raft? KRaft is a consensus protocol introduced in KIP-500 to remove Apache Kafka's dependency on ZooKeeper for metadata management. This change significantly simplifies Kafka's architecture by consolidating responsibility for metadata within Kafka itself, rather than splitting it between two different systems (ZooKeeper and Kafka). How KRaft Works KRaft operates through a new quorum controller service that replaces the previous controller and utilizes an event-based variant of the Raft consensus protocol. Key aspects of KRaft include: Event-Sourced Storage Model: The quorum controller stores its state using an event-sourced approach, ensuring that internal state machines can always be accurately recreated.Metadata Topic: The event log used to store state (also known as the metadata topic) is periodically condensed by snapshots to prevent unlimited growth.Quick Recovery: If a node pauses due to a network partition, it can quickly catch up by accessing the log when it rejoins, significantly decreasing downtime and improving recovery time.Efficient Leadership Changes: Unlike the ZooKeeper-based controller, the quorum controller doesn't need to load state before becoming active. When leadership changes, the new controller already has all committed metadata records in memory. 
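To illustrate the event-sourced storage idea in the abstract, here is a toy sketch of a metadata state machine that is rebuilt by replaying an event log and periodically condensed into a snapshot, so that a restarting or lagging node only replays the tail of the log. It conveys the concept only and bears no relation to Kafka's actual internal data structures.
Python
# Toy illustration of an event-sourced metadata log with snapshotting.
# Concept only; this is not Kafka's internal design.

class MetadataState:
    def __init__(self):
        self.topics = {}  # topic name -> partition count

    def apply(self, event):
        kind = event["type"]
        if kind == "CreateTopic":
            self.topics[event["name"]] = event["partitions"]
        elif kind == "DeleteTopic":
            self.topics.pop(event["name"], None)

class MetadataLog:
    """Append-only event log; a snapshot lets nodes skip already-applied events."""
    def __init__(self):
        self.events = []
        self.snapshot = (0, MetadataState())  # (number of events applied, state copy)

    def append(self, event):
        self.events.append(event)

    def take_snapshot(self):
        state = MetadataState()
        for event in self.events:
            state.apply(event)
        self.snapshot = (len(self.events), state)

    def rebuild(self):
        """What a restarting or lagging node does: load the snapshot, replay the tail."""
        applied, base = self.snapshot
        state = MetadataState()
        state.topics = dict(base.topics)      # start from the snapshot
        for event in self.events[applied:]:   # replay only newer events
            state.apply(event)
        return state

log = MetadataLog()
log.append({"type": "CreateTopic", "name": "orders", "partitions": 6})
log.take_snapshot()
log.append({"type": "CreateTopic", "name": "payments", "partitions": 3})
print(log.rebuild().topics)  # {'orders': 6, 'payments': 3}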
Benefits of KRaft over Traditional Approaches The adoption of KRaft offers several advantages: Simplified Architecture: By eliminating the need for ZooKeeper, KRaft reduces the complexity of Kafka deployments.Improved Scalability: The new architecture enhances Kafka's ability to scale by removing bottlenecks associated with ZooKeeper.Better Maintainability: With fewer components to manage, Kafka clusters become easier to maintain and operate.Enhanced Performance: The event-driven nature of the KRaft protocol improves metadata management performance compared to the previous RPC-based approach.Faster Recovery: The event-sourced model allows for quicker recovery from failures, improving overall system reliability. Conclusion: The Future of Distributed Consensus As distributed systems continue to evolve and scale, distributed consensus remains a critical foundation for building reliable, fault-tolerant applications. The journey from complex algorithms like Paxos to more understandable alternatives like Raft demonstrates the field's maturation and the industry's focus on practical implementations. The development of specialized consensus protocols like KRaft shows how consensus algorithms are being tailored to specific use cases, optimizing for particular requirements rather than applying one-size-fits-all solutions. This trend is likely to continue as more systems adopt consensus-based approaches for reliability. Looking ahead, several developments are shaping the future of distributed consensus: Simplified Implementations: Following Raft's lead, there's a growing emphasis on making consensus algorithms more accessible and easier to implement correctly.Specialized Variants: Domain-specific consensus protocols optimized for particular use cases, like KRaft for Kafka.Integration into Application Frameworks: Consensus mechanisms are increasingly being built directly into application frameworks rather than requiring separate coordination services.Scalability Improvements: Research continues on making consensus algorithms more efficient at scale, potentially reducing the trade-off between consistency and performance. As distributed systems become more prevalent in our computing infrastructure, understanding and implementing distributed consensus effectively will remain a crucial skill for system designers and developers. Whether through classic algorithms like Paxos, more approachable alternatives like Raft, or specialized implementations like KRaft, distributed consensus will continue to serve as the backbone of reliable distributed systems.
You create a well-defined architecture, but how do you enforce this architecture in your code? Code reviews can be used, but wouldn't it be better to verify your architecture automatically? With ArchUnit you can define rules for your architecture by means of unit tests. Introduction The architecture of an application is described in the documentation. This can be a Word document, a PlantUML diagram, a DrawIO diagram, or whatever you like to use. The developers should follow this architecture when building the application. But, we do know that many do not like to read documentation, and therefore, the architecture might not be known to everyone in the team. With the help of ArchUnit, you can define rules for your architecture within a unit test. This is a very convenient way to do so, because the test will fail when an architecture rule is violated. The official documentation and examples of ArchUnit are a good starting point for using ArchUnit. Besides ArchUnit, Taikai will be discussed, which contains some predefined rules for ArchUnit. The sources used in this blog can be found on GitHub. Prerequisites Prerequisites for reading this blog are: Basic knowledge of architecture styles (layered architecture, hexagonal architecture, and so on);Basic knowledge of Maven;Basic knowledge of Java;Basic knowledge of JUnit;Basic knowledge of Spring Boot; Basic Spring Boot App A basic Spring Boot application is used to verify the architecture rules. It is the starting point for every example used and is present in the base package. The package structure is as follows and contains specific packages for the controller, the service, the repository, and the model. Plain Text ├── controller │ └── CustomersController.java ├── model │ └── Customer.java ├── repository │ └── CustomerRepository.java └── service ├── CustomerServiceImpl.java └── CustomerService.java Package Dependency Checks Before getting started with writing the test, the archunit-junit5 dependency needs to be added to the pom. XML <dependency> <groupId>com.tngtech.archunit</groupId> <artifactId>archunit-junit5</artifactId> <version>1.4.0</version> <scope>test</scope> </dependency> The architecture rule to be added will check whether classes that reside in the service package can only be accessed by classes that reside in the controller or service packages. By means of the @AnalyzeClasses annotation, you can determine which packages should be analyzed. The rule itself is annotated with @ArchTest and the rule is written in a very readable way. Java @AnalyzeClasses(packages = "com.mydeveloperplanet.myarchunitplanet.example1") public class MyArchitectureTest { @ArchTest public static final ArchRule myRule = classes() .that().resideInAPackage("..service..") .should().onlyBeAccessed().byAnyPackage("..controller..", "..service.."); } The easiest way is to run this test from within your IDE. You can also run the test by means of Maven. Shell mvn -Dtest=com.mydeveloperplanet.myarchunitplanet.example1.MyArchitectureTest test The test is successful. Add a Util class in the example1.util package, which makes use of the CustomerService class. This is a violation of the architecture rule you just defined. Java public class Util { @Autowired CustomerService customerService; public void doSomething() { // use the CustomerService customerService.deleteCustomer(1L); } } Run the test again, and now it fails with a clear description of what is wrong. 
Java java.lang.AssertionError: Architecture Violation [Priority: MEDIUM] - Rule 'classes that reside in a package '..service..' should only be accessed by any package ['..controller..', '..service..']' was violated (1 times): Method <com.mydeveloperplanet.myarchunitplanet.example1.util.Util.doSomething()> calls method <com.mydeveloperplanet.myarchunitplanet.example1.service.CustomerService.deleteCustomer(java.lang.Long)> in (Util.java:14) at com.tngtech.archunit.lang.ArchRule$Assertions.assertNoViolation(ArchRule.java:94) at com.tngtech.archunit.lang.ArchRule$Assertions.check(ArchRule.java:86) at com.tngtech.archunit.lang.ArchRule$Factory$SimpleArchRule.check(ArchRule.java:165) at com.tngtech.archunit.lang.syntax.ObjectsShouldInternal.check(ObjectsShouldInternal.java:81) at com.tngtech.archunit.junit.internal.ArchUnitTestDescriptor$ArchUnitRuleDescriptor.execute(ArchUnitTestDescriptor.java:168) at com.tngtech.archunit.junit.internal.ArchUnitTestDescriptor$ArchUnitRuleDescriptor.execute(ArchUnitTestDescriptor.java:151) at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) Exclude Test Classes In the example2 package, a CustomerServiceImplTest is added. This test makes use of classes that reside in the services package, but the test itself is located in the example2 package. The same ArchUnit test is used as before. Run the ArchUnit test, and the test fails because CustomerServiceImplTest does not reside in the service package. Java java.lang.AssertionError: Architecture Violation [Priority: MEDIUM] - Rule 'classes that reside in a package '..service..' should only be accessed by any package ['..controller..', '..service..']' was violated (5 times): Method <com.mydeveloperplanet.myarchunitplanet.example2.CustomerServiceImplTest.testCreateCustomer()> calls method <com.mydeveloperplanet.myarchunitplanet.example2.service.CustomerServiceImpl.createCustomer(com.mydeveloperplanet.myarchunitplanet.example2.model.Customer)> in (CustomerServiceImplTest.java:64) Method <com.mydeveloperplanet.myarchunitplanet.example2.CustomerServiceImplTest.testDeleteCustomer()> calls method <com.mydeveloperplanet.myarchunitplanet.example2.service.CustomerServiceImpl.deleteCustomer(java.lang.Long)> in (CustomerServiceImplTest.java:88) Method <com.mydeveloperplanet.myarchunitplanet.example2.CustomerServiceImplTest.testGetAllCustomers()> calls method <com.mydeveloperplanet.myarchunitplanet.example2.service.CustomerServiceImpl.getAllCustomers()> in (CustomerServiceImplTest.java:42) Method <com.mydeveloperplanet.myarchunitplanet.example2.CustomerServiceImplTest.testGetCustomerById()> calls method <com.mydeveloperplanet.myarchunitplanet.example2.service.CustomerServiceImpl.getCustomerById(java.lang.Long)> in (CustomerServiceImplTest.java:53) Method <com.mydeveloperplanet.myarchunitplanet.example2.CustomerServiceImplTest.testUpdateCustomer()> calls method <com.mydeveloperplanet.myarchunitplanet.example2.service.CustomerServiceImpl.updateCustomer(java.lang.Long, com.mydeveloperplanet.myarchunitplanet.example2.model.Customer)> in (CustomerServiceImplTest.java:79) at com.tngtech.archunit.lang.ArchRule$Assertions.assertNoViolation(ArchRule.java:94) at com.tngtech.archunit.lang.ArchRule$Assertions.check(ArchRule.java:86) at com.tngtech.archunit.lang.ArchRule$Factory$SimpleArchRule.check(ArchRule.java:165) at com.tngtech.archunit.lang.syntax.ObjectsShouldInternal.check(ObjectsShouldInternal.java:81) at 
com.tngtech.archunit.junit.internal.ArchUnitTestDescriptor$ArchUnitRuleDescriptor.execute(ArchUnitTestDescriptor.java:168) at com.tngtech.archunit.junit.internal.ArchUnitTestDescriptor$ArchUnitRuleDescriptor.execute(ArchUnitTestDescriptor.java:151) at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) You might want to exclude test classes from the architecture rules checks. This can be done by adding importOptions to the @AnalyzeClasses annotation as follows. Java @AnalyzeClasses(packages = "com.mydeveloperplanet.myarchunitplanet.example2", importOptions = ImportOption.DoNotIncludeTests.class) Run the test again, and now it is successful. Layer Checks ArchUnit provides some built-in checks for different architecture styles like a layered architecture or an onion (hexagonal) architecture. These are present in the Library API. The example3 package is based on the base package code, but in the CustomerRepository, the CustomerService is injected and used in method updateCustomer. This violates the layered architecture principles. Java @Repository public class CustomerRepository { @Autowired private DSLContext dslContext; @Autowired private CustomerServiceImpl customerService; ... public Customer updateCustomer(Long id, Customer customerDetails) { boolean exists = dslContext.fetchExists(dslContext.selectFrom(Customers.CUSTOMERS)); if (exists) { customerService.deleteCustomer(id); dslContext.update(Customers.CUSTOMERS) .set(Customers.CUSTOMERS.FIRST_NAME, customerDetails.getFirstName()) .set(Customers.CUSTOMERS.LAST_NAME, customerDetails.getLastName()) .where(Customers.CUSTOMERS.ID.eq(id)) .returning() .fetchOne(); return customerDetails; } else { throw new RuntimeException("Customer not found"); } } In order to verify any violations, the ArchUnit test makes use of the layeredArchitecture. You define the layers first, and then you add the constraints for each layer. Java @AnalyzeClasses(packages = "com.mydeveloperplanet.myarchunitplanet.example3") public class MyArchitectureTest { @ArchTest public static final ArchRule myRule = layeredArchitecture() .consideringAllDependencies() .layer("Controller").definedBy("..controller..") .layer("Service").definedBy("..service..") .layer("Persistence").definedBy("..repository..") .whereLayer("Controller").mayNotBeAccessedByAnyLayer() .whereLayer("Service").mayOnlyBeAccessedByLayers("Controller") .whereLayer("Persistence").mayOnlyBeAccessedByLayers("Service"); } The test fails because the Persistence layer accesses the Service layer, which only the Controller layer is allowed to do. 
Java java.lang.AssertionError: Architecture Violation [Priority: MEDIUM] - Rule 'Layered architecture considering all dependencies, consisting of layer 'Controller' ('..controller..') layer 'Service' ('..service..') layer 'Persistence' ('..repository..') where layer 'Controller' may not be accessed by any layer where layer 'Service' may only be accessed by layers ['Controller'] where layer 'Persistence' may only be accessed by layers ['Service']' was violated (2 times): Field <com.mydeveloperplanet.myarchunitplanet.example3.repository.CustomerRepository.customerService> has type <com.mydeveloperplanet.myarchunitplanet.example3.service.CustomerServiceImpl> in (CustomerRepository.java:0) Method <com.mydeveloperplanet.myarchunitplanet.example3.repository.CustomerRepository.updateCustomer(java.lang.Long, com.mydeveloperplanet.myarchunitplanet.example3.model.Customer)> calls method <com.mydeveloperplanet.myarchunitplanet.example3.service.CustomerServiceImpl.deleteCustomer(java.lang.Long)> in (CustomerRepository.java:52) at com.tngtech.archunit.lang.ArchRule$Assertions.assertNoViolation(ArchRule.java:94) at com.tngtech.archunit.lang.ArchRule$Assertions.check(ArchRule.java:86) at com.tngtech.archunit.library.Architectures$LayeredArchitecture.check(Architectures.java:347) at com.tngtech.archunit.junit.internal.ArchUnitTestDescriptor$ArchUnitRuleDescriptor.execute(ArchUnitTestDescriptor.java:168) at com.tngtech.archunit.junit.internal.ArchUnitTestDescriptor$ArchUnitRuleDescriptor.execute(ArchUnitTestDescriptor.java:151) at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) Taikai The Taikai library provides some predefined rules for various technologies and extends the ArchUnit library. Let's see how this works. First, add the dependency to the pom. XML <dependency> <groupId>com.enofex</groupId> <artifactId>taikai</artifactId> <version>1.8.0</version> <scope>test</scope> </dependency> In the example4 package, you add the following test. As you can see, this test is quite comprehensive. 
Java class MyArchitectureTest { @Test void shouldFulfillConstraints() { Taikai.builder() .namespace("com.mydeveloperplanet.myarchunitplanet.example4") .java(java -> java .noUsageOfDeprecatedAPIs() .methodsShouldNotDeclareGenericExceptions() .utilityClassesShouldBeFinalAndHavePrivateConstructor() .imports(imports -> imports .shouldHaveNoCycles() .shouldNotImport("..shaded..") .shouldNotImport("org.junit..")) .naming(naming -> naming .classesShouldNotMatch(".*Impl") .methodsShouldNotMatch("^(foo$|bar$).*") .fieldsShouldNotMatch(".*(List|Set|Map)$") .fieldsShouldMatch("com.enofex.taikai.Matcher", "matcher") .constantsShouldFollowConventions() .interfacesShouldNotHavePrefixI())) .logging(logging -> logging .loggersShouldFollowConventions(Logger.class, "logger", List.of(PRIVATE, FINAL))) .test(test -> test .junit5(junit5 -> junit5 .classesShouldNotBeAnnotatedWithDisabled() .methodsShouldNotBeAnnotatedWithDisabled())) .spring(spring -> spring .noAutowiredFields() .boot(boot -> boot .springBootApplicationShouldBeIn("com.enofex.taikai")) .configurations(configuration -> configuration .namesShouldEndWithConfiguration()) .controllers(controllers -> controllers .shouldBeAnnotatedWithRestController() .namesShouldEndWithController() .shouldNotDependOnOtherControllers() .shouldBePackagePrivate()) .services(services -> services .shouldBeAnnotatedWithService() .shouldNotDependOnControllers() .namesShouldEndWithService()) .repositories(repositories -> repositories .shouldBeAnnotatedWithRepository() .shouldNotDependOnServices() .namesShouldEndWithRepository())) .build() .check(); } } Run the test. The test fails because it is not allowed to have classes ending with Impl. The error is similar to that with ArchUnit. Java java.lang.AssertionError: Architecture Violation [Priority: MEDIUM] - Rule 'Classes should not have names matching .*Impl' was violated (1 times): Class <com.mydeveloperplanet.myarchunitplanet.example4.service.CustomerServiceImpl> has name matching '.*Impl' in (CustomerServiceImpl.java:0) at com.tngtech.archunit.lang.ArchRule$Assertions.assertNoViolation(ArchRule.java:94) at com.tngtech.archunit.lang.ArchRule$Assertions.check(ArchRule.java:86) at com.tngtech.archunit.lang.ArchRule$Factory$SimpleArchRule.check(ArchRule.java:165) at com.enofex.taikai.TaikaiRule.check(TaikaiRule.java:66) at com.enofex.taikai.Taikai.lambda$check$1(Taikai.java:70) at java.base/java.lang.Iterable.forEach(Iterable.java:75) at com.enofex.taikai.Taikai.check(Taikai.java:70) at com.mydeveloperplanet.myarchunitplanet.example4.MyArchitectureTest.shouldFulfillConstraints(MyArchitectureTest.java:60) at java.base/java.lang.reflect.Method.invoke(Method.java:580) at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) However, unlike with ArchUnit, this test fails when the first condition fails. So, you need to fix this one first, run the test again, and the next violation is shown, and so on. I created an improvement issue for this. This issue was fixed and released (v1.9.0) immediately. A new checkAll method is added that checks all rules. Java @Test void shouldFulfillConstraintsCheckAll() { Taikai.builder() .namespace("com.mydeveloperplanet.myarchunitplanet.example4") ... .build() .checkAll(); } Run this test, and all violations are reported. This way, you can fix them all at once. 
Java java.lang.AssertionError: Architecture Violation [Priority: MEDIUM] - Rule 'All Taikai rules' was violated (7 times): Class <com.mydeveloperplanet.myarchunitplanet.example4.controller.CustomersController> has modifier PUBLIC in (CustomersController.java:0) Class <com.mydeveloperplanet.myarchunitplanet.example4.service.CustomerService> is not annotated with org.springframework.stereotype.Service in (CustomerService.java:0) Class <com.mydeveloperplanet.myarchunitplanet.example4.service.CustomerServiceImpl> does not have name matching '.+Service' in (CustomerServiceImpl.java:0) Class <com.mydeveloperplanet.myarchunitplanet.example4.service.CustomerServiceImpl> has name matching '.*Impl' in (CustomerServiceImpl.java:0) Field <com.mydeveloperplanet.myarchunitplanet.example4.controller.CustomersController.customerService> is annotated with org.springframework.beans.factory.annotation.Autowired in (CustomersController.java:0) Field <com.mydeveloperplanet.myarchunitplanet.example4.repository.CustomerRepository.dslContext> is annotated with org.springframework.beans.factory.annotation.Autowired in (CustomerRepository.java:0) Field <com.mydeveloperplanet.myarchunitplanet.example4.service.CustomerServiceImpl.customerRepository> is annotated with org.springframework.beans.factory.annotation.Autowired in (CustomerServiceImpl.java:0) at com.enofex.taikai.Taikai.checkAll(Taikai.java:102) at com.mydeveloperplanet.myarchunitplanet.example4.MyArchitectureTest.shouldFulfillConstraintsCheckAll(MyArchitectureTest.java:108) at java.base/java.lang.reflect.Method.invoke(Method.java:580) at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) at java.base/java.util.ArrayList.forEach(ArrayList.java:1596) Taikai: Issues Fixed In the example5 package, all issues reported by the Taikai test are fixed. At first, it seemed that some checks did not function correctly. For this, too, an issue was registered, and again the maintainer responded quickly. It turned out to be a misunderstanding of how the rules work. Reading the documentation a bit more carefully, the failOnEmpty option fails the check when a rule does not match anything at all; in that case, it is possible that the rule is misconfigured. This is the case with fieldsShouldMatch and springBootApplicationShouldBeIn. A new test is added to show this functionality. Java @Test void shouldFulfillConstraintsFailOnEmpty() { Taikai.builder() .namespace("com.mydeveloperplanet.myarchunitplanet.example5") .failOnEmpty(true) ... } The springBootApplicationShouldBeIn rule should be configured with the package where the main Spring Boot application is located. Java .spring(spring -> spring .noAutowiredFields() .boot(boot -> boot .springBootApplicationShouldBeIn("com.mydeveloperplanet.myarchunitplanet.example5")) Conclusion ArchUnit is an easy-to-use library for enforcing architectural rules. A developer will be notified of an architectural violation when the ArchUnit test fails. This ensures that the architecture rules are clear to everyone. The Taikai library provides easy-to-use predefined rules that can be applied immediately without too much configuration.
Stelios Manioudakis, PhD
Lead Engineer,
Technical University of Crete
Faisal Khatri
Senior Testing Specialist,
Kafaat Business Solutions شركة كفاءات حلول الأعمال