Integration refers to the process of combining software parts (or subsystems) into one system. An integration framework is a lightweight utility that provides libraries and standardized methods to coordinate messaging among different technologies. As software connects the world in increasingly complex ways, integration makes it all possible by facilitating app-to-app communication. Learn more about this necessity for modern software development by keeping a pulse on industry topics such as integrated development environments, API best practices, service-oriented architecture, enterprise service buses, communication architectures, integration testing, and more.
Four Essential Tips for Building a Robust REST API in Java
Using SQS With JMS for Legacy Applications
In this third part of our CDK series, the project cdk-quarkus-s3, in the same Git repository, will be used to illustrate a couple of advanced Quarkus-to-AWS integration features, together with several tricks specific to RESTeasy which is, as everyone knows, the Red Hat implementation of the Jakarta REST specifications. Let's start by looking at the project's pom.xml file, which drives the Maven build process. You'll see the following dependencies:

...
<dependency>
  <groupId>io.quarkiverse.amazonservices</groupId>
  <artifactId>quarkus-amazon-s3</artifactId>
</dependency>
<dependency>
  <groupId>io.quarkus</groupId>
  <artifactId>quarkus-amazon-lambda-http</artifactId>
</dependency>
<dependency>
  <groupId>io.quarkus</groupId>
  <artifactId>quarkus-rest-jackson</artifactId>
</dependency>
<dependency>
  <groupId>io.quarkus</groupId>
  <artifactId>quarkus-rest-client</artifactId>
</dependency>
...
<dependency>
  <groupId>software.amazon.awssdk</groupId>
  <artifactId>netty-nio-client</artifactId>
</dependency>
<dependency>
  <groupId>software.amazon.awssdk</groupId>
  <artifactId>url-connection-client</artifactId>
</dependency>
...

The first dependency in the listing above, quarkus-amazon-s3, is a Quarkus extension allowing your code to act as an AWS S3 client and to store and delete objects in buckets, implement backup and recovery strategies, archive data, etc. The next dependency, quarkus-amazon-lambda-http, is another Quarkus extension that aims at supporting the AWS HTTP Gateway API. As the reader already knows from the two previous parts of this series, with Quarkus, one can deploy a REST API as an AWS Lambda function using either the AWS HTTP Gateway API or the AWS REST Gateway API. Here we'll be using the former, which is less expensive, hence the mentioned extension. If we wanted to use the AWS REST Gateway API, we would have had to replace the quarkus-amazon-lambda-http extension with the quarkus-amazon-lambda-rest one.

What To Expect
In this project, we'll be using Quarkus 3.11 which, at the time of this writing, is the most recent release. Some of the RESTeasy dependencies have changed compared with former versions, hence the quarkus-rest-jackson dependency, which now replaces the quarkus-resteasy one used in 3.10 and before. Also, the quarkus-rest-client extension, implementing the Eclipse MP REST Client specification, is needed for test purposes, as we will see in a moment. Last but not least, the url-connection-client Quarkus extension is needed because the MP REST Client implementation uses it by default and, consequently, it has to be included in the build process.

Now, let's look at our new REST API. Open the Java class S3FileManagementApi in the cdk-quarkus-s3 project and you'll see that it defines three operations: download file, upload file, and list files. All three use the same S3 bucket created as a part of the CDK application's stack.

Java
@Path("/s3")
public class S3FileManagementApi {
    @Inject
    S3Client s3;

    @ConfigProperty(name = "bucket.name")
    String bucketName;

    @POST
    @Path("upload")
    @Consumes(MediaType.MULTIPART_FORM_DATA)
    public Response uploadFile(@Valid FileMetadata fileMetadata) throws Exception {
        PutObjectRequest request = PutObjectRequest.builder()
            .bucket(bucketName)
            .key(fileMetadata.filename)
            .contentType(fileMetadata.mimetype)
            .build();
        s3.putObject(request, RequestBody.fromFile(fileMetadata.file));
        return Response.ok().status(Response.Status.CREATED).build();
    }
    ...
}

Explaining the Code
The code fragment above reproduces only the upload file operation, the other two being very similar.
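Purely for orientation, here is a minimal, hedged sketch of what the list operation could look like using the same injected S3Client and bucket name. The FileObject DTO is a hypothetical name used here for illustration (its objectKey and size fields mirror what the unit test shown later expects); refer to the repository for the real implementation.

Java
// Hypothetical sketch of the list operation in the same resource class.
// FileObject is an assumed DTO with objectKey and size fields, matching the unit test shown later.
@GET
@Path("list")
@Produces(MediaType.APPLICATION_JSON)
public List<FileObject> listFiles() {
    ListObjectsV2Request listRequest = ListObjectsV2Request.builder()
        .bucket(bucketName)
        .build();
    return s3.listObjectsV2(listRequest).contents().stream()
        .map(s3Object -> new FileObject(s3Object.key(), s3Object.size()))
        .toList();
}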
Observe how simple the instantiation of the S3Client is, taking advantage of Quarkus CDI, which avoids the need for several boilerplate lines of code. Also, we're using the Eclipse MP Config specification to define the name of the destination S3 bucket. Our endpoint uploadFile() accepts POST requests and consumes the MULTIPART_FORM_DATA MIME type, whose data is structured in two distinct parts: one for the payload and the other containing the file to be uploaded. The endpoint takes an input parameter of the class FileMetadata, shown below:

Java
public class FileMetadata {
    @RestForm
    @NotNull
    public File file;

    @RestForm
    @PartType(MediaType.TEXT_PLAIN)
    @NotEmpty
    @Size(min = 3, max = 40)
    public String filename;

    @RestForm
    @PartType(MediaType.TEXT_PLAIN)
    @NotEmpty
    @Size(min = 10, max = 127)
    public String mimetype;
    ...
}

This class is a data object grouping the file to be uploaded together with its name and MIME type. It uses the RESTeasy-specific @RestForm annotation to handle HTTP requests that have multipart/form-data as their content type. The use of jakarta.validation.constraints annotations is very practical as well for validation purposes. To come back to our endpoint above, it creates a PutObjectRequest having as input arguments the destination bucket name, a key that uniquely identifies the stored file in the bucket (in this case, the file name), and the associated MIME type, for example TEXT_PLAIN for a text file. Once the PutObjectRequest is created, it is sent via an HTTP PUT request to the AWS S3 service. Please notice how easily the file to be uploaded is inserted into the request body using the RequestBody.fromFile(...) statement.

That's all as far as the REST API exposed as an AWS Lambda function is concerned. Now let's look at what's new in our CDK application's stack:

Java
...
HttpApi httpApi = HttpApi.Builder.create(this, "HttpApiGatewayIntegration")
    .defaultIntegration(HttpLambdaIntegration.Builder.create("HttpApiGatewayIntegration", function).build()).build();
httpApiGatewayUrl = httpApi.getUrl();
CfnOutput.Builder.create(this, "HttpApiGatewayUrlOutput").value(httpApi.getUrl()).build();
...

These lines have been added to the LambdaWithBucketConstruct class in the cdk-simple-construct project. We want the Lambda function we're creating in the current stack to be located behind an HTTP Gateway, acting as its backend. This might have some advantages. So we need to create an integration for our Lambda function. The notion of integration, as defined by AWS, means providing a backend for an API endpoint. In the case of the HTTP Gateway, one or more backends should be provided for each of the API Gateway's endpoints. The integrations have their own requests and responses, distinct from those of the API itself. There are two integration types:
Lambda integrations, where the backend is a Lambda function;
HTTP integrations, where the backend might be any deployed web application.
In our example, we're using Lambda integration, of course. There are two types of Lambda integrations as well:
Lambda proxy integration, where the definition of the integration's request and response, as well as their mapping to/from the original ones, isn't required as it is provided automatically;
Lambda non-proxy integration, where we need to explicitly specify how the incoming request data is mapped to the integration request and how the resulting integration response data is mapped to the method response.
For simplicity's sake, we're using the first type in our project. This is what the statement .defaultIntegration(...)
above is doing. Once the integration is created, we need to display the URL of the newly created API Gateway, for which our Lambda function is the backend. This way, in addition to being able to directly invoke our Lambda function, as we did previously, we'll be able to do it through the API Gateway. And in a project with several dozens of REST endpoints, it's very important to have a single point of contact where security policies, logging, auditing, and other cross-cutting concerns can be applied. The API Gateway is ideal as this single point of contact.

The project comes with a couple of unit and integration tests. For example, the class S3FileManagementTest performs unit testing using REST Assured, as shown below:

Java
@QuarkusTest
@TestMethodOrder(MethodOrderer.OrderAnnotation.class)
public class S3FileManagementTest {
    private static File readme = new File("./src/test/resources/README.md");

    @Test
    @Order(10)
    public void testUploadFile() {
        given()
            .contentType(MediaType.MULTIPART_FORM_DATA)
            .multiPart("file", readme)
            .multiPart("filename", "README.md")
            .multiPart("mimetype", MediaType.TEXT_PLAIN)
            .when()
            .post("/s3/upload")
            .then()
            .statusCode(HttpStatus.SC_CREATED);
    }

    @Test
    @Order(20)
    public void testListFiles() {
        given()
            .when().get("/s3/list")
            .then()
            .statusCode(200)
            .body("size()", equalTo(1))
            .body("[0].objectKey", equalTo("README.md"))
            .body("[0].size", greaterThan(0));
    }

    @Test
    @Order(30)
    public void testDownloadFile() throws IOException {
        given()
            .pathParam("objectKey", "README.md")
            .when().get("/s3/download/{objectKey}")
            .then()
            .statusCode(200)
            .body(equalTo(Files.readString(readme.toPath())));
    }
}

This unit test starts by uploading the file README.md to the S3 bucket defined for this purpose. Then it lists all the files present in the bucket and finishes by downloading the file just uploaded. Please notice the following lines in the application.properties file:

Plain Text
bucket.name=my-bucket-8701
%test.quarkus.s3.devservices.buckets=${bucket.name}

The first one defines the name of the destination bucket and the second one automatically creates it. This only works while executed via the Quarkus mock server. While this unit test is executed in the Maven test phase, against a localstack instance run by testcontainers and automatically managed by Quarkus, the integration one, S3FileManagementIT, is executed against the real AWS infrastructure, once our CDK application is deployed. The integration tests use a different paradigm: instead of REST Assured, which is very practical for unit tests, they take advantage of the Eclipse MP REST Client specification, implemented by Quarkus, as shown in the following snippet:

Java
@QuarkusTest
@TestMethodOrder(MethodOrderer.OrderAnnotation.class)
public class S3FileManagementIT {
    private static File readme = new File("./src/test/resources/README.md");

    @Inject
    @RestClient
    S3FileManagementClient s3FileManagementTestClient;

    @Inject
    @ConfigProperty(name = "base_uri/mp-rest/url")
    String baseURI;

    @Test
    @Order(40)
    public void testUploadFile() throws Exception {
        Response response = s3FileManagementTestClient.uploadFile(new FileMetadata(readme, "README.md", MediaType.TEXT_PLAIN));
        assertThat(response).isNotNull();
        assertThat(response.getStatusInfo().toEnum()).isEqualTo(Response.Status.CREATED);
    }
    ...
}

We inject S3FileManagementClient, which is a simple interface defining our API endpoints, and Quarkus does the rest: it generates the required client code. We just have to invoke endpoints on this interface, for example uploadFile(...), and that's all.
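To give an idea of what such an interface boils down to, here is a minimal, hedged sketch based solely on the endpoint exercised in the test above; the real interface in the cdk-quarkus-s3 repository may declare the list and download operations as well, so treat this as an illustration rather than the actual code.

Java
// Hypothetical sketch of the MP REST Client interface injected in the integration test.
@Path("/s3")
@RegisterRestClient(configKey = "base_uri")
public interface S3FileManagementClient {

    @POST
    @Path("upload")
    @Consumes(MediaType.MULTIPART_FORM_DATA)
    Response uploadFile(FileMetadata fileMetadata);
}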
Have a look at S3FileManagementClient, in the cdk-quarkus-s3 project, to see how everything works, and please notice how the annotation @RegisterRestClient defines a configuration key, named base_uri, used further in the deploy.sh script. Now, to test against the real AWS infrastructure, you need to execute the deploy.sh script, as follows:

Shell
$ cd cdk
$ ./deploy.sh cdk-quarkus/cdk-quarkus-api-gateway cdk-quarkus/cdk-quarkus-s3

This will compile and build the application, execute the unit tests, deploy the CloudFormation stack on AWS, and execute the integration tests against this infrastructure. At the end of the execution, you should see something like:

Plain Text
Outputs:
QuarkusApiGatewayStack.FunctionURLOutput = https://<generated>.lambda-url.eu-west-3.on.aws/
QuarkusApiGatewayStack.LambdaWithBucketConstructIdHttpApiGatewayUrlOutput = https://<generated>.execute-api.eu-west-3.amazonaws.com/
Stack ARN: arn:aws:cloudformation:eu-west-3:...:stack/QuarkusApiGatewayStack/<generated>

Now, in addition to the Lambda function URL that you've already seen in our previous examples, you can see the HTTP API Gateway URL, which you can now use for testing purposes instead of the Lambda one.

An E2E test case, exported from Postman (S3FileManagementPostmanIT), is provided as well. It is executed via the Docker image postman/newman:latest, running in testcontainers. Here is a snippet:

Java
@QuarkusTest
public class S3FileManagementPostmanIT {
    ...
    private static GenericContainer<?> postman = new GenericContainer<>("postman/newman")
        .withNetwork(Network.newNetwork())
        .withCopyFileToContainer(MountableFile.forClasspathResource("postman/AWS.postman_collection.json"),
            "/etc/newman/AWS.postman_collection.json")
        .withStartupCheckStrategy(new OneShotStartupCheckStrategy().withTimeout(Duration.ofSeconds(10)));

    @Test
    public void run() {
        String apiEndpoint = System.getenv("API_ENDPOINT");
        assertThat(apiEndpoint).isNotEmpty();
        postman.withCommand("run", "AWS.postman_collection.json",
            "--global-var base_uri=" + apiEndpoint.substring(8).replaceAll(".$", ""));
        postman.start();
        LOG.info(postman.getLogs());
        assertThat(postman.getCurrentContainerInfo().getState().getExitCodeLong()).isZero();
        postman.stop();
    }
}

Conclusion
As you can see, after starting the postman/newman:latest image with testcontainers, we run the E2E test case exported from Postman by passing it the --global-var option, so as to initialize the global variable named base_uri to the value of the REST API URL saved by the deploy.sh script in the API_ENDPOINT environment variable. Unfortunately, probably due to a bug, the postman/newman image doesn't recognize this option; consequently, until this issue is fixed, the test is disabled for now. You can, of course, import the file AWS.postman_collection.json into Postman and run it this way after having replaced the global variable {{base_uri}} with the current value of the API URL generated by AWS. Enjoy!
Cross-Origin Resource Sharing (CORS) is an essential security mechanism utilized by web browsers, allowing for regulated access to server resources from origins that differ in domain, protocol, or port. In the realm of APIs, especially when utilizing AWS API Gateway, configuring CORS is crucial to facilitate access for web applications originating from various domains while mitigating potential security risks. This article aims to provide a comprehensive guide on CORS and integrating AWS API Gateway through CloudFormation. It will emphasize the significance of CORS, the development of authorization including bearer tokens, and the advantages of selecting optional methods in place of standard GET requests. Why CORS Matters In the development of APIs intended for access across various domains, CORS is essential in mitigating unauthorized access. By delineating the specific domains permitted to interact with your API, you can protect your resources from Cross-Site Request Forgery (CSRF) attacks while allowing valid cross-origin requests. Benefits of CORS Security: CORS plays a crucial role in regulating which external domains can access your resources, thereby safeguarding your API against harmful cross-origin requests. Flexibility: CORS allows you to define varying levels of access (such as methods like GET, POST, DELETE, etc.) for different origins, offering adaptability based on your specific requirements. User experience: Implementing CORS enhances user experience by allowing users to seamlessly access resources from multiple domains without encountering access-related problems. Before we proceed with setting up CORS, we need to understand the need to use optional methods over GET. This comparison helps in quickly comparing the aspects of using GET versus optional methods (PUT, POST, OPTIONS) in API requests. Reason GET Optional Methods (POST, PUT, OPTIONS) Security GET requests are visible in the browser's address bar and can be cached, making it less secure for sensitive information. Optional methods like POST and PUT are not visible in the address bar and are not cached, providing more security for sensitive data. Flexibility GET requests are limited to sending data via the URL, which restricts the complexity and size of data that can be sent. Optional methods allow sending complex data structures in the request body, providing more flexibility. Idempotency and Safety GET is idempotent and considered safe, meaning it does not modify the state of the resource. POST and PUT are used for actions that modify data, and OPTIONS are used for checking available methods. CORS Preflight GET requests are not typically used for CORS preflight checks. OPTIONS requests are crucial for CORS preflight checks, ensuring that the actual request can be made. Comparison between POST and PUT methods, the purposes and behavior: Aspect POST PUT Purpose Used to create a new resource. Used to update an existing resource or create it if it doesn't exist. Idempotency Not idempotent; multiple identical requests may create multiple resources. Idempotent; multiple identical requests will not change the outcome beyond the initial change. Resource Location The server decides the resource's URI, typically returning it in the response. The client specifies the resource's URI. Data Handling Typically used when the client does not know the URI of the resource in advance. Typically used when the client knows the URI of the resource and wants to update it. 
Common Use Case Creating new records, such as submitting a form to create a new user. Updating existing records, such as editing user information. Caching Responses to POST requests are generally not cached. Responses to PUT requests can be cached as the request should result in the same outcome. Response Usually returns a status code of 201 (Created) with a location header pointing to the newly created resource. Usually returns a status code of 200 (OK) or 204 (No Content) if the update is successful. Setting Up CORS in AWS API Gateway Using CloudFormation Configuring CORS in AWS API Gateway can be accomplished manually via the AWS Management Console; however, automating this process with CloudFormation enhances both scalability and consistency. Below is a detailed step-by-step guide: 1. Define the API Gateway in CloudFormation Start by defining the API Gateway in your CloudFormation template: YAML Resources: MyApi: Type: AWS::ApiGateway::RestApi Properties: Name: MyApi 2. Create Resources and Methods Define the resources and methods for your API. For example, create a resource for /items and a GET method: YAML ItemsResource: Type: AWS::ApiGateway::Resource Properties: ParentId: !GetAtt MyApi.RootResourceId PathPart: items RestApiId: !Ref MyApi GetItemsMethod: Type: AWS::ApiGateway::Method Properties: AuthorizationType: NONE HttpMethod: GET ResourceId: !Ref ItemsResource RestApiId: !Ref MyApi Integration: Type: MOCK IntegrationResponses: - StatusCode: 200 MethodResponses: - StatusCode: 200 3. Configure CORS Next, configure CORS for your API method by specifying the necessary headers: YAML OptionsMethod: Type: AWS::ApiGateway::Method Properties: AuthorizationType: NONE HttpMethod: OPTIONS ResourceId: !Ref ItemsResource RestApiId: !Ref MyApi Integration: Type: MOCK RequestTemplates: application/json: '{"statusCode": 200}' IntegrationResponses: - StatusCode: 200 SelectionPattern: '2..' ResponseParameters: method.response.header.Access-Control-Allow-Headers: "'Content-Type,X-Amz-Date,Authorization,X-Api-Key,X-Amz-Security-Token'" method.response.header.Access-Control-Allow-Methods: "'*'" method.response.header.Access-Control-Allow-Origin: "'*'" MethodResponses: - StatusCode: 200 ResponseModels: { "application/json": "Empty" } ResponseParameters: method.response.header.Access-Control-Allow-Headers: false method.response.header.Access-Control-Allow-Methods: false method.response.header.Access-Control-Allow-Origin: false Incorporating Authorization Implementing authorization within your API methods guarantees that access to specific resources is restricted to authenticated and authorized users. The AWS API Gateway offers various authorization options, including AWS Lambda authorizers, Cognito User Pools, and IAM roles. YAML MyAuthorizer: Type: AWS::ApiGateway::Authorizer Properties: Name: MyLambdaAuthorizer RestApiId: !Ref MyApi Type: TOKEN AuthorizerUri: arn:aws:apigateway:<region>:lambda:path/2015-03-31/functions/<lambda_arn>/invocations GetItemsMethodWithAuth: Type: AWS::ApiGateway::Method Properties: AuthorizationType: CUSTOM AuthorizerId: !Ref MyAuthorizer HttpMethod: GET ResourceId: !Ref ItemsResource RestApiId: !Ref MyApi Integration: Type: AWS_PROXY IntegrationHttpMethod: POST Uri: !Sub arn:aws:apigateway:${AWS::Region}:lambda:path/2015-03-31/functions/${MyFunction.Arn}/invocations MethodResponses: - StatusCode: 200 After implementation, here's how the API looks in AWS: Integration request: API Gateway Documentation can be found here: Amazon API. 
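Once the API is deployed to a stage, a quick way to sanity-check the CORS setup described above is to issue the same kind of preflight OPTIONS request a browser would send and inspect the returned headers. A hedged example using curl with placeholder values for the API ID, region, stage, and resource path:

Shell
# Placeholders: substitute your API ID, region, stage, and resource path.
curl -i -X OPTIONS \
  -H "Origin: https://example.com" \
  -H "Access-Control-Request-Method: GET" \
  https://<api-id>.execute-api.<region>.amazonaws.com/<stage>/items

A correctly configured OPTIONS method should answer with the Access-Control-Allow-Origin, Access-Control-Allow-Methods, and Access-Control-Allow-Headers headers declared in the MethodResponses and IntegrationResponses of the template above.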
Conclusion Establishing CORS and integrating AWS API Gateway through CloudFormation offers an efficient and reproducible method for managing API access. By meticulously setting up CORS, you guarantee that your APIs remain secure and are accessible solely to permitted origins. Incorporating authorization adds a layer of security by limiting access to only those users who are authorized. Moreover, evaluating the advantages of utilizing optional methods instead of GET requests ensures that your API maintains both security and the flexibility necessary for managing intricate operations. The implementation of these configurations not only bolsters the security and performance of your API but also enhances the overall experience for end-users, facilitating seamless cross-origin interactions and the appropriate management of sensitive information.
When it comes to secure web applications, we must keep sensitive data secure during communication. Sadly, while HTTPS encrypts data as it moves from point A to point B, the information is still exposed in a browser's network tab and can leak out this way. In this post, I will give you an example of implementing end-to-end encryption of API calls in your secure web app built with Angular.

Encryption Workflow
Weak protections have traditionally been obfuscation with Base64 encoding or custom schemes. Public key cryptography (PKC) is considered a modern, more secure solution. It uses a key pair: a public key for encryption and a private key for decryption. The public key is distributed and the private key is kept on the server. The encryption workflow is as follows:
Client-side encryption: Your Angular application encrypts the data with the server's public key before transmitting it to the API.
Secure transmission: The network then transmits the encrypted data over an HTTPS connection.
Server-side decryption: When the server receives the encrypted data, it decrypts it using its private key, recovering the original information.
Server-side encryption (Optional): Before sending the response back to the client, the server can also encrypt the data for additional security.
Client-side decryption: Finally, the client decrypts the encrypted response from the server using the key stored in the web application.

Implementation Into Angular App
Here is a detailed strategy to implement end-to-end encryption in your Angular financial app.

1. Library Selection and Installation
Choose a well-maintained library like CryptoJS or Sodium for JavaScript, and rely on it rather than trying to implement the cryptography yourself. These libraries provide encryption and decryption APIs backed by multiple algorithms.

PowerShell
npm install crypto-js

2. Key Management
Implement a feature to keep the server's public key safe. There are a couple of common approaches to this:
Server-side storage: One relatively simple solution is to store the public key on your backend server, and then retrieve it during application initialization in Angular via an HTTPS request.
Key Management Service (Optional): For more advanced setups, consider a KMS dedicated to secret management, but this requires an extra layer.

3. Create Client-Side Encryption and Decryption Service
Create a common crypto service to handle the application data encryption and decryption.

TypeScript
// src/app/services/appcrypto.service.ts
import { Injectable } from '@angular/core';
import * as CryptoJS from 'crypto-js';

@Injectable({
  providedIn: 'root'
})
export class AppCryptoService {
  private appSerSecretKey: string = 'server-public-key';

  encrypt(data: any): string {
    return CryptoJS.AES.encrypt(JSON.stringify(data), this.appSerSecretKey).toString();
  }

  decrypt(data: string): any {
    const bytes = CryptoJS.AES.decrypt(data, this.appSerSecretKey);
    return JSON.parse(bytes.toString(CryptoJS.enc.Utf8));
  }
}

4. Application API Call Service
Create a common service to handle the web application's common HTTP methods.
TypeScript
// src/app/services/appcommon.service.ts
import { Injectable, Inject } from '@angular/core';
import { Observable } from 'rxjs';
import { map } from 'rxjs/operators';
import { HttpClient, HttpHeaders } from '@angular/common/http';
import { AppCryptoService } from '../services/appcrypto.service';

@Injectable({
  providedIn: 'root'
})
export class AppCommonService {
  constructor(private appCryptoService: AppCryptoService,
              private httpClient: HttpClient) {}

  postData(url: string, data: any): Observable<any> {
    const encryptedData = this.appCryptoService.encrypt(data);
    return this.httpClient.post(url, encryptedData);
  }
}

5. Server-Side Decryption
On the server side, you have to decrypt all incoming request payloads and encrypt response payloads. Here's an example using Node.js and Express:

JavaScript
// server.js
const express = require('express');
const bodyParser = require('body-parser');
const crypto = require('crypto-js');

const app = express();
const secretKey = 'app-secret-key';

app.use(bodyParser.json());

// Using middleware to decrypt the incoming request bodies
app.use((req, res, next) => {
  if (req.body && typeof req.body === 'string') {
    const bytes = crypto.AES.decrypt(req.body, secretKey);
    req.body = JSON.parse(bytes.toString(crypto.enc.Utf8));
  }
  next();
});

// Test post route call
app.post('/api/data', (req, res) => {
  console.log('Decrypted data:', req.body);
  // response object
  const responseObj = { message: 'Successfully received' };
  // Encrypt the response body (Optional)
  const encryptedResBody = crypto.AES.encrypt(JSON.stringify(responseObj), secretKey).toString();
  res.send(encryptedResBody);
});

const PORT = process.env.PORT || 3000;
app.listen(PORT, () => {
  console.log(`Server running on port ${PORT}`);
});

6. Server-Side Encryption (Optional)
The server's response can also be sent back to the client in an encrypted form for security. This does add a layer of security, though with the caveat that it may impact system performance.

7. Client-Side Decryption (Optional)
When the response from the server is encrypted, decrypt it on the client side.

Conclusion
This example keeps it simple by using AES encryption. You may want additional encryption mechanisms, depending on your security needs. Don't forget to manage errors and exceptions properly. This is a somewhat crude implementation of encrypting API calls in your Angular web app.
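To round out step 7 above, here is a minimal, hypothetical variant of the postData() method in AppCommonService that also decrypts an encrypted response body on the client side. It assumes the server replies with the AES-encrypted string produced in the Express example and reuses the AppCryptoService from step 3 and the map operator already imported in the service.

TypeScript
// Hypothetical addition to AppCommonService (step 7): decrypt an encrypted response body.
postDataAndDecrypt(url: string, data: any): Observable<any> {
  const encryptedData = this.appCryptoService.encrypt(data);
  return this.httpClient
    .post(url, encryptedData, { responseType: 'text' }) // server replies with an encrypted string
    .pipe(map(encryptedResponse => this.appCryptoService.decrypt(encryptedResponse)));
}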
Previous Articles on Snowflake Integrating Snowflake with Trino Previous Articles on CockroachDB CDC Using CockroachDB CDC with Apache Pulsar Using CockroachDB CDC with Azure Event Hubs SaaS Galore: Integrating CockroachDB with Confluent Kafka, FiveTran, and Snowflake Using CockroachDB CDC with Confluent Cloud Kafka and Schema Registry CockroachDB CDC using Minio as cloud storage sink CockroachDB CDC using Hadoop Ozone S3 Gateway as cloud storage sink Motivation I work with financial services clients, and it's common to encounter a need for streaming changes in the operational data store into a data warehouse or a data lake. A former colleague recently reached out for advice on the fastest and most efficient way to load trade data into Snowflake. I've come up with at least three methods, which I will explore in a follow-up series of articles. However, I've decided to first explore Redpanda Connect, a solution that has recently caught my attention. This is by no means a conclusive guide on how changefeed data must be loaded into Snowflake; we're merely exploring the possibilities and discussing the pros and cons in later articles. CockroachDB changefeeds are an enterprise feature and require a license. In this tutorial, I'm using a free-to-start version of CockroachDB Serverless, which has enterprise changefeeds enabled. High-Level Steps Deploy a CockroachDB cluster with enterprise changefeeds Deploy Redpanda Connect Deploy Snowflake Verify Conclusion Step-By-Step Instructions Deploy a CockroachDB Cluster With Enterprise Changefeeds Start an instance of CockroachDB or use the managed service. To enable CDC we need to execute the following commands: SET CLUSTER SETTING cluster.organization = '<organization name>'; SET CLUSTER SETTING enterprise.license = '<secret>'; SET CLUSTER SETTING kv.rangefeed.enabled = true; I am using CockroachDB Serverless and the above steps are not necessary. You may confirm whether the changefeeds are indeed enabled using the following command: SHOW CLUSTER SETTING kv.rangefeed.enabled; If the value is false, change it to true. Generate sample data: CREATE TABLE office_dogs ( id INT PRIMARY KEY, name STRING); INSERT INTO office_dogs VALUES (1, 'Petee'), (2, 'Carl'); UPDATE office_dogs SET name = 'Petee H' WHERE id = 1; We've populated the table and then updated a record. Let's add more data to make it interesting: INSERT INTO office_dogs SELECT generate_series(3, 10000), md5(random()::string); SELECT * FROM office_dogs LIMIT 5; id,name 1,Petee H 2,Carl 3,6e19280ae649efffa7a58584c7f46032 4,5e4e897f008bb752c8edfa64a3aed356 5,abc0d898318d27f23a43060f89d62e34 SELECT COUNT(*) FROM office_dogs; Deploy Redpanda Connect I'm running Redpanda Connect in a Docker Compose file. docker compose -f compose-redpanda.yaml up -d The contents of the file are: services: redpanda: container_name: redpanda-connect hostname: redpanda-connect image: docker.redpanda.com/redpandadata/connect volumes: - ./redpanda/connect.yaml:/connect.yaml - /Users/aervits/.ssh/rsa_key.pem:/rsa_key.pem I will be using the connect.yaml file as the foundation to connect all the components in this article. For more detailed information, you can refer to the documentation provided by Redpanda. 
The most basic configuration looks like so: input: stdin: {} pipeline: processors: [] output: stdout: {} Since I'm using CockroachDB input, mine looks like this: input: # CockroachDB Input label: "" cockroachdb_changefeed: dsn: postgresql://<user>:<password>@<cockroachdb-cluster>:<port>/<database>?sslmode=verify-full tls: skip_cert_verify: true #enable_renegotiation: false #root_cas: "" #root_cas_file: "" client_certs: [] tables: [table_for_cdc] # No default (required) cursor_cache: "" # No default (optional) auto_replay_nacks: true pipeline: processors: [] output: stdout: {} Leave the pipeline and output as default. For reference, I'm including the repo with my source code where you can reference the values. If you have been following along, you may have noticed that I haven't started a changefeed job in CockroachDB. The cockroachdb_changefeed input directly subscribes to the table, which can be observed by examining the logs using the command docker logs redpanda-connect --follow. If you look at the connect.yaml file, the output is sent to stdout: {"primary_key":"[9998]","row":"{\"after\": {\"id\": 9998, \"name\": \"0794a9d1c99e8e47ee4515be6e0d736f\"}","table":"office_dogs"} {"primary_key":"[9999]","row":"{\"after\": {\"id\": 9999, \"name\": \"c85a6b38154f7e3085d467d567141d45\"}","table":"office_dogs"} {"primary_key":"[10000]","row":"{\"after\": {\"id\": 10000, \"name\": \"aae9e0849fff8f47e0371a4c06fb255b\"}","table":"office_dogs"} The next step is to configure Snowflake. We are not going to look at the available processors today. Deploy Snowflake I'm using a Snowflake trial account. You get a generous credit which should be sufficient to complete this tutorial. We need to create a database and a table where we will output the changefeed data. CREATE OR REPLACE DATABASE FROM_COCKROACH; CREATE OR REPLACE TABLE OFFICE_DOGS (RECORD variant); We also need to create a user with key-pair authentication as we're going to be using the Snowpipe service. openssl genrsa 2048 | openssl pkcs8 -topk8 -v2 des3 -inform PEM -out rsa_key.p8 We must use an encrypted key as Redpanda doesn't support unencrypted versions. Generate a public key: openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub Lastly, generate a pem file from the private key: openssl pkcs8 -in rsa_key.p8 -out rsa_key.pem In Snowflake, alter the user to use the key pair generated in the previous step. ALTER USER username SET rsa_public_key='MIIB...'; We can now populate the connect.yaml file with the required information for the snowflake_put output. This output type is for commercial use and requires a license, but since we're using it for demo purposes, we are able to proceed. 
output: # Snowflake Output label: "" snowflake_put: account: <snowflake-account> user: <user> private_key_file: rsa_key.pem role: ACCOUNTADMIN database: <database> warehouse: <warehouse> schema: <schema> stage: "@%implicit_table_stage_name" path: "path" upload_parallel_threads: 4 compression: NONE batching: count: 10 period: 3s processors: - archive: format: json_array max_in_flight: 1 If we restart the compose environment and tail the logs, we can see the following: level=info msg="Running main config from specified file" @service=benthos benthos_version=v4.32.1 path=/connect.yaml level=info msg="Listening for HTTP requests at: http://0.0.0.0:4195" @service=benthos level=info msg="Launching a Redpanda Connect instance, use CTRL+C to close" @service=benthos level=info msg="Output type snowflake_put is now active" @service=benthos label="" path=root.output level=info msg="Input type cockroachdb_changefeed is now active" @service=benthos label="" path=root.input Let's look at the implicit table stage and observe if anything has changed. list @%office_dogs | dogs/f2f3cf47-d6bc-46f4-88f2-c82519b67481.json | 1312 | 30f709e4962bae9d10b48565d22e9f32 | Wed, 14 Aug 2024 18:58:43 GMT | | dogs/f6adbf39-3955-4848-93c3-06f873a88078.json | 1312 | 28be7a619ef1e139599077e977ea130b | Wed, 14 Aug 2024 18:58:13 GMT | | dogs/f8705606-eb07-400a-9ffe-da6834fa1a30.json | 1296 | 5afbdce0e8929fc38a2eb5e0f12b96d6 | Wed, 14 Aug 2024 18:57:29 GMT | | dogs/f9e5c01a-7dda-4e76-840d-13b8a1e4946a.json | 1296 | 5480c01f1578f67afe2761c7619e9123 | Wed, 14 Aug 2024 18:57:32 GMT | | dogs/fad4efe7-3f3f-48bc-bdb4-9f0310abcf4d.json | 1312 | 5942c6e2dbaef5ee257d4a9b8e68827d | Wed, 14 Aug 2024 18:58:04 GMT | The files are ready to be copied into a table. Let's create a pipe: CREATE OR REPLACE PIPE FROM_COCKROACH.PUBLIC.cockroach_pipe AUTO_INGEST = FALSE AS COPY INTO FROM_COCKROACH.PUBLIC.OFFICE_DOGS FROM (SELECT * FROM @%office_dogs) FILE_FORMAT = (TYPE = JSON COMPRESSION = AUTO STRIP_OUTER_ARRAY = TRUE); The last remaining step is to refresh the pipe. ALTER PIPE cockroach_pipe REFRESH; | dogs/ff0871b1-6f49-43a4-a929-958d07f74046.json | SENT | | dogs/ff131d8d-3781-4cf6-8700-edd50dbb87de.json | SENT | | dogs/ff216da1-4f9d-4b37-9776-bcd559dd4a6f.json | SENT | | dogs/ff221430-4c3a-46be-bbc2-d335cc6cc9e3.json | SENT | | dogs/ffbd7d45-5084-4e36-8907-61874ac652b4.json | SENT | | dogs/fffb5fa6-23cc-4450-934a-29ccf01c67b9.json | SENT | Let's query the table in Snowflake: SELECT * FROM OFFICE_DOGS LIMIT 5; | { | | "primary_key": "[5241]", | | "row": "{\"after\": {\"id\": 5241, \"name\": \"5e0360a0d10d849afbbfa319a50bccf2\"}", | | "table": "office_dogs" | | } | | { | | "primary_key": "[5242]", | | "row": "{\"after\": {\"id\": 5242, \"name\": \"62be250249afe74bfbc5dd356e7b0ad9\"}", | | "table": "office_dogs" | | } | | { | | "primary_key": "[5243]", | | "row": "{\"after\": {\"id\": 5243, \"name\": \"7f286800a8a03e74938d09fdba52f869\"}", | | "table": "office_dogs" | | } | | { | | "primary_key": "[5244]", | | "row": "{\"after\": {\"id\": 5244, \"name\": \"16a330b8f09bcd314f9760ffe26d0ae2\"}", | | "table": "office_dogs" | | } We expect 10000 rows: SELECT COUNT(*) FROM OFFICE_DOGS; +----------+ | COUNT(*) | |----------| | 10000 | +----------+ The data is in JSON format. Let's create a view and flatten the data out. 
CREATE VIEW v_office_dogs AS SELECT PARSE_JSON(record:row):after:id::INTEGER AS id, PARSE_JSON(record:row):after:name::STRING AS name FROM OFFICE_DOGS; Query the view: SELECT * FROM v_office_dogs WHERE id < 6; +----+----------------------------------+ | ID | NAME | |----+----------------------------------| | 1 | Petee H | | 2 | Carl | | 3 | 6e19280ae649efffa7a58584c7f46032 | | 4 | 5e4e897f008bb752c8edfa64a3aed356 | | 5 | abc0d898318d27f23a43060f89d62e34 | +----+----------------------------------+ Verify Let's make things a bit more interesting and delete data in CockroachDB. DELETE FROM office_dogs WHERE name = 'Carl'; DELETE FROM office_dogs WHERE id = 1; In Snowflake, let's refresh the pipe as of a few minutes ago: ALTER PIPE cockroach_pipe REFRESH MODIFIED_AFTER='2024-08-14T12:10:00-07:00'; Notice there are a couple of files. +------------------------------------------------+--------+ | File | Status | |------------------------------------------------+--------| | dogs/2a4ee400-6b37-4513-97cb-097764a340bc.json | SENT | | dogs/8f5b5b69-8a00-4dbf-979a-60c3814d96b4.json | SENT | +------------------------------------------------+--------+ I must caution that if you run the REFRESH manually, you may cause duplicates in your Snowflake data. We will look at better approaches in a future article. Let's look at the row count: +----------+ | COUNT(*) | |----------| | 10002 | +----------+ The removal process didn't properly update in Snowflake as anticipated; it recognized the deleted records but failed to mirror the state in CockroachDB. We need to incorporate additional logic to achieve this. This will be a task for another time. Lastly, I would like to note that using Redpanda Connect as a compose file is optional. You have the option to run the Docker container by executing the following command: docker run --rm -it -v ./redpanda/connect.yaml:/connect.yaml -v ./snowflake/rsa_key.pem:/rsa_key.pem docker.redpanda.com/redpandadata/connect run Conclusion Today, we explored Redpanda Connect as a means to deliver streaming changefeeds into Snowflake. We've only just begun to delve into this topic, and future articles will build upon the foundations laid today.
Have you ever come across a situation in automated API testing where you were not able to identify the cause of a test failure, and after debugging for multiple hours, you noticed that the data type of a value supplied in the API response had changed? Did you then realize that this was the core reason for the test failure? This type of scenario can generally happen when you have third-party APIs integrated into your application. A real-time example of such a scenario would be integrating with bank APIs for making a payment from your e-commerce application, or integrating with a third-party API that provides registration and login functionality using two-factor authentication. Even though you are provided with detailed documentation of the APIs and their functionality, the API response from the third-party application may still change: since these providers cater to multiple clients, they might update their API, maybe for a bug fix or a new feature requirement, without you being aware of it. The data type of a field received in the response may be changed from String to integer or vice versa, or a new field/object may be added to the response. Thanks to JSON Schema validation, these changes can now be caught easily and can save a lot of your effort and time in debugging and finding the issue that leads to the failure of your system. Before we begin discussing JSON Schema validation, let's first understand what JSON is.

What Is JSON?
JSON stands for JavaScript Object Notation. It was originally specified by Douglas Crockford. It is a lightweight format for storing and transporting data and is often used when data is sent from the server to webpages. It is self-describing and easy to understand. The following are important syntax rules for JSON:
Data is in name/value pairs
Data is separated by commas
Curly braces hold objects
Square brackets hold arrays
To understand further, let's take the following JSON file as an example:

JSON
{
  "page": 1,
  "total_pages": 2,
  "employee_data": [{
      "id": 5,
      "first_name": "Michael",
      "last_name": "Doe",
      "designation": "QA",
      "location": "Remote"
    },
    {
      "id": 6,
      "first_name": "Johnny",
      "last_name": "Ford",
      "designation": "QA",
      "location": "NY,US"
    }
  ],
  "company": {
    "name": "QA Inc",
    "Address": "New Jersey, US"
  }
}

Understanding the JSON File
The above-mentioned file begins with a curly brace {, which means the file holds a JSON object. Inside the JSON object, data is stored in multiple data types as follows:

1. The root level itself is a JSON Object, as it starts with a curly bracket and has data stored in key/value pairs

JSON
{
  "page": 1,
  "total_pages": 2
}

2. JSON Array
A JSON Array stores data inside the JSON file in a block with square brackets []. If we take the example of the JSON file mentioned above, the employee_data JSON array has 2 JSON Objects inside it.

JSON
"employee_data": [{
    "id": 5,
    "first_name": "Michael",
    "last_name": "Doe",
    "designation": "QA",
    "location": "Remote"
  },
  {
    "id": 6,
    "first_name": "Johnny",
    "last_name": "Ford",
    "designation": "QA",
    "location": "NY,US"
  }
]

3. JSON Object
As mentioned earlier, data stored within curly braces forms a JSON Object holding multiple key/value pairs. The company JSON Object holds the data for company details:

JSON
"company": {
  "name": "QA Inc",
  "Address": "New Jersey, US"
}

It can also be referred to as the company key holding the company details record in its value.

What Is JSON Schema?
JSON Schema is a specification for JSON-based format for defining the structure of JSON data. JSON Schema helps us describe the existing data format and provides clear, human and machine-readable documentation. As JSON Schema provides complete structural validation, it helps in automated tests and also validating the client-submitted data for verification. How Do I Generate JSON Schema for the JSON Request of an API? Consider the following example of Post Response from a restful-booker website where the following data is returned in response once the user hits the post API for creating a new booking: JSON { "bookingid": 1, "booking": { "firstname": "Jim", "lastname": "Brown", "totalprice": 111, "depositpaid": true, "bookingdates": { "checkin": "2018-01-01", "checkout": "2019-01-01" }, "additionalneeds": "Breakfast" } } To generate the JSON Schema, we would be using an online JSON schema generator tool from extendsclass.com. Using this tool is very simple, you just need to copy and paste the JSON data for which you need to generate the JSON schema and click on the Generate Schema from JSON button on the web page and it will provide you with the JSON schema for the respective JSON data provided. Here is the JSON Schema generated for the above JSON data for creating a new booking: JSON { "definitions": {}, "$schema": "http://json-schema.org/draft-07/schema#", "$id": "https://example.com/object1661496173.json", "title": "Root", "type": "object", "required": [ "bookingid", "booking" ], "properties": { "bookingid": { "$id": "#root/bookingid", "title": "Bookingid", "type": "integer", "examples": [ 1 ], "default": 0 }, "booking": { "$id": "#root/booking", "title": "Booking", "type": "object", "required": [ "firstname", "lastname", "totalprice", "depositpaid", "bookingdates", "additionalneeds" ], "properties": { "firstname": { "$id": "#root/booking/firstname", "title": "Firstname", "type": "string", "default": "", "examples": [ "Jim" ], "pattern": "^.*$" }, "lastname": { "$id": "#root/booking/lastname", "title": "Lastname", "type": "string", "default": "", "examples": [ "Brown" ], "pattern": "^.*$" }, "totalprice": { "$id": "#root/booking/totalprice", "title": "Totalprice", "type": "integer", "examples": [ 111 ], "default": 0 }, "depositpaid": { "$id": "#root/booking/depositpaid", "title": "Depositpaid", "type": "boolean", "examples": [ true ], "default": true }, "bookingdates": { "$id": "#root/booking/bookingdates", "title": "Bookingdates", "type": "object", "required": [ "checkin", "checkout" ], "properties": { "checkin": { "$id": "#root/booking/bookingdates/checkin", "title": "Checkin", "type": "string", "default": "", "examples": [ "2018-01-01" ], "pattern": "^.*$" }, "checkout": { "$id": "#root/booking/bookingdates/checkout", "title": "Checkout", "type": "string", "default": "", "examples": [ "2019-01-01" ], "pattern": "^.*$" } } } , "additionalneeds": { "$id": "#root/booking/additionalneeds", "title": "Additionalneeds", "type": "string", "default": "", "examples": [ "Breakfast" ], "pattern": "^.*$" } } } } } Understanding the JSON Schema If you check the JSON data, the following two fields are the main records: bookingid Object of bookingdata The following block generated in JSON Schema talks about these 2 fields that in root, these two fields are required as an Object type. JSON "title": "Root", "type": "object", "required": [ "bookingid", "booking" ], Next, let's talk about the properties block inside the JSON Schema. 
The following block states that bookingid should be in the root object and its type should be integer . Hence, in response, it is expected that the value in this field should be an integer only. So, in case this type is changed to any other data type like String ,Object ,longor float, schema validation will fail and we would be able to identify the issue in the schema right away. JSON "properties": { "bookingid": { "$id": "#root/bookingid", "title": "Bookingid", "type": "integer", "examples": [ 1 ], "default": 0 }, "booking": { "$id": "#root/booking", "title": "Booking", "type": "object", "required": [ "firstname", "lastname", "totalprice", "depositpaid", "bookingdates", "additionalneeds" ], "properties": { "firstname": { "$id": "#root/booking/firstname", "title": "Firstname", "type": "string", "default": "", "examples": [ "Jim" ], "pattern": "^.*$" }, "lastname": { "$id": "#root/booking/lastname", "title": "Lastname", "type": "string", "default": "", "examples": [ "Brown" ], "pattern": "^.*$" }, "totalprice": { "$id": "#root/booking/totalprice", "title": "Totalprice", "type": "integer", "examples": [ 111 ], "default": 0 }, "depositpaid": { "$id": "#root/booking/depositpaid", "title": "Depositpaid", "type": "boolean", "examples": [ true ], "default": true }, "bookingdates": { "$id": "#root/booking/bookingdates", "title": "Bookingdates", "type": "object", "required": [ "checkin", "checkout" ], "properties": { "checkin": { "$id": "#root/booking/bookingdates/checkin", "title": "Checkin", "type": "string", "default": "", "examples": [ "2018-01-01" ], "pattern": "^.*$" }, "checkout": { "$id": "#root/booking/bookingdates/checkout", "title": "Checkout", "type": "string", "default": "", "examples": [ "2019-01-01" ], "pattern": "^.*$" } } } , "additionalneeds": { "$id": "#root/booking/additionalneeds", "title": "Additionalneeds", "type": "string", "default": "", "examples": [ "Breakfast" ], "pattern": "^.*$" } } } } } Likewise, you can notice the data types and required field values mentioned for the other fields in the JSON Schema. Performing the JSON Schema Validation Using Rest-Assured Framework What Is Rest-Assured? REST-Assured is a Java library that provides a domain-specific language (DSL) for writing powerful, maintainable tests for RESTful APIs. One thing I really like about rest assured is its BDD style of writing tests and one can read the tests very easily in a human-readable language. Getting Started The project is created using Maven. Once the project is created we need to add the dependency for rest-assured in pom.xml file. TestNG is used as a test runner. The following dependencies are mandatorily required to be added in pom.xml rest-assured dependency is required for running the API tests and json-schema-validator dependency is required for validating the JSON Schema. Validating the JSON Schema Step 1: First of all, we need to generate the JSON Schema for the response JSON data which we need to validate. As we are using the restful-booker create booking API, we would be copy-pasting the JSON response and generating the JSON schema using the JSON Schema Validator. In the screenshot below, on the left-hand side we have the JSON response. On clicking the Generate Schema from JSON button , JSON Schema would be generated on the right-hand section. The following is the response received from create booking API in restful-booker. 
JSON { "bookingid": 1, "booking": { "firstname": "Jim", "lastname": "Brown", "totalprice": 111, "depositpaid": true, "bookingdates": { "checkin": "2018-01-01", "checkout": "2019-01-01" }, "additionalneeds": "Breakfast" } } The following is the JSON Schema for the above JSON response. JSON { "definitions": {}, "$schema": "http://json-schema.org/draft-07/schema#", "$id": "https://example.com/object1661586892.json", "title": "Root", "type": "object", "required": [ "bookingid", "booking" ], "properties": { "bookingid": { "$id": "#root/bookingid", "title": "Bookingid", "type": "integer", "examples": [ 1 ], "default": 0 }, "booking": { "$id": "#root/booking", "title": "Booking", "type": "object", "required": [ "firstname", "lastname", "totalprice", "depositpaid", "bookingdates", "additionalneeds" ], "properties": { "firstname": { "$id": "#root/booking/firstname", "title": "Firstname", "type": "string", "default": "", "examples": [ "Jim" ], "pattern": "^.*$" }, "lastname": { "$id": "#root/booking/lastname", "title": "Lastname", "type": "string", "default": "", "examples": [ "Brown" ], "pattern": "^.*$" }, "totalprice": { "$id": "#root/booking/totalprice", "title": "Totalprice", "type": "integer", "examples": [ 111 ], "default": 0 }, "depositpaid": { "$id": "#root/booking/depositpaid", "title": "Depositpaid", "type": "boolean", "examples": [ true ], "default": true }, "bookingdates": { "$id": "#root/booking/bookingdates", "title": "Bookingdates", "type": "object", "required": [ "checkin", "checkout" ], "properties": { "checkin": { "$id": "#root/booking/bookingdates/checkin", "title": "Checkin", "type": "string", "default": "", "examples": [ "2018-01-01" ], "pattern": "^.*$" }, "checkout": { "$id": "#root/booking/bookingdates/checkout", "title": "Checkout", "type": "string", "default": "", "examples": [ "2019-01-01" ], "pattern": "^.*$" } } } , "additionalneeds": { "$id": "#root/booking/additionalneeds", "title": "Additionalneeds", "type": "string", "default": "", "examples": [ "Breakfast" ], "pattern": "^.*$" } } } } } Project Folder Structure We can copy the JSON Schema create a new JSON file and put it in the src\test\resources folder inside the project. Writing JSON Schema Validation Test The following test script will allow us to test the JSON Schema validation using the Rest-Assured framework. Java @Test public void testCreateBookingJsonSchema() { InputStream createBookingJsonSchema = getClass().getClassLoader() .getResourceAsStream("createbookingjsonschema.json"); BookingData newBooking = getBookingData(); bookingId = given().body(newBooking) .when() .post("/booking") .then() .statusCode(200) .and() .assertThat() .body(JsonSchemaValidator.matchesJsonSchema(createBookingJsonSchema)) .and() .extract() .path("bookingid"); } It is pretty simple to write automation tests using Rest-Assured. We need to write the assertion for validating the JSON Schema inside the body() method after the assertThat() method. But before we move to the assertion, we need to read the JSON file we posted inside the src\test\resources folder. To do that we would be using the InputStream class. The following line of code will help us in reading the JSON Schema file createbookingjsonschema.json Java InputStream createBookingJsonSchema = getClass ().getClassLoader () .getResourceAsStream ("createbookingjsonschema.json"); Next, we need to hit the post API and check the JSON Schema in response by using JsonSchemaValidator.matchesJsonSchema() method and pass the createBookingJsonSchema InputStream instance in it. 
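As a reminder of the Getting Started section above, the matchesJsonSchema() method used in this test comes from REST Assured's json-schema-validator artifact. A hedged sketch of the corresponding pom.xml entries (replace the version placeholders with the latest releases; the dependency for the Data Faker library used by the data builder shown next is omitted here):

XML
<dependency>
    <groupId>io.rest-assured</groupId>
    <artifactId>rest-assured</artifactId>
    <version>[latest version]</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>io.rest-assured</groupId>
    <artifactId>json-schema-validator</artifactId>
    <version>[latest version]</version>
    <scope>test</scope>
</dependency>
<dependency>
    <groupId>org.testng</groupId>
    <artifactId>testng</artifactId>
    <version>[latest version]</version>
    <scope>test</scope>
</dependency>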
The data required in the post-request payload will be generated using the Builder pattern + Data Faker. The following is the implementation of the getBookingData() method that is available in the BookingDataBuilder class.

Java
public class BookingDataBuilder {
    private static final Faker FAKER = new Faker();

    public static BookingData getBookingData() {
        SimpleDateFormat formatter = new SimpleDateFormat("yyyy-MM-dd");
        return BookingData.builder()
            .firstname(FAKER.name().firstName())
            .lastname(FAKER.name().lastName())
            .totalprice(FAKER.number().numberBetween(1, 2000))
            .depositpaid(true)
            .bookingdates(BookingDates.builder()
                .checkin(formatter.format(FAKER.date().past(20, TimeUnit.DAYS)))
                .checkout(formatter.format(FAKER.date().future(5, TimeUnit.DAYS)))
                .build())
            .additionalneeds("Breakfast")
            .build();
    }
}

Once the payload data is generated, it is very easy to write the JSON Schema validation test. The following lines of code will help us in validating the JSON Schema in the response. Interpreting the lines of code given below, we are sending a POST request with the required body, after which we check that the status code returned in the response is 200 and that the body matches the JSON Schema provided in the createBookingJsonSchema instance.

Java
given().body(newBooking)
    .when()
    .post("/booking")
    .then()
    .statusCode(200)
    .and()
    .assertThat()
    .body(JsonSchemaValidator.matchesJsonSchema(createBookingJsonSchema));

Running the Tests
It's time now to run the test and check if the schema validation happens correctly. Here is the testng.xml file:

XML
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE suite SYSTEM "http://testng.org/testng-1.0.dtd">
<suite name="Restful Booker Test Suite">
    <test name="Restful Booker JSON Schema Validation tests">
        <classes>
            <class name="com.restfulbooker.JsonSchemaValidationTest">
                <methods>
                    <include name="testCreateBookingJsonSchema"/>
                </methods>
            </class>
        </classes>
    </test>
</suite> <!-- Suite -->

Let's run the tests now and validate the JSON Schema. We would be running the tests using TestNG by right-clicking on the testng.xml file. The JSON Schema received in the response matches the JSON Schema provided in the src\test\resources folder, and the test passes.

Now, let's make some changes to the JSON Schema in the createbookingjsonschema.json file provided in the src\test\resources folder. In the bookingid field, a value of type integer is required; however, it has been updated to string just to check that the validation is actually working.

JSON
"properties": {
    "bookingid": {
        "$id": "#root/bookingid",
        "title": "Bookingid",
        "type": "string",
        "examples": [
            1
        ],
        "default": 0
    },

Let's run the test again by right-clicking on the testng.xml file. The following error log was generated and displayed in the console, which says that the value received in the response was an integer whereas the value expected was string:

JSON
error: instance type (integer) does not match any allowed primitive type (allowed: ["string"])
level: "error"
schema: {"loadingURI":"#","pointer":"/properties/bookingid"}
instance: {"pointer":"/bookingid"}
domain: "validation"
keyword: "type"
found: "integer"
expected: ["string"]

You see how easy it is to identify such schema-related errors, which could otherwise have taken a lot of your effort and time to find.

Conclusion
Running automated tests for checking JSON Schema validation could prove to be a fruitful exercise and help in detecting schema-level issues before they slip into production.
It is recommended to add these checks in the automated pipeline and run them as regression tests in the nightly build. Happy Testing!
If you are developing a RESTful API with Spring Boot, you want to make it as easy as possible for other developers to understand and use your API. Documentation is essential because it provides a reference for future updates and helps other developers integrate with your API. For a long time, the way to document REST APIs was to use Swagger, an open-source software framework that enables developers to design, build, document, and consume RESTful web services. In 2018, to address the issues of code invasiveness and dependency associated with traditional API documentation tools like Swagger, we developed smart-doc and open-sourced it to the community. In this article, we will explore how to use Smart-doc to generate documentation for a Spring Boot REST API.

What Is Smart-doc?
Smart-doc is an interface documentation generation tool for Java projects. It primarily analyzes and extracts comments from Java source code to produce API documentation. Smart-doc scans standard Java comments in the code, eliminating the need for specialized annotations like those used in Swagger, thus maintaining the simplicity and non-invasiveness of the code. It supports multiple formats for document output, including Markdown, HTML5, Postman Collection, OpenAPI 3.0, etc. This flexibility allows developers to choose the appropriate documentation format based on their needs. Additionally, Smart-doc can scan code to generate JMeter performance testing scripts. For more features, please refer to the official documentation.

Steps To Use Smart-doc for Documenting APIs
Step 1: Maven Project
Create a Maven project with the latest version of Spring Boot.
Add the Web dependencies to the project.

Step 2: Add Smart-doc Into the Project
Add smart-doc-maven-plugin to the project's pom.xml:

XML
<plugin>
    <groupId>com.ly.smart-doc</groupId>
    <artifactId>smart-doc-maven-plugin</artifactId>
    <version>[latest version]</version>
    <configuration>
        <configFile>./src/main/resources/smart-doc.json</configFile>
        <projectName>${project.description}</projectName>
    </configuration>
</plugin>

Create the smart-doc.json file in the resources directory of the module where the project startup class is located.

JSON
{
  "outPath": "/path/to/userdir"
}

Step 3: Create a Rest Controller
Now let's create a controller class that will handle HTTP requests and return responses. First, create a model class that will be sent as the JSON response.
Java public class User { /** * user id * */ private long id; /** * first name */ private String firstName; /** * last name */ private String lastName; /** * email address */ private String email; public long getId() { return id; } public void setId(long id) { this.id = id; } public String getFirstName() { return firstName; } public void setFirstName(String firstName) { this.firstName = firstName; } public String getLastName() { return lastName; } public void setLastName(String lastName) { this.lastName = lastName; } public String getEmail() { return email; } public void setEmail(String email) { this.email = email; } } Now create a service class Java @Repository public class UserRepository { private static final Map<Long, User> users = new ConcurrentHashMap<>(); static { User user = new User(); user.setId(1); user.setEmail("123@gmail.com"); user.setFirstName("Tom"); user.setLastName("King"); users.put(1L,user); } public Optional<User> findById(long id) { return Optional.ofNullable(users.get(id)); } public void add(User book) { users.put(book.getId(), book); } public List<User> getUsers() { return users.values().stream().collect(Collectors.toList()); } public boolean delete(User user) { return users.remove(user.getId(),user); } } Create the RestController Class. Java /** * The type User controller. * * @author yu 2020/12/27. */ @RestController @RequestMapping("/api/v1") public class UserController { @Resource private UserRepository userRepository; /** * Create user. * * @param user the user * @return the user */ @PostMapping("/users") public ResponseResult<User> createUser(@Valid @RequestBody User user) { userRepository.add(user); return ResponseResult.ok(user); } /** * Get all users list. * * @return the list */ @GetMapping("/users") public ResponseResult<List<User>> getAllUsers() { return ResponseResult.ok().setResultData(userRepository.getUsers()); } /** * Gets users by id. * * @param userId the user id|1 * @return the users by id */ @GetMapping("/users/{id}") public ResponseResult<User> getUsersById(@PathVariable(value = "id") Long userId) { User user = userRepository.findById(userId). orElseThrow(() -> new ResourceNotFoundException("User not found on :: " + userId)); return ResponseResult.ok().setResultData(user); } /** * Update user response entity. * * @param userId the user id|1 * @param userDetails the user details * @return the response entity */ @PutMapping("/users/{id}") public ResponseResult<User> updateUser(@PathVariable(value = "id") Long userId, @Valid @RequestBody User userDetails) { User user = userRepository.findById(userId). orElseThrow(() -> new ResourceNotFoundException("User not found on :: " + userId)); user.setEmail(userDetails.getEmail()); user.setLastName(userDetails.getLastName()); user.setFirstName(userDetails.getFirstName()); userRepository.add(user); return ResponseResult.ok().setResultData(user); } /** * Delete user. * * @param userId the user id|1 * @return the map */ @DeleteMapping("/user/{id}") public ResponseResult<Boolean> deleteUser(@PathVariable(value = "id") Long userId) { User user = userRepository.findById(userId). orElseThrow(() -> new ResourceNotFoundException("User not found on :: " + userId)); return ResponseResult.ok().setResultData(userRepository.delete(user)); } } Step 4: Generate a Document You can use the Smart-doc plugin in IntelliJ IDEA to generate the desired documentation, such as OpenAPI, Markdown, etc. 
Of course, you can also use the Maven commands to generate the documentation: Shell
# Generate HTML
mvn smart-doc:html
# Generate Markdown
mvn smart-doc:markdown
# Generate AsciiDoc
mvn smart-doc:adoc
# Generate a Postman collection
mvn smart-doc:postman
# Generate OpenAPI 3.0+
mvn smart-doc:openapi
Step 5: Import to Postman Here we use Smart-doc to generate a Postman.json file, then import it into Postman to see the result. Since Smart-doc supports generating documentation in multiple formats, you can also choose to generate OpenAPI and then display it using Swagger UI or import it into a professional API documentation system. Conclusion From the previous examples, it can be seen that Smart-doc generates documentation by scanning standard Java comments in the code, without the need for specialized annotations like Swagger's, thus maintaining the simplicity and non-invasiveness of the code while not affecting the size of the service JAR package. It supports multiple formats for document output, including Markdown, HTML5, Postman Collection, OpenAPI 3.0, etc. This flexibility allows developers to choose the appropriate document format for output based on their needs. The Maven and Gradle plugins provided by Smart-doc also make it easy to integrate document generation into DevOps pipelines. Currently, Swagger still has its advantages, such as more powerful UI features and better support for Spring Boot WebFlux.
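One small gap in the walkthrough above: the UserController returns a ResponseResult wrapper that is never shown. Smart-doc simply documents whatever generic response type your project uses, so the class below is only a hypothetical minimal version, included to make the controller example self-contained; your own project may already have an equivalent with different field names. Java
// Hypothetical minimal response wrapper assumed by the UserController example above.
public class ResponseResult<T> {

    private int code;
    private String message;
    private T resultData;

    public static <T> ResponseResult<T> ok() {
        ResponseResult<T> result = new ResponseResult<>();
        result.code = 200;
        result.message = "OK";
        return result;
    }

    public static <T> ResponseResult<T> ok(T data) {
        ResponseResult<T> result = ok();
        result.resultData = data;
        return result;
    }

    public ResponseResult<T> setResultData(T resultData) {
        this.resultData = resultData;
        return this; // fluent style matches the controller's ok().setResultData(...) usage
    }

    public int getCode() { return code; }
    public String getMessage() { return message; }
    public T getResultData() { return resultData; }
}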
If you’ve spent a lot of time creating and editing documents in the MS Word application, there’s a good chance you’ve heard of (and maybe even used) the DOCX comparison feature. This simple, manual comparison tool produces a three-pane view displaying the differences between two versions of a file. It’s a useful tool for summarizing the journey legal contracts (or other, similar documents that tend to start as templates) take when they undergo multiple rounds of collaborative edits. As useful as manual DOCX document comparisons are, they’re still manual, which immediately makes them inefficient at scale. Thankfully, though, the open-source file structure DOCX is based on - OpenXML - is designed to facilitate the automation of manual processes like this by making Office document file structure easily accessible to programmers. With the right developer tools, you can make programmatic DOCX comparisons at scale in your own applications. In this article, you’ll learn how to carry out DOCX comparisons programmatically by calling a specialized web API with Java code examples. This will help you automate DOCX comparisons without the need to understand OpenXML formatting or write a ton of new code. Before we get to our demonstration, however, we'll first briefly review OpenXML formatting, and we'll also learn about an open-source library that can be used to read and write Office files in Java. Understanding OpenXML OpenXML formatting has been around for a long time now (since 2007), and it’s the standard all major Office documents are currently based on. Thanks to OpenXML formatting, all Office files – including Word (DOCX), Excel (XLSX), PowerPoint (PPTX), and others – are structured as open-source zip archives containing compressed metadata, file specifications, etc. in XML format. We can easily review this file structure for ourselves by renaming Office files as .zip files. To do that, we can CD into one of our DOCX file's directories (Windows) and rename our file using the below command (replacing the example file name below with our own file name): Shell ren "hello world".docx "hello world".zip We can then open the .zip version of our DOCX file and poke around in our file archive. When we open DOCX files in our MS Word application, our files are unzipped, and we can then use various built-in application tools to manipulate our files’ contents. This open-source file structure makes it relatively straightforward to build applications that read and write DOCX files. It is, to use a well-known example, the reason why programs like Google Drive can upload and manipulate DOCX files in their own text editor applications. With a good understanding of OpenXML structure, we could build our own text editor applications to manipulate DOCX files if we wanted – it would just be a LOT of work. It wouldn’t be especially worth our time, either, given the number of applications and programming libraries that already exist for exactly that purpose. Writing DOCX Comparisons in Java While the OpenXML SDK is open source (hosted on GitHub for anyone to use), it’s written to be used with .NET languages like C#. If we were looking to automate DOCX comparisons with an open-source library in Java, we would need to use something like the Apache POI library to build our application instead. 
Our process would roughly entail: Adding Apache POI dependencies to our pom.xml Importing the XWPF library (designed for OpenXML files) Writing some code to load and extract relevant content from our documents Part 3 is where things would start to get complicated - we would need to write a bunch of code to retrieve and compare paragraph elements from each document, and if we wanted to ensure consistent formatting across both of our documents (important for our resulting comparison document), we would need to break down our paragraphs into runs. We would then, of course, need to implement our own robust error handling before writing our DOCX comparison result to a new file. Advantages of a Web API for DOCX Comparison Writing our DOCX comparison from scratch would take time, and it would also put the burden of our file-processing operation squarely on our own server. That might not be a big deal for comparisons involving smaller-sized DOCX documents, but it would start to take a toll with larger-sized documents and larger-scale (higher volume) operations. By calling a web API to handle our DOCX comparison instead, we’ll limit the amount of code we need to write, and we’ll offload the heavy lifting in our comparison workflow to an external server. That way, we can focus more of our hands-on coding efforts on building robust features in our application that handle the results of our DOCX comparisons in various ways. Demonstration Using the code examples below, we can call an API that simplifies the process of automating DOCX comparisons. Rather than writing a bunch of new code, we’ll just need to copy relevant examples, load our input files, and write our resulting comparison strings to new DOCX files of their own. To help demonstrate what the output of our programmatic comparison looks like, I’ve included a screenshot from a simple DOCX comparison result below. This document shows the comparison between two versions of a classic Lorem Ipsum passage – one containing all of the original Latin text, and the other containing a few lines of English text: To structure our API call, we can begin by installing the client SDK. Let’s add a reference to our pom.xml repository: XML <repositories> <repository> <id>jitpack.io</id> <url>https://jitpack.io</url> </repository> </repositories> And let’s add a reference to the dependency in our pom.xml: XML <dependencies> <dependency> <groupId>com.github.Cloudmersive</groupId> <artifactId>Cloudmersive.APIClient.Java</artifactId> <version>v4.25</version> </dependency> </dependencies> After that, we can add the following Imports to our controller: Java // Import classes: //import com.cloudmersive.client.invoker.ApiClient; //import com.cloudmersive.client.invoker.ApiException; //import com.cloudmersive.client.invoker.Configuration; //import com.cloudmersive.client.invoker.auth.*; //import com.cloudmersive.client.CompareDocumentApi; Now we can turn our attention to configuration. We’ll need to supply a free Cloudmersive API key (this allows 800 API calls/month with no commitments) in the following configuration snippet: Java ApiClient defaultClient = Configuration.getDefaultApiClient(); // Configure API key authorization: Apikey ApiKeyAuth Apikey = (ApiKeyAuth) defaultClient.getAuthentication("Apikey"); Apikey.setApiKey("YOUR API KEY"); // Uncomment the following line to set a prefix for the API key, e.g. 
"Token" (defaults to null) //Apikey.setApiKeyPrefix("Token"); Next, we can use our final code examples below to create an instance of the API and call the DOCX comparison function: Java CompareDocumentApi apiInstance = new CompareDocumentApi(); File inputFile1 = new File("/path/to/inputfile"); // File | First input file to perform the operation on. File inputFile2 = new File("/path/to/inputfile"); // File | Second input file to perform the operation on (more than 2 can be supplied). try { byte[] result = apiInstance.compareDocumentDocx(inputFile1, inputFile2); System.out.println(result); } catch (ApiException e) { System.err.println("Exception when calling CompareDocumentApi#compareDocumentDocx"); e.printStackTrace(); } Now we can easily automate DOCX comparisons with a few lines of code. If our input DOCX files contain any errors, the endpoint will try to auto-repair the files before making the comparison. Conclusion In this article, we learned about the MS Word DOCX Comparison tool and discussed how DOCX comparisons can be automated (thanks to OpenXML formatting). We then learned how to call a low-code DOCX comparison API with Java code examples.
Cucumber is a tool that supports Behavior-Driven Development (BDD). In this blog, you will learn how to pass arguments to step definitions when using Cucumber and Spring Boot. Enjoy! Introduction In a previous post, Cucumber was introduced as a tool that supports Behavior-Driven Development (BDD). Some of the features were explained, but not how to pass arguments to step definitions. In this blog, you will learn how you can do so. The application under test is a Spring Boot application. You will also learn how you can integrate the Cucumber tests with Spring. The sources used in this blog are available on GitHub. Do check out the following references for extra information: Cucumber Expressions Cucumber Configuration: Type Registry Prerequisites The prerequisites for this blog are: Basic Java knowledge - Java 21 is used Basic Maven knowledge Basic Spring Boot knowledge Basic comprehension of BDD Basic knowledge of Cucumber (see the previous blog for an introduction) Application Under Test The application under test is a basic Spring Boot application. It consists of a Controller and a Service. The Controller serves a customer endpoint that implements an OpenAPI specification. The Service is a basic implementation, storing customers in a HashMap. A customer only has a first name and a last name, just to keep things simple. The API offers the following functionality: Creating a customer Retrieving the customer based on the customer ID Retrieving all customers Deleting all customers Spring Integration In order to enable the Spring integration, you add the following dependency to the pom: XML <dependency> <groupId>io.cucumber</groupId> <artifactId>cucumber-spring</artifactId> <version>7.14.0</version> <scope>test</scope> </dependency> The Spring Boot application must be in a running state; therefore, you need to run the Cucumber tests with the @SpringBootTest annotation. This will start the application and you will be able to run Cucumber tests for it. In order to do so, you create a class CucumberSpringConfiguration. Add the @CucumberContextConfiguration annotation so that the Spring integration is enabled. The Spring Boot application starts on a random port; therefore, you store the port in a system property so that you will be able to use it when you need to call the API. Java @CucumberContextConfiguration @SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT) public class CucumberSpringConfiguration { @LocalServerPort private int port; @PostConstruct public void setup() { System.setProperty("port", String.valueOf(port)); } } The Cucumber step definitions will extend this class. Tests can be run via Maven: Shell $ mvn clean verify Test: Add Customer Using Arguments The Add Customer test will add a customer to the customer list and will verify that the customer is added to the list. The feature file is the following. Do note that the first name (John) and last name (Doe) are within quotes. This way, Cucumber is able to recognize a string argument. Plain Text Scenario: Add customer Given an empty customer list When customer 'John' 'Doe' is added Then the customer 'John' 'Doe' is added to the customer list The corresponding step definitions are the following. When: The first name and last name placeholders are represented with {string} and they are mapped as arguments to the method. This way, the arguments are accessible to the step definition. Then: In a similar way, the arguments are passed to the step definition.
Java public class StepDefinitions extends CucumberSpringConfiguration { final int port = Integer.parseInt(System.getProperty("port")); final RestClient restClient = RestClient.create(); @Given("an empty customer list") public void an_empty_customer_list() { ResponseEntity<Void> response = restClient.delete() .uri("http://localhost:"+ port + "/customer") .retrieve() .toBodilessEntity(); } @When("customer {string} {string} is added") public void customer_firstname_lastname_is_added(String firstName, String lastName) { Customer customer = new Customer(firstName, lastName); ResponseEntity<Void> response = restClient.post() .uri("http://localhost:"+ port + "/customer") .contentType(APPLICATION_JSON) .body(customer) .retrieve() .toBodilessEntity(); assertThat(response.getStatusCode().is2xxSuccessful()).isTrue(); } @Then("the customer {string} {string} is added to the customer list") public void the_customer_first_name_last_name_is_added_to_the_customer_list(String firstName, String lastName) { List<Customer> customers = restClient.get() .uri("http://localhost:"+ port + "/customer") .retrieve() .body(new ParameterizedTypeReference<>() {}); assertThat(customers).contains(new Customer(firstName, lastName)); } ... } Note that the arguments are used to create a Customer object which is defined in the step definitions class. This class contains the fields, getters, setters, equals, and hashCode implementations. Java public static class Customer { private String firstName; private String lastName; ... } Test: Add Customers Using Arguments When you want to add several customers, you can chain the same step definition by means of an And using different arguments. The feature file is the following, the step definitions remain the same. Plain Text Scenario: Add customers Given an empty customer list When customer 'John' 'Doe' is added And customer 'David' 'Beckham' is added Then the customer 'John' 'Doe' is added to the customer list And the customer 'David' 'Beckham' is added to the customer list Test: Add Customer Using DataTable The previous tests all started with an empty customer list. The next test will add some data to the customer list as a starting point. You can, of course, use the step definition customer firstName lastName is added and invoke it multiple times, but you can also use a DataTable. The DataTable must be the last argument in a step definition. The feature file is the following and the DataTable is used in the Given-clause. Plain Text Scenario: Add customer to existing customers Given the following customers: | John | Doe | | David | Beckham | When customer 'Bruce' 'Springsteen' is added Then the customer 'Bruce' 'Springsteen' is added to the customer list In the implementation of the step definition, you now see that the arguments are passed as a DataTable. It is a table containing strings, so you need to parse the table yourself. Java @Given("the following customers:") public void the_following_customers(io.cucumber.datatable.DataTable dataTable) { for (List<String> customer : dataTable.asLists()) { customer_firstname_lastname_is_added(customer.get(0), customer.get(1)); } } Test: Add Customer Using Parameter Type In the previous test, you needed to parse the DataTable yourself. Wouldn’t it be great if the DataTable could be mapped immediately to a Customer object? This is possible if you define a parameter type for it. You create a parameter type customerEntry and annotate it with @DataTableType. You use the string arguments of a DataTable to create a Customer object. 
You do so in a class ParameterTypes, which is considered a best practice. Java public class ParameterTypes { @DataTableType public StepDefinitions.Customer customerEntry(Map<String, String> entry) { return new StepDefinitions.Customer( entry.get("firstName"), entry.get("lastName")); } } The feature file is identical to the previous one, only the step definition has changed in order to have a unique step definition. Plain Text Scenario: Add customer to existing customers with parameter type Given the following customers with parameter type: | John | Doe | | David | Beckham | When customer 'Bruce' 'Springsteen' is added Then the customer 'Bruce' 'Springsteen' is added to the customer list In the implementation of the step definition, you notice that the argument is not a DataTable anymore, but a list of Customer. Java @Given("the following customers with parameter type:") public void the_following_customers_with_parameter_type(List<Customer> customers) { for (Customer customer : customers) { customer_firstname_lastname_is_added(customer.getFirstName(), customer.getLastName()); } } Conclusion In this blog, you learned how to integrate Cucumber with a Spring Boot application and several ways to pass arguments to your step definitions - a powerful feature of Cucumber!
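One practical note: mvn clean verify only picks up the Cucumber scenarios if a runner (or equivalent JUnit Platform configuration) is present. The article's GitHub sources presumably contain their own wiring; if you are assembling this from scratch, a minimal runner for the cucumber-junit-platform-engine could look like the sketch below, where the class name, glue package, and feature location are assumptions rather than values taken from the article. Java
// Hypothetical Cucumber runner for the JUnit Platform; adjust the feature
// location and glue package to match your own project layout.
import static io.cucumber.junit.platform.engine.Constants.GLUE_PROPERTY_NAME;

import org.junit.platform.suite.api.ConfigurationParameter;
import org.junit.platform.suite.api.IncludeEngines;
import org.junit.platform.suite.api.SelectClasspathResource;
import org.junit.platform.suite.api.Suite;

@Suite
@IncludeEngines("cucumber")
@SelectClasspathResource("features") // feature files under src/test/resources/features
@ConfigurationParameter(key = GLUE_PROPERTY_NAME, value = "com.example.cucumber") // step definition package
public class CucumberIT {
}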
When it comes to data integration, some people may wonder what there is to discuss — isn't it just ETL? That is, extracting from various databases, transforming, and ultimately loading into different data warehouses. However, with the rise of big data, data lakes, real-time data warehouses, and large-scale models, the architecture of data integration has evolved from the ETL of the data warehouse era to the ELT of the big data era and now to the current stage of EtLT. In the global tech landscape, EtLT companies such as Fivetran, Airbyte, and Matillion have emerged, while giants like IBM have invested $2.3 billion in acquiring StreamSets and webMethods to upgrade their product lines from ETL to EtLT (DataOps). Whether you're a data professional or a manager, it's crucial to re-evaluate the recent changes and future trends in data integration. Chapter 1: From ETL to ELT, to EtLT ETL Architecture Most experts in the data field are familiar with the term ETL. During the heyday of data warehousing, ETL tools like IBM DataStage, Informatica, Talend, and Kettle were popular. Some companies still use these tools to extract data from various databases, transform it, and load it into different data warehouses for reporting and analysis. The pros and cons of the ETL architecture are as follows: Advantages of ETL Architecture Data consistency and quality Integration of complex data sources Clear technical architecture Implementation of business rules Disadvantages of ETL Architecture Lack of real-time processing High hardware costs Limited flexibility Maintenance costs Limited handling of unstructured data ELT Architecture With the advent of the big data era, facing the challenges of ETL's inability to load complex data sources and its poor real-time performance, a variant of the ETL architecture, ELT, emerged. Companies started using ELT tools provided by various data warehousing vendors, such as Teradata's BTEQ/FastLoad/TPT and Apache Sqoop in the Hadoop/Hive ecosystem. The characteristics of the ELT architecture include directly loading data into data warehouses or big data platforms without complex transformations and then using SQL or H-SQL to process the data. The pros and cons of the ELT architecture are as follows: Advantages of ELT Architecture Handling large data volumes Improved development and operational efficiency Cost-effectiveness Flexibility and scalability Integration with new technologies Disadvantages of ELT Architecture Limited real-time support High data storage costs Data quality issues Dependence on target system capabilities EtLT Architecture With the popularity of data lakes and real-time data warehouses, the weaknesses of the ELT architecture in real-time processing and handling unstructured data have been highlighted. Thus, a new architecture, EtLT, has emerged.
EtLT architecture enhances ELT by adding real-time data extraction from sources like SaaS, Binlog, and cloud components, as well as incorporating small-scale transformations before loading the data into the target storage. This trend has led to the emergence of several specialized companies worldwide, such as StreamSets, Attunity (acquired by Qlik), Fivetran, and SeaTunnel by the Apache Foundation. The pros and cons of the EtLT architecture are as follows: Advantages of EtLT Architecture Real-time data processing Support for complex data sources Cost reduction Flexibility and scalability Performance optimization Support for large models Data quality and governance Disadvantages of EtLT Architecture Technical complexity Dependence on target system capabilities Management and monitoring challenges Increased data change management complexity Dependency on tools and platforms Overall, in recent years, with the rise of data, real-time data warehouses, and large models, the EtLT architecture has gradually become mainstream worldwide in the field of data integration. For specific historical details, you can refer to the relevant content in the article, "ELT is dead, and EtLT will be the end of modern data processing architecture." Under this overarching trend, let's interpret the maturity model of the entire data integration track. Overall, there are four clear trends: In the trend of ETL evolving into EtLT, the focus of data integration has shifted from traditional batch processing to real-time data collection and batch-stream integrated data integration. The hottest scenarios have also shifted from past single-database batch integration scenarios to hybrid cloud, SaaS, and multiple data sources integrated in a batch-stream manner. Data complexity transformation has gradually shifted from traditional ETL tools to processing complex transformations in data warehouses. At the same time, support for automatic schema changes (Schema Evolution) in the case of DDL (field definition) changes during real-time data integration has also begun. Even adapting to DDL changes in lightweight transformations has become a trend. Support for data source types has expanded from files and traditional databases to include emerging data sources, open-source big data ecosystems, unstructured data systems, cloud databases, and support for large models. These are also the most common scenarios encountered in every enterprise, and in the future, real-time data warehouses, lakes, clouds, and large models will be used in different scenarios within each enterprise. In terms of core capabilities and performance, diversity of data sources, high accuracy, and ease of troubleshooting are the top priorities for most enterprises. Conversely, there are not many examination points for capabilities such as high throughput and high real-time performance. Regarding data virtualization, DataFabric, and ZeroETL mentioned in the report, let's delve into the interpretation of the data integration maturity model below. Chapter 2: Data Integration Maturity Model Interpretation Data Production The data production segment refers to how data is obtained, distributed, transformed, and stored within the context of data integration. This part poses the greatest workload and challenges in integrating data. When users in the industry use data integration tools, their primary consideration is whether the tools support integration with their databases, cloud services, and SaaS systems. 
If these tools do not support the user's proprietary systems, then additional costs are incurred for customizing interfaces or exporting data into compatible files, which can pose challenges to the timeliness and accuracy of data. Data Collection Most data integration tools now support batch collection, rate limiting, and HTTP collection. However, real-time data acquisition (CDC) and DDL change detection are still in their growth and popularity stages. Particularly, the ability to handle DDL changes in source systems is crucial. Real-time data processing is often interrupted by changes in source system structures. Effectively addressing the technical complexity of DDL changes remains a challenge, and various industry vendors are still exploring solutions. Data Transformation With the gradual decline of ETL architectures, complex business processing (e.g., Join, Group By) within integration tools has gradually faded into history. Especially in real-time scenarios, there is limited memory available for operations like stream window Join and aggregation. Therefore, most ETL tools are migrating towards ELT and EtLT architectures. Lightweight data transformation using SQL-like languages has become mainstream, allowing developers to perform data cleaning without having to learn various data integration tools. Additionally, the integration of data content monitoring and DDL change transformation processing, combined with notification, alerts, and automation, is making data transformation a more intelligent process. Data Distribution Traditional JDBC loading, HTTP, and bulk loading have become essential features of every mainstream data integration tool, with competition focusing on the breadth of data source support. Automated DDL changes reduce developers' workload and ensure the smooth execution of data integration tasks. Various vendors employ their methods to handle complex scenarios where data table definitions change. Integration with large models is emerging as a new trend, allowing internal enterprise data to interface with large models, though it is currently the domain of enthusiasts in some open-source communities. Data Storage Next-generation data integration tools come with caching capabilities. Previously, this caching existed locally, but now distributed storage and distributed checkpoint/snapshot technologies are used. Effective utilization of cloud storage is also becoming a new direction, especially in scenarios involving large data caches requiring data replay and recording. Data Structure Migration This part deals with whether automatic table creation and inspection can be performed during the data integration process. Automatic table creation involves automatically creating tables/data structures in the target system that are compatible with those in the source system. This significantly reduces the workload of data development engineers. Automatic schema inference is a more complex scenario. In the EtLT architecture, in the event of real-time data DDL changes or changes in data fields, automatic inference of their rationality allows users to identify issues with data integration tasks before they run. The industry is still in the experimentation phase regarding this aspect. Computational Model The computational model evolves with the changing landscape of ETL, ELT, and EtLT. 
It has transitioned from emphasizing computation in the early stages to focusing on transmission in the middle stages, and now emphasizes lightweight computation during real-time transmission: Offline Data Synchronization This has become the most basic data integration requirement for every enterprise. However, the performance varies under different architectures. Overall, ETL architecture tools have much lower performance than ELT and EtLT tools under conditions of large-scale data. Real-Time Data Synchronization With the popularity of real-time data warehouses and data lakes, real-time data synchronization has become an essential factor for every enterprise to consider when integrating data. More and more companies are beginning to use real-time synchronization. Batch-Streaming Integration New-generation data integration engines are designed from the outset to consider batch-stream integration, providing more effective synchronization methods for different enterprise scenarios. In contrast, most traditional engines were designed to focus on either real-time or offline scenarios, resulting in poor performance for batch data synchronization. Unified use of batch and streaming can perform better in data initialization and hybrid batch-stream environments. Cloud-Native Overseas data integration tools are more aggressive in this aspect because they are billed on a pay-as-you-go basis. Therefore, the ability to quickly obtain/release responsive computing resources for each task is the core competitiveness and profit source for every company. In contrast, progress in big data cloud-native integration in China is still relatively slow, so it remains a subject of exploration for only a few companies domestically. Data Types and Typical Scenarios File Collection This is a basic feature of every integration tool. However, unlike in the past, apart from standard text files, the collection of data in formats like Parquet and ORC has become standard. Big Data Collection With the popularity of emerging data sources such as Snowflake, Redshift, Hudi, Iceberg, ClickHouse, Doris, and StarRocks, traditional data integration tools are significantly lagging in this regard. Users in China and the United States are generally at the same level in terms of big data usage, hence requiring vendors to adapt to these emerging data sources. Binlog Collection This is a burgeoning industry in China, as it has replaced traditional tools like DataStage and Informatica during the process of informatization. However, the replacement of databases like Oracle and DB2 has not been as rapid, resulting in a large number of specialized Binlog data collection companies emerging to solve CDC problems overseas. Informatization Data Collection This is a scenario unique to China. With the process of informatization, numerous domestic databases have emerged. Whether these databases' batch and real-time collection can be adapted, presents a higher challenge for Chinese vendors. Sharding In most large enterprises, sharding is commonly used to reduce the pressure on databases. Therefore, whether data integration tools support sharding has become a standard feature of professional data integration tools. Message Queues Driven by data lakes and real-time data warehouses, everything related to real-time is booming. Message queues, as the representatives of enterprise real-time data exchange centers, have become indispensable options for advanced enterprises. 
Whether data integration tools support a sufficient number of memory/disk message queue types has become one of the hottest features. Unstructured Data Non-structural data sources such as MongoDB and Elasticsearch have become essential for enterprises. Data integration also supports such data sources correspondingly. Big Model Data Numerous startups worldwide are working on quickly interacting with enterprise data and large datasets. SaaS Integration This is a very popular feature overseas but has yet to generate significant demand in China. Data Unified Scheduling Integrating data integration with scheduling systems, especially coordinating real-time data through scheduling systems and subsequent data warehouse tasks, is essential for building real-time data warehouses. Real-Time Data Warehouse/Data Lake These are currently the most popular scenarios for enterprises. Real-time data entry into warehouses/lakes enables the advantages of next-generation data warehouses/lakes to be realized. Data Disaster Recovery Backup With the enhancement of data integration real-time capabilities and CDC support, integration in the traditional disaster recovery field has emerged. Some data integration and disaster recovery vendors have begun to work in each other's areas. However, due to significant differences in detail between disaster recovery and integration scenarios, vendors penetrating each other's domains may lack functionality and require iterative improvements over time. Operation and Monitoring In data integration, operation and monitoring are essential functionalities. Effective operation and monitoring significantly reduce the workload of system operation and development personnel in case of data issues. Flow Control Modern data integration tools control traffic from multiple aspects such as task parallelism, single-task JDBC parallelism, and single JDBC reading volume, ensuring minimal impact on source systems. Task/Table-Level Statistics Task-level and table-level synchronization statistics are crucial for managing operations and maintenance personnel during data integration processes. Step-By-Step Trial Run Due to support for real-time data, SaaS, and lightweight transformation, running a complex data flow directly becomes more complicated. Therefore, some advanced companies have introduced step-by-step trial run functionality for efficient development and operation. Table Change Event Capture This is an emerging feature in real-time data processing, allowing users to make changes or alerts in a predefined manner when table changes occur in the source system, thereby maximizing the stability of real-time data. Batch-Stream Integrated Scheduling After real-time CDC and stream processing, integration with traditional batch data warehouse tasks is inevitable. However, ensuring accurate startup of batch data without affecting data stream operation remains a challenge. This is why integration and batch-stream integrated scheduling are related. Intelligent Diagnosis/Tuning/Resource Optimization In cluster and cloud-native scenarios, effectively utilizing existing resources and recommending correct solutions in case of problems are hot topics among the most advanced data integration companies. However, achieving production-level intelligent applications may take some time. Core Capabilities There are many important functionalities in data integration, but the following points are the most critical. The lack of these capabilities may have a significant impact during enterprise usage. 
Full/Incremental Synchronization Separate full/incremental synchronization has become a necessary feature of every data integration tool. However, the automatic switch from full to incremental mode has not yet become widespread among small and medium-sized vendors, requiring manual switching by users. CDC Capture As enterprise demands for real-time data increase, CDC capture has become a core competitive advantage of data integration. The support for the CDC from multiple data sources, the requirements, and the impact of the CDC on source databases, often become the core competitiveness of data integration tools. Data Diversity Supporting multiple data sources has become a "red ocean competition" in data integration tools. Better support for users' existing system data sources often leads to a more advantageous position in business competition. Checkpoint Resumption Whether real-time and batch data integration supports checkpoint resumption is helpful in quickly recovering from error data scenes in many scenarios or assisting in recovery in some exceptional cases. However, only a few tools currently support this feature. Concurrency/Limiting Speed Data integration tools need to be highly concurrent when speed is required and effectively reduce the impact on source systems when slow. This has become a necessary feature of integration tools. Multi-Table Synchronization/Whole-Database Migration This refers not only to convenient selection in the interface but also to whether JDBC or existing integration tasks can be reused at the engine level, thereby making better use of existing resources and completing data integration quickly. Performance Optimization In addition to core capabilities, performance often represents whether users need more resources or whether the hardware and cloud costs of data integration tools are low enough. However, extreme performance is currently unnecessary, and it is often considered the third factor after interface support and core capabilities. Timeliness Minute-level integration has gradually exited the stage of history, and supporting second-level data integration has become a very popular feature. However, millisecond-level data integration scenarios are still relatively rare, mostly appearing in disaster recovery special scenarios. Data Scale Most scenarios currently involve Tb-level data integration, while Pb-level data integration is implemented by open-source tools used by Internet giants. Eb-level data integration will not appear in the short term. High Throughput High throughput mainly depends on whether integration tools can effectively utilize network and CPU resources to achieve the maximum value of theoretical data integration. In this regard, tools based on ELT and EtLT have obvious advantages over ETL tools. Distributed Integration Dynamic fault tolerance is more important than dynamic scaling and cloud-native. The ability of a large data integration task to automatically tolerate errors in hardware and network failure situations is a basic function when doing large-scale data integration. Scalability and cloud-native are derived requirements in this scenario. Accuracy How data integration ensures consistency is a complex task. In addition to using multiple technologies to ensure "Exactly Once" CRC verification is done. Third-party data quality inspection tools are also needed rather than just "self-certification." Therefore, data integration tools often cooperate with data scheduling tools to verify data accuracy. 
Stability This is the result of multiple functions. Ensuring the stability of individual tasks is important in terms of availability, task isolation, data isolation, permissions, and encryption control. When problems occur in a single task or department, they should not affect other tasks and departments. Ecology Excellent data integration tools have a large ecosystem that supports synchronization with multiple data sources and integration with upstream and downstream scheduling and monitoring systems. Moreover, tool usability is also an important indicator involving enterprise personnel costs. Chapter 3: Trends In the coming years, with the proliferation of the EtLT architecture, many new scenarios will emerge in data integration, while data virtualization and DataFabric will also have significant impacts on future data integration: Multi-Cloud Integration This is already widespread globally, with most data integrations having cross-cloud integration capabilities. In China, due to the limited prevalence of clouds, this aspect is still in the early incubation stage. ETL Integration As the ETL cycle declines, most enterprises will gradually migrate from tools like Kettle, Informatica, Talend, etc., to emerging EtLT architectures, thereby supporting batch-stream integrated data integration and more emerging data sources. ELT Currently, most mainstream big data architectures are based on ELT. With the rise of real-time data warehouses and data lakes, ELT-related tools will gradually upgrade to EtLT tools, or add real-time EtLT tools to compensate for the lack of real-time data support in ELT architectures. EtLT Globally, companies like JPMorgan, Shein, Shoppe, etc., are embedding themselves in the EtLT architecture. More companies will integrate their internal data integration tools into the EtLT architecture, combined with batch-stream integrated scheduling systems to meet enterprise DataOps-related requirements. Automated Governance With the increase in data sources and real-time data, traditional governance processes cannot meet the timeliness requirements for real-time analysis. Automated governance will gradually rise within enterprises in the next few years. Big Model Support As large models penetrate enterprise applications, providing data to large models becomes a necessary skill for data integration. Traditional ETL and ELT architectures are relatively difficult to adapt to real-time, large batch data scenarios, so the EtLT architecture will deepen its penetration into most enterprises along with the popularization of large models. ZeroETL This is a concept proposed by Amazon, suggesting that data stored on S3 can be accessed directly by various engines without the need for ETL between different engines. In a sense, if the data scenario is not complex, and the data volume is small, a small number of engines can meet the OLAP and OLTP requirements. However, due to limited scenario support and poor performance, it will take some time for more companies to recognize this approach. DataFabric Currently, many companies propose using DataFabric metadata to manage all data, eliminating the need for ETL/ELT during queries and directly accessing underlying data. This technology is still in the experimental stage, with significant challenges in query response and scenario adaptation. It can meet the needs of simple scenarios with small data queries, but for complex big data scenarios, the EtLT architecture will still be necessary for the foreseeable future. 
Data Virtualization The basic idea is similar to the execution layer of DataFabric. Data does not need to be moved; instead, it is queried directly through ad-hoc query interfaces and compute engines (e.g., Presto, TrinoDB) to translate data stored in underlying data storage or data engines. However, in the case of large amounts of data, engine query efficiency and memory consumption often fail to meet expectations, so it is only used in scenarios with small amounts of data. Conclusion From an overall trend perspective, with the explosive growth of global data, the emergence of large models, and the proliferation of data engines for various scenarios, the rise of real-time data has brought data integration back to the forefront of the data field. If data is considered a new energy source, then data integration is like the pipeline of this new energy. The more data engines there are, the higher the efficiency, data source compatibility, and usability requirements of the pipeline will be. Although data integration will eventually face challenges from Zero ETL, data virtualization, and DataFabric, in the visible future, the performance, accuracy, and ROI of these technologies have always failed to reach the level of popularity of data integration. Otherwise, the most popular data engines in the United States should not be SnowFlake or DeltaLake but TrinoDB. Of course, I believe that in the next 10 years, under the circumstances of DataFabric x large models, virtualization + EtLT + data routing may be the ultimate solution for data integration. In short, as long as data volume grows, the pipelines between data will always exist. Chapter 4: How To Use the Data Integration Maturity Model Firstly, the maturity model provides a comprehensive view of current and potential future technologies that may be utilized in data integration over the next 10 years. It offers individuals insight into personal skill development and assists enterprises in designing and selecting appropriate technological architectures. Additionally, it guides key development areas within the data integration industry. For enterprises, technology maturity aids in assessing the level of investment in a particular technology. For a mature technology, it is likely to have been in use for many years, supporting business operations effectively. However, as technological advancements reach a plateau, consideration can be given to adopting newer, more promising technologies to achieve higher business value. Technologies in decline are likely to face increasing limitations and issues in supporting business operations, gradually being replaced by newer technologies within 3-5 years. When introducing such technologies, it's essential to consider their business value and the current state of the enterprise. Popular technologies, on the other hand, are prioritized by enterprises due to their widespread validation among early adopters, with the majority of businesses and technology companies endorsing them. Their business value has been verified, and they are expected to dominate the market in the next 1-2 years. Growing technologies require consideration based on their business value, having passed the early adoption phase, and having their technological and business values validated by early adopters. They have not yet been fully embraced in the market due to reasons such as branding and promotion but are likely to become popular technologies and future industry standards. 
Forward-looking technologies are generally cutting-edge and used by early adopters, offering some business value. However, their general applicability and ROI have not been fully validated. Enterprises can consider limited adoption in areas where they provide significant business value. For individuals, mature and declining technologies offer limited learning and research value, as they are already widely adopted. Focusing on popular technologies can be advantageous for employment prospects, as they are highly sought after in the industry. However, competition in this area is fierce, requiring a certain depth of understanding to stand out. Growing technologies are worth delving into as they are likely to become popular in the future, and early experience can lead to expertise when they reach their peak popularity. Forward-looking technologies, while potentially leading to groundbreaking innovations, may also fail. Individuals may choose to invest time and effort based on personal interests. While these technologies may be far from job requirements and practical application, forward-thinking companies may inquire about them during interviews to assess the candidate's foresight. Definitions of Technological Maturity Forward-looking: Technologies are still in the research and development stage, with the community exploring their practical applications and potential market value. Although the industry's understanding of these technologies is still shallow, high-value demands have been identified. Growing: Technologies begin to enter the practical application stage, with increasing competition in the market and parallel development of various technological paths. The community focuses on overcoming challenges in practical applications and maximizing their commercial value, although their value in business is not fully realized. Popular: Technology development reaches its peak, with the community striving to maximize technological performance. Industry attention peaks and the technology begins to demonstrate significant commercial value. Declining: Technology paths begin to show clear advantages and disadvantages, with the market demanding higher optimization and integration. The industry begins to recognize the limitations and boundaries of technology in enhancing business value. Mature: Technology paths tend to unify and standardize, with the community focusing on reducing costs and improving efficiency. The industry also focuses on cost-effectiveness analysis to evaluate the priority and breadth of technology applications. Definitions of Business Value 5 stars: The cost reduction/revenue contribution of relevant technologies/business units accounts for 50% or more of the department's total revenue, or is managed by senior directors or higher-level executives (e.g., VPs). 4 stars: The cost reduction/revenue contribution of relevant technologies/business units accounts for between 40% and 50% of the department's total revenue, or is managed by directors. 3 stars: The cost reduction/revenue contribution of relevant technologies/business units accounts for between 30% and 40% of the department's total revenue, or is managed by senior managers. 2 stars: The cost reduction/revenue contribution of relevant technologies/business units accounts for between 20% and 30% of the department's total revenue, or is managed by managers. 
1 star: The cost reduction/revenue contribution of relevant technologies/business units accounts for between 5% and 20% of the department's total revenue, or is managed by supervisors. Definitions of Technological Difficulty 5 stars: Invested in top industry expert teams for over 12 months 4 stars: Invested in industry experts or senior architects for over 12 months 3 stars: Invested in architect teams for approximately 6 months 2 stars: Invested in senior programmer teams for 1-3 months 1 star: Invested in ordinary programmer teams for 1-3 months
In today's security landscape, OAuth2 has become a standard for securing APIs, providing a more robust and flexible approach than basic authentication. My journey into this domain began with a critical solution architecture decision: migrating from basic authentication to OAuth2 client credentials for obtaining access tokens. While Spring Security offers strong support for both authentication methods, I encountered a significant challenge. I could not find a declarative approach that seamlessly integrated basic authentication and JWT authentication within the same application. This gap in functionality motivated me to explore and develop a solution that not only meets the authentication requirements but also supports comprehensive integration testing. This article shares my findings and provides a detailed guide on setting up Keycloak, integrating it with Spring Security and Spring Boot, and utilizing the Spock Framework for repeatable integration tests. By the end of this article, you will clearly understand how to configure and test your authentication mechanisms effectively with Keycloak as an identity provider, ensuring a smooth transition to OAuth2 while maintaining the flexibility to support basic authentication where necessary. Prerequisites Before you begin, ensure you have met the following requirements: You have installed Java 21. You have a basic understanding of Maven and Java. This is the parent project for the taptech-code-accelerator modules. It manages common dependencies and configurations for all the child modules. You can get it from here taptech-code-accelerator. Building taptech-code-accelerator To build the taptech-code-accelerator project, follow these steps: git clone the project from the repository: git clone https://github.com/glawson6/taptech-code-accelerator.git Open a terminal and change the current directory to the root directory of the taptech-code-accelerator project. cd path/to/taptech-code-accelerator Run the following command to build the project: ./build.sh This command cleans the project, compiles the source code, runs any tests, packages the compiled code into a JAR or WAR file, and installs the packaged code in your local Maven repository. It also builds the local Docker image that will be used to run later. Please ensure you have the necessary permissions to execute these commands. Keycloak Initial Setup Setting up Keycloak for integration testing involves several steps. This guide will walk you through creating a local environment configuration, starting Keycloak with Docker, configuring realms and clients, verifying the setup, and preparing a PostgreSQL dump for your integration tests. Step 1: Create a local.env File First, navigate to the taptech-common/src/test/resources/docker directory. Create a local.env file to store environment variables needed for the Keycloak service. Here's an example of what the local.env file might look like: POSTGRES_DB=keycloak POSTGRES_USER=keycloak POSTGRES_PASSWORD=admin KEYCLOAK_ADMIN=admin KEYCLOAK_ADMIN_PASSWORD=admin KC_DB_USERNAME=keycloak KC_DB_PASSWORD=keycloak SPRING_PROFILES_ACTIVE=secure-jwk KEYCLOAK_ADMIN_CLIENT_SECRET=DCRkkqpUv3XlQnosjtf8jHleP7tuduTa IDP_PROVIDER_JWKSET_URI=http://172.28.1.90:8080/realms/offices/protocol/openid-connect/certs Step 2: Start the Keycloak Service Next, start the Keycloak service using the provided docker-compose.yml file and the ./start-services.sh script. The docker-compose.yml file should define the Keycloak and PostgreSQL services. 
version: '3.8' services: postgres: image: postgres volumes: - postgres_data:/var/lib/postgresql/data #- ./dump:/docker-entrypoint-initdb.d environment: POSTGRES_DB: keycloak POSTGRES_USER: ${KC_DB_USERNAME} POSTGRES_PASSWORD: ${KC_DB_PASSWORD} networks: node_net: ipv4_address: 172.28.1.31 keycloak: image: quay.io/keycloak/keycloak:23.0.6 command: start #--import-realm environment: KC_HOSTNAME: localhost KC_HOSTNAME_PORT: 8080 KC_HOSTNAME_STRICT_BACKCHANNEL: false KC_HTTP_ENABLED: true KC_HOSTNAME_STRICT_HTTPS: false KC_HEALTH_ENABLED: true KEYCLOAK_ADMIN: ${KEYCLOAK_ADMIN} KEYCLOAK_ADMIN_PASSWORD: ${KEYCLOAK_ADMIN_PASSWORD} KC_DB: postgres KC_DB_URL: jdbc:postgresql://172.28.1.31/keycloak KC_DB_USERNAME: ${KC_DB_USERNAME} KC_DB_PASSWORD: ${KC_DB_PASSWORD} ports: - 8080:8080 volumes: - ./realms:/opt/keycloak/data/import restart: always depends_on: - postgres networks: node_net: ipv4_address: 172.28.1.90 volumes: postgres_data: driver: local networks: node_net: ipam: driver: default config: - subnet: 172.28.0.0/16 Then, use the ./start-services.sh script to start the services. Step 3: Access Keycloak Admin Console Once Keycloak has started, log in to the admin console at http://localhost:8080 using the configured admin username and password (default is admin/admin). Step 4: Create a Realm and Client Create a Realm: Log in to the Keycloak admin console. In the left-hand menu, click on "Add Realm". Enter the name of the realm (e.g., offices) and click "Create". Create a Client: Select your newly created realm from the left-hand menu. Click on "Clients" in the left-hand menu. Click on "Create" in the right-hand corner. Enter the client ID (e.g., offices), choose openid-connect as the client protocol, and click "Save." Extract the admin-cli Client Secret: Follow directions in the doc EXTRACTING-ADMIN-CLI-CLIENT-SECRET.md to extract the admin-cli client secret. Save the client secret for later use. Step 5: Verify the Setup With HTTP Requests To verify the setup, you can use HTTP requests to obtain tokens. Get an access token using the admin credentials configured above: http -a admin-cli:[client secret] --form POST http://localhost:8080/realms/master/protocol/openid-connect/token grant_type=password username=admin password=admin Step 6: Create a PostgreSQL Dump After verifying the setup, create a PostgreSQL dump of the Keycloak database to use for seeding the database during integration tests. Create the dump: docker exec -i docker-postgres-1 /bin/bash -c "PGPASSWORD=keycloak pg_dump --username keycloak keycloak" > dump/keycloak-dump.sql Save the file: Save the keycloak-dump.sql file locally. This file will be used to seed the database for integration tests. Following these steps, you will have a Keycloak instance configured and ready for integration testing with Spring Security and the Spock Framework. Spring Security and Keycloak Integration Tests This section will set up integration tests for Spring Security and Keycloak using Spock and Testcontainers. This involves configuring dependencies, setting up Testcontainers for Keycloak and PostgreSQL, and creating a base class to hold the necessary configurations. Step 1: Add Dependencies First, add the necessary dependencies to your pom.xml file. Ensure that Spock, Testcontainers for Keycloak and PostgreSQL, and other required libraries are included (see the pom.xml files in the taptech-code-accelerator repository). Step 2: Create the Base Test Class Create a base class to hold the configuration for your integration tests.
Spring Security and Keycloak Integration Tests

This section sets up integration tests for Spring Security and Keycloak using Spock and Testcontainers. This involves configuring dependencies, setting up Testcontainers for Keycloak and PostgreSQL, and creating a base class to hold the necessary configurations.

Step 1: Add Dependencies

First, add the necessary dependencies to your pom.xml file. Ensure that Spock, Testcontainers for Keycloak and PostgreSQL, and the other required libraries are included (check here).

Step 2: Create the Base Test Class

Create a base class to hold the configuration for your integration tests.

Groovy
package com.taptech.common.security.keycloak

import com.taptech.common.security.user.InMemoryUserContextPermissionsService
import com.fasterxml.jackson.databind.ObjectMapper
import dasniko.testcontainers.keycloak.KeycloakContainer
import org.keycloak.admin.client.Keycloak
import org.slf4j.Logger
import org.slf4j.LoggerFactory
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.testcontainers.containers.Network
import org.testcontainers.containers.PostgreSQLContainer
import org.testcontainers.containers.output.Slf4jLogConsumer
import org.testcontainers.containers.wait.strategy.ShellStrategy
import org.testcontainers.utility.DockerImageName
import org.testcontainers.utility.MountableFile
import spock.lang.Shared
import spock.lang.Specification
import spock.mock.DetachedMockFactory

import java.time.Duration
import java.time.temporal.ChronoUnit

class BaseKeyCloakInfraStructure extends Specification {

    private static final Logger logger = LoggerFactory.getLogger(BaseKeyCloakInfraStructure.class);

    static String jdbcUrlFormat = "jdbc:postgresql://%s:%s/%s"
    static String keycloakBaseUrlFormat = "http://%s:%s"

    public static final String OFFICES = "offices";
    public static final String POSTGRES_NETWORK_ALIAS = "postgres";

    @Shared
    static Network network = Network.newNetwork();

    @Shared
    static PostgreSQLContainer<?> postgres = createPostgresqlContainer()

    protected static PostgreSQLContainer createPostgresqlContainer() {
        PostgreSQLContainer container = new PostgreSQLContainer<>("postgres")
                .withNetwork(network)
                .withNetworkAliases(POSTGRES_NETWORK_ALIAS)
                .withCopyFileToContainer(MountableFile.forClasspathResource("postgres/keycloak-dump.sql"),
                        "/docker-entrypoint-initdb.d/keycloak-dump.sql")
                .withUsername("keycloak")
                .withPassword("keycloak")
                .withDatabaseName("keycloak")
                .withLogConsumer(new Slf4jLogConsumer(logger))
                .waitingFor(new ShellStrategy()
                        .withCommand("psql -q -o /dev/null -c \"SELECT 1\" -d keycloak -U keycloak")
                        .withStartupTimeout(Duration.of(60, ChronoUnit.SECONDS)))

        return container
    }

    public static final DockerImageName KEYCLOAK_IMAGE = DockerImageName.parse("bitnami/keycloak:23.0.5");

    @Shared
    public static KeycloakContainer keycloakContainer;

    @Shared
    static String adminCC = "admin@cc.com"

    def setup() {}    // run before every feature method

    def cleanup() {}  // run after every feature method

    def setupSpec() { // run before the first feature method
        postgres.start()
        String jdbcUrl = String.format(jdbcUrlFormat, POSTGRES_NETWORK_ALIAS, 5432, postgres.getDatabaseName());

        keycloakContainer = new KeycloakContainer("quay.io/keycloak/keycloak:23.0.6")
                .withNetwork(network)
                .withExposedPorts(8080)
                .withEnv("KC_HOSTNAME", "localhost")
                .withEnv("KC_HOSTNAME_PORT", "8080")
                .withEnv("KC_HOSTNAME_STRICT_BACKCHANNEL", "false")
                .withEnv("KC_HTTP_ENABLED", "true")
                .withEnv("KC_HOSTNAME_STRICT_HTTPS", "false")
                .withEnv("KC_HEALTH_ENABLED", "true")
                .withEnv("KEYCLOAK_ADMIN", "admin")
                .withEnv("KEYCLOAK_ADMIN_PASSWORD", "admin")
                .withEnv("KC_DB", "postgres")
                .withEnv("KC_DB_URL", jdbcUrl)
                .withEnv("KC_DB_USERNAME", "keycloak")
                .withEnv("KC_DB_PASSWORD", "keycloak")
        keycloakContainer.start()

        String authServerUrl = keycloakContainer.getAuthServerUrl();
        String adminUsername = keycloakContainer.getAdminUsername();
        String adminPassword = keycloakContainer.getAdminPassword();
        logger.info("Keycloak getExposedPorts: {}", keycloakContainer.getExposedPorts())

        String keycloakBaseUrl = String.format(keycloakBaseUrlFormat,
                keycloakContainer.getHost(), keycloakContainer.getMappedPort(8080));
        //String keycloakBaseUrl = "http://localhost:8080"
        logger.info("Keycloak authServerUrl: {}", authServerUrl)
        logger.info("Keycloak URL: {}", keycloakBaseUrl)
        logger.info("Keycloak adminUsername: {}", adminUsername)
        logger.info("Keycloak adminPassword: {}", adminPassword)
        logger.info("JDBC URL: {}", jdbcUrl)

        System.setProperty("spring.datasource.url", jdbcUrl)
        System.setProperty("spring.datasource.username", postgres.getUsername())
        System.setProperty("spring.datasource.password", postgres.getPassword())
        System.setProperty("spring.datasource.driverClassName", "org.postgresql.Driver");
        System.setProperty("POSTGRES_URL", jdbcUrl)
        System.setProperty("POSRGRES_USER", postgres.getUsername())
        System.setProperty("POSRGRES_PASSWORD", postgres.getPassword());
        System.setProperty("idp.provider.keycloak.base-url", authServerUrl)
        System.setProperty("idp.provider.keycloak.admin-client-secret", "DCRkkqpUv3XlQnosjtf8jHleP7tuduTa")
        System.setProperty("idp.provider.keycloak.admin-client-id", KeyCloakConstants.ADMIN_CLI)
        System.setProperty("idp.provider.keycloak.admin-username", adminUsername)
        System.setProperty("idp.provider.keycloak.admin-password", adminPassword)
        System.setProperty("idp.provider.keycloak.default-context-id", OFFICES)
        System.setProperty("idp.provider.keycloak.client-secret", "x9RIGyc7rh8A4w4sMl8U5rF3HuNm2wOC3WOD")
        System.setProperty("idp.provider.keycloak.client-id", OFFICES)
        System.setProperty("idp.provider.keycloak.token-uri", "/realms/offices/protocol/openid-connect/token")
        System.setProperty("idp.provider.keycloak.jwkset-uri", authServerUrl + "/realms/offices/protocol/openid-connect/certs")
        System.setProperty("idp.provider.keycloak.issuer-url", authServerUrl + "/realms/offices")
        System.setProperty("idp.provider.keycloak.admin-token-uri", "/realms/master/protocol/openid-connect/token")
        System.setProperty("idp.provider.keycloak.user-uri", "/admin/realms/{realm}/users")
        System.setProperty("idp.provider.keycloak.use-strict-jwt-validators", "false")
    }

    def cleanupSpec() { // run after the last feature method
        keycloakContainer.stop()
        postgres.stop()
    }

    @Autowired
    Keycloak keycloak

    @Autowired
    KeyCloakAuthenticationManager keyCloakAuthenticationManager

    @Autowired
    InMemoryUserContextPermissionsService userContextPermissionsService

    @Autowired
    KeyCloakManagementService keyCloakService

    @Autowired
    KeyCloakIdpProperties keyCloakIdpProperties

    @Autowired
    KeyCloakJwtDecoderFactory keyCloakJwtDecoderFactory

    def test_config() {
        expect:
        keycloak != null
        keyCloakAuthenticationManager != null
        keyCloakService != null
    }

    static String basicAuthCredsFrom(String s1, String s2) {
        return "Basic " + toBasicAuthCreds(s1, s2);
    }

    static toBasicAuthCreds(String s1, String s2) {
        return Base64.getEncoder().encodeToString((s1 + ":" + s2).getBytes());
    }

    @Configuration
    @EnableKeyCloak
    public static class TestConfig {

        @Bean
        ObjectMapper objectMapper() {
            return new ObjectMapper();
        }

        DetachedMockFactory mockFactory = new DetachedMockFactory()
    }
}

The BaseKeyCloakInfraStructure class stands up the entire PostgreSQL and Keycloak environment. Its createPostgresqlContainer() method sets up the PostgreSQL test container, configuring the network settings, username, password, and database name. One of the key steps in this method is the use of a PostgreSQL dump file to populate the database with initial data. This is done using the withCopyFileToContainer() method, which copies a file from the classpath to a specified location within the container. (If you have problems starting the containers, you might need to restart the Docker Compose services and re-extract the client secret; this is explained in EXTRACTING-ADMIN-CLI-CLIENT-SECRET.md.) The code snippet for this is:

.withCopyFileToContainer(MountableFile.forClasspathResource("postgres/keycloak-dump.sql"), "/docker-entrypoint-initdb.d/keycloak-dump.sql")
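As a side note on the seeding approach: for plain-SQL seed files, Testcontainers' JdbcDatabaseContainer also exposes withInitScript(), which executes a classpath script over JDBC once the container is ready. The sketch below is only an alternative illustration with a hypothetical script name, not part of this project; because withInitScript() splits the script into individual statements, it may not cope with everything a full pg_dump emits (for example COPY ... FROM stdin blocks), which is why copying the dump into docker-entrypoint-initdb.d, as above, is the more robust choice here.

Java
import org.testcontainers.containers.PostgreSQLContainer;

class SeededPostgresExample {

    // "postgres/schema-only.sql" is a hypothetical plain-SQL file on the test classpath.
    static final PostgreSQLContainer<?> POSTGRES = new PostgreSQLContainer<>("postgres")
            .withDatabaseName("keycloak")
            .withUsername("keycloak")
            .withPassword("keycloak")
            // Executed statement-by-statement over JDBC after startup,
            // instead of being copied into /docker-entrypoint-initdb.d.
            .withInitScript("postgres/schema-only.sql");
}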
Step 3: Extend the Base Class and Run Your Tests

Groovy
package com.taptech.common.security.token

import com.taptech.common.EnableCommonConfig
import com.taptech.common.security.keycloak.BaseKeyCloakInfraStructure
import com.taptech.common.security.keycloak.EnableKeyCloak
import com.taptech.common.security.keycloak.KeyCloakAuthenticationManager
import com.taptech.common.security.user.UserContextPermissions
import com.taptech.common.security.utils.SecurityUtils
import com.fasterxml.jackson.databind.ObjectMapper
import org.slf4j.Logger
import org.slf4j.LoggerFactory
import org.springframework.beans.factory.annotation.Autowired
import org.springframework.boot.test.autoconfigure.web.reactive.WebFluxTest
import org.springframework.context.annotation.Bean
import org.springframework.context.annotation.Configuration
import org.springframework.security.oauth2.client.registration.InMemoryReactiveClientRegistrationRepository
import org.springframework.test.context.ContextConfiguration
import org.springframework.test.web.reactive.server.EntityExchangeResult
import org.springframework.test.web.reactive.server.WebTestClient
import spock.mock.DetachedMockFactory
import org.springframework.boot.autoconfigure.security.reactive.ReactiveSecurityAutoConfiguration

@ContextConfiguration(classes = [TestApiControllerConfig.class])
@WebFluxTest(/*controllers = [TokenApiController.class],*/
        properties = [
                "spring.main.allow-bean-definition-overriding=true",
                "openapi.token.base-path=/",
                "idp.provider.keycloak.initialize-on-startup=true",
                "idp.provider.keycloak.initialize-realms-on-startup=false",
                "idp.provider.keycloak.initialize-users-on-startup=true",
                "spring.test.webtestclient.base-url=http://localhost:8888"
        ],
        excludeAutoConfiguration = ReactiveSecurityAutoConfiguration.class)
class TokenApiControllerTest extends BaseKeyCloakInfraStructure {

    private static final Logger logger = LoggerFactory.getLogger(TokenApiControllerTest.class);

    /*
     ./mvnw clean test -Dtest=TokenApiControllerTest
     ./mvnw clean test -Dtest=TokenApiControllerTest#test_public_validate
     */

    @Autowired
    TokenApiApiDelegate tokenApiDelegate

    @Autowired
    KeyCloakAuthenticationManager keyCloakAuthenticationManager

    @Autowired
    private WebTestClient webTestClient

    @Autowired
    TokenApiController tokenApiController

    InMemoryReactiveClientRegistrationRepository clientRegistrationRepository

    def test_configureToken() {
        expect:
        tokenApiDelegate
    }

    def test_public_jwkkeys() {
        expect:
        webTestClient.get().uri("/public/jwkKeys")
                .exchange()
                .expectStatus().isOk()
                .expectBody()
    }

    def test_public_login() {
        expect:
        webTestClient.get().uri("/public/login")
                .headers(headers -> {
                    headers.setBasicAuth(BaseKeyCloakInfraStructure.adminCC, "admin")
                })
                .exchange()
                .expectStatus().isOk()
                .expectBody()
                .jsonPath(".access_token").isNotEmpty()
                .jsonPath(".refresh_token").isNotEmpty()
    }

    def test_public_login_401() {
        expect:
        webTestClient.get().uri("/public/login")
                .headers(headers -> {
                    headers.setBasicAuth(BaseKeyCloakInfraStructure.adminCC, "bad")
                })
                .exchange()
                .expectStatus().isUnauthorized()
    }

    def test_public_refresh_token() {
        given:
        def results = keyCloakAuthenticationManager
                .passwordGrantLoginMap(BaseKeyCloakInfraStructure.adminCC, "admin", OFFICES)
                .toFuture().join()
        def refreshToken = results.get("refresh_token")

        expect:
        webTestClient.get().uri("/public/refresh")
                .headers(headers -> {
                    headers.set("Authorization", SecurityUtils.toBearerHeaderFromToken(refreshToken))
                    headers.set("contextId", OFFICES)
                })
                .exchange()
                .expectStatus().isOk()
                .expectBody()
                .jsonPath(".access_token").isNotEmpty()
                .jsonPath(".refresh_token").isNotEmpty()
    }

    def test_public_validate() {
        given:
        def results = keyCloakAuthenticationManager
                .passwordGrantLoginMap(BaseKeyCloakInfraStructure.adminCC, "admin", OFFICES)
                .toFuture().join()
        def accessToken = results.get("access_token")

        expect:
        EntityExchangeResult<UserContextPermissions> entityExchangeResult = webTestClient.get().uri("/public/validate")
                .headers(headers -> {
                    headers.set("Authorization", SecurityUtils.toBearerHeaderFromToken(accessToken))
                })
                .exchange()
                .expectStatus().isOk()
                .expectBody(UserContextPermissions.class)
                .returnResult()
        logger.info("entityExchangeResult: {}", entityExchangeResult.getResponseBody())
    }

    @Configuration
    @EnableCommonConfig
    @EnableKeyCloak
    @EnableTokenApi
    public static class TestApiControllerConfig {

        @Bean
        ObjectMapper objectMapper() {
            return new ObjectMapper();
        }

        DetachedMockFactory mockFactory = new DetachedMockFactory()
    }
}

Conclusion

With this setup, you have configured Testcontainers to run Keycloak and PostgreSQL within a Docker network, seeded the PostgreSQL database with a dump file, and created a base test class that manages the lifecycle of these containers. You can now write integration tests that extend this base class to ensure your Spring Security configuration works correctly with Keycloak.