Observability and Performance
The dawn of observability across the software ecosystem has fully disrupted standard performance monitoring and management. Enhancing these approaches with sophisticated, data-driven, and automated insights allows your organization to better identify anomalies and incidents across applications and wider systems. While monitoring and standard performance practices are still necessary, they now serve to complement organizations' comprehensive observability strategies. This year's Observability and Performance Trend Report moves beyond metrics, logs, and traces — we dive into essential topics around full-stack observability, like security considerations, AIOps, the future of hybrid and cloud-native observability, and much more.
Efficient database management is vital for handling large datasets while maintaining optimal performance and ease of maintenance. Table partitioning in PostgreSQL is a robust method for logically dividing a large table into smaller, manageable pieces called partitions. This technique helps improve query performance, simplify maintenance tasks, and reduce storage costs. This article delves into creating and managing table partitioning in PostgreSQL, focusing on the pg_partman extension for time-based and serial-based partitioning. The types of partitions supported in PostgreSQL are discussed in detail, along with real-world use cases and practical examples to illustrate their implementation.

Introduction

Modern applications generate massive amounts of data, requiring efficient database management strategies to handle these volumes. Table partitioning is a technique where a large table is divided into smaller, logically related segments. PostgreSQL offers a robust partitioning framework to manage such datasets effectively.

Why Partitioning?

- Improved query performance. Queries can quickly skip irrelevant partitions using constraint exclusion or partition pruning.
- Simplified maintenance. Partition-specific operations such as vacuuming or reindexing can be performed on smaller datasets.
- Efficient archiving. Older partitions can be dropped or archived without impacting the active dataset.
- Scalability. Partitioning enables horizontal scaling, particularly in distributed environments.

Native vs Extension-Based Partitioning

PostgreSQL's native declarative partitioning simplifies many aspects of partitioning, while extensions like pg_partman provide additional automation and management capabilities, particularly for dynamic use cases.

Native Partitioning vs pg_partman

Feature         | Native Partitioning      | pg_partman
Automation      | Limited                  | Comprehensive
Partition Types | Range, List, Hash        | Time, Serial (advanced)
Maintenance     | Manual scripts required  | Automated
Ease of Use     | Requires SQL expertise   | Simplified

Types of Table Partitioning in PostgreSQL

PostgreSQL supports three primary partitioning strategies: Range, List, and Hash. Each has unique characteristics suitable for different use cases.

Range Partitioning

Range partitioning divides a table into partitions based on a range of values in a specific column, often a date or numeric column.

Example: Monthly sales data

SQL

CREATE TABLE sales (
    sale_id SERIAL,
    sale_date DATE NOT NULL,
    amount NUMERIC
) PARTITION BY RANGE (sale_date);

CREATE TABLE sales_2023_01 PARTITION OF sales
    FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');

Advantages

- Efficient for time-series data like logs or transactions
- Supports sequential queries, such as retrieving data for specific months

Disadvantages

- Requires predefined ranges, which may lead to frequent schema updates

List Partitioning

List partitioning divides data based on a discrete set of values, such as regions or categories.

Example: Regional orders

SQL

CREATE TABLE orders (
    order_id SERIAL,
    region TEXT NOT NULL,
    amount NUMERIC
) PARTITION BY LIST (region);

CREATE TABLE orders_us PARTITION OF orders FOR VALUES IN ('US');
CREATE TABLE orders_eu PARTITION OF orders FOR VALUES IN ('EU');

Advantages

- Ideal for datasets with a finite number of categories (e.g., regions, departments)
- Straightforward to manage for a fixed set of partitions

Disadvantages

- Not suitable for dynamic or expanding categories

Hash Partitioning

Hash partitioning distributes rows across a set of partitions using a hash function. This ensures an even distribution of data.
Example: User accounts

SQL

CREATE TABLE users (
    user_id SERIAL,
    username TEXT NOT NULL
) PARTITION BY HASH (user_id);

CREATE TABLE users_partition_0 PARTITION OF users
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);

Advantages

- Ensures balanced distribution across partitions, preventing hotspots
- Suitable for evenly spread workloads

Disadvantages

- Not human-readable; partitions cannot be identified intuitively

pg_partman: A Comprehensive Guide

pg_partman is a PostgreSQL extension that simplifies partition management, particularly for time-based and serial-based datasets.

Installation and Setup

pg_partman requires installation as an extension in PostgreSQL. It provides a suite of functions to create and manage partitioned tables dynamically.

Install using your package manager:

Shell

sudo apt-get install postgresql-pg-partman

Create the extension in your database:

SQL

CREATE EXTENSION pg_partman;

Configuring Partitioning

pg_partman supports time-based and serial-based partitioning, which are particularly useful for datasets with temporal data or sequential identifiers.

Time-Based Partitioning Example

SQL

CREATE TABLE logs (
    id SERIAL,
    log_time TIMESTAMP NOT NULL,
    message TEXT
);

SELECT partman.create_parent(
    p_parent_table := 'public.logs',
    p_control := 'log_time',
    p_type := 'time',
    p_interval := 'daily'
);

This configuration:

- Automatically creates daily partitions
- Simplifies querying and maintenance for log data

Serial-Based Partitioning Example

SQL

CREATE TABLE transactions (
    transaction_id BIGSERIAL PRIMARY KEY,
    details TEXT NOT NULL
);

SELECT partman.create_parent(
    p_parent_table := 'public.transactions',
    p_control := 'transaction_id',
    p_type := 'serial',
    p_interval := 100000
);

This creates partitions every 100,000 rows, ensuring the parent table remains manageable.

Automation Features

Automatic Maintenance

Use run_maintenance() to ensure future partitions are pre-created:

SQL

SELECT partman.run_maintenance();

Retention Policies

Define retention periods to drop old partitions automatically:

SQL

UPDATE partman.part_config
SET retention = '12 months'
WHERE parent_table = 'public.logs';

Advantages of pg_partman

- Simplifies dynamic partition creation
- Automates cleanup and maintenance
- Reduces the need for manual schema updates

Practical Use Cases for Table Partitioning

- Log management. High-frequency logs partitioned by day for easy archival and querying.
- Multi-regional data. E-commerce systems dividing orders by region for improved scalability.
- Time-series data. IoT applications with partitioned telemetry data.

Log Management

Partition logs by day or month to manage high-frequency data efficiently.

SQL

SELECT partman.create_parent(
    p_parent_table := 'public.server_logs',
    p_control := 'timestamp',
    p_type := 'time',
    p_interval := 'monthly'
);

Multi-Regional Data

Partition sales or inventory data by region for better scalability.

SQL

CREATE TABLE sales (
    sale_id SERIAL,
    region TEXT NOT NULL
) PARTITION BY LIST (region);

High-Volume Transactions

Partition transactions by serial ID to avoid bloated indexes.

SQL

SELECT partman.create_parent(
    p_parent_table := 'public.transactions',
    p_control := 'transaction_id',
    p_type := 'serial',
    p_interval := 10000
);

Conclusion

Table partitioning is an indispensable technique for managing large datasets. PostgreSQL’s built-in features, combined with the pg_partman extension, make implementing dynamic and automated partitioning strategies easier. These tools allow database administrators to enhance performance, simplify maintenance, and scale effectively.
Partitioning is a cornerstone for modern database management, especially in high-volume applications. Understanding and applying these concepts ensures robust and scalable database systems.
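As a final practical note, the automation described above still needs something to invoke it on a schedule. Below is a minimal Python sketch, assuming psycopg2 is installed and the DSN placeholder is replaced with your own connection string; a cron job or any scheduler could run it daily alongside the retention settings shown earlier.

Python

# Minimal sketch: trigger pg_partman maintenance from a scheduled job.
# Assumptions: psycopg2 is installed, pg_partman is already configured for the
# parent tables, and DSN is a placeholder for your real connection string.
import psycopg2

DSN = "dbname=mydb user=app password=secret host=db.example.com"  # placeholder

def run_partman_maintenance():
    # Pre-creates upcoming partitions and applies retention for every parent
    # table registered in partman.part_config.
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT partman.run_maintenance();")
    conn.close()

if __name__ == "__main__":
    run_partman_maintenance()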
Aspect-oriented programming (AOP) is a programming paradigm that enables the modularisation of concerns that cut across multiple types and objects. It provides additional behavior to existing code without modifying the code itself. AOP can solve many problems in a graceful way that is easy to maintain. One such common problem is adding some new behavior to a controller (@Controller) so that it works “outside” the main logic of the controller. In this article, we will look at how to use AOP to add logic when an application returns a successful response (HTTP 200). An entity should be deleted after it is returned to a client. This can relate to applications that, for some reason (e.g., legal), cannot store data for a long time and should delete it once it is processed. We will be using AspectJ in the Spring application. AspectJ is an implementation of AOP for Java and has good integration with Spring. Before that, you can find more about AOP in Spring here. Possible Solutions To achieve our goal and delete an entity after the logic in the controller was executed we can use several approaches. We can implement an interceptor (HandlerInterceptor) or a filter (OncePerRequestFilter). Spring components can be leveraged to work with HTTP requests and responses. This requires some studying and understanding this part of Spring. Another way to solve the problem is to use AOP and its implementation in Java — AspectJ. AOP provides a possibility to reach the solution in a laconic way that is very easy to implement and maintain. It allows you to avoid digging into Spring implementation to solve this trivial task. AOP is a middleware solution and complements Spring. Implementation Let’s say we have a CardInfo entity that contains sensitive information that we cannot store for a long time in the database, and we are obliged to delete the entity after we process it. For simplicity, by processing, we will understand just returning the data to a client that makes a REST request to our application. We want the entity to be deleted right after it was successfully read with a GET request. We need to create a Spring Component and annotate it with @Aspect. Java @Aspect @Component @RequiredArgsConstructor @ConditionalOnExpression("${aspect.cardRemove.enabled:false}") public class CardRemoveAspect { private final CardInfoRepository repository; @Pointcut("execution(* com.cards.manager.controllers.CardController.getCard(..)) && args(id)") public void cardController(String id) { } @AfterReturning(value = "cardController(id)", argNames = "id") public void deleteCard(String id) { repository.deleteById(id); } } @Component – marks the class as a Spring component so that it can be managed by the Spring IoC.@Aspect – indicates that this class is an aspect. It is automatically detected by Spring and used to configure Spring AOP.@Pointcut – indicates a predicate that matches join points (points during the execution of a program).execution() – represents the execution of any method within the defined package (in our case the exact method name was set).@AfterReturning – advice to be run after a join point completes normally (without throwing an exception). I also annotated the class with @ConditionalOnExpression to be able to switch on/off this functionality from properties. This small piece of code with a couple of one-liner methods does the job that we are interested in. The cardController(String id) method defines the exact place/moment where the logic defined in the deleteCard(String id) method is executed. 
In our case, it is the getCard() method in the CardController class that is placed in com.cards.manager.controllers package. deleteCard(String id) contains the logic of the advice. In this case, we call CardInfoRepository to delete the entity by id. Since CardRemoveAspect is a Spring Component, one can easily inject other components into it. Java @Repository public interface CardInfoRepository extends CrudRepository<CardInfoEntity, String> { } @AfterReturning shows that the logic should be executed after a successful exit from the method defined in cardController(String id). CardController looks as follows: Java @RestController @RequiredArgsConstructor @RequestMapping( "/api/cards") public class CardController { private final CardService cardService; private final CardInfoConverter cardInfoConverter; @GetMapping("/{id}") public ResponseEntity<CardInfoResponseDto> getCard(@PathVariable("id") String id) { return ResponseEntity.ok(cardInfoConverter.toDto(cardService.getCard(id))); } } Conclusion AOP represents a very powerful approach to solving many problems that would be hard to achieve without it or difficult to maintain. It provides a convenient way to work with and around the web layer without the necessity to dig into Spring configuration details. To view the full example application where AOP was used, as shown in this article, read my other article on creating a service for sensitive data using Spring and Redis. The source code of the full version of this service is available on GitHub.
AWS Database Migration Service is a cloud service that migrates relational databases, NoSQL databases, data warehouses, and all other types of data stores into AWS Cloud or between cloud and on-premises setups efficiently and securely. DMS supports several types of source and target databases such as Oracle, MS SQL Server, MySQL, Postgres SQL, Amazon Aurora, AWS RDS, Redshift, and S3, etc. Observations During the Data Migration We worked on designing and creating an AWS S3 data lake and data warehouse in AWS Redshift with the data sources from on-premises for Oracle, MS SQL Server, MySQL, Postgres SQL, and MongoDB for relational databases. We used AWS DMS for the initial full load and daily incremental data transfer from these sources into AWS S3. With this series of posts, I want to explain the various challenges faced during the actual data migration with different relational databases. 1. Modified Date Not Populated Properly at the Source AWS DMS is used for full load and change data capture from source databases. AWS DMS captures changed records based on the transaction logs, but a modified date column updated properly can help to apply deduplication logic, and extract the latest modified record for a given row on the target in S3. In case modified data is not available for a table or it is not updated properly, AWS DMS provides an option of transformation rules to add a new column while extracting data from the source database. Here, the AR_H_CHANGE_SEQ header helps to add a new column with value as a unique incrementing number from the source database, which consists of a timestamp and an auto-incrementing number. The below code example adds a new column as DMS_CHANGE_SEQ to the target, which has a unique incrementing number from the source. This is a 35-digit unique number with the first 16 digits for the timestamp and the next 19 digits for the record ID number incremented by the database. JSON { "rule-type": "transformation", "rule-id": "2", "rule-name": "2", "rule-target": "column", "object-locator": { "schema-name": "%", "table-name": "%" }, "rule-action": "add-column", "value": "DMS_CHANGE_SEQ", "expression": "$AR_H_CHANGE_SEQ", "data-type": { "type": "string", "length": 100 } } 2. Enabling Supplemental Logging for Oracle as a Source For Oracle as a source database, to capture ongoing changes, AWS DMS needs minimum supplemental logging to be enabled on the source database. Accordingly, this will include additional information and columns in the redo logs to identify the changes at the source. Supplemental logging can be enabled for primary, unique keys, sets of columns, or all the columns. Supplemental logging for all columns captures all the columns for the tables in the source database and helps to overwrite the complete records in the target AWS S3 layer. Supplemental logging of all columns will increase the redo logs size, as all the columns for the table are logged into the logs. One needs to configure, redo, and archive logs accordingly to consider additional information in them. 3. Network Bandwidth Between Source and Target Databases Initial full load from the on-premises sources for Oracle, MS SQL Server, etc., worked fine and changed data capture, too, for most of the time. There used to be a moderate number of transactions most of the time of the day in a given month, except for the end-of-business-day process, daily, post-midnight, and month-end activities. We observed DMS migration tasks were out of sync or failed during this time. 
We reviewed the source, target, and replication instance metrics in the logs and made the following observations:

- CDCIncomingChanges – the total number of change events at a point in time that are waiting to be applied to the target. This increased from zero to thousands during reconciliation activities in the early morning.
- CDCLatencySource – the gap, in seconds, between the last event captured from the source endpoint and the current system timestamp of the AWS DMS instance. This increased from zero to a few thousand, up to 10-12K seconds, during daily post-midnight reconciliation activities, and reached up to 40K during month-end activities.

Upon further log analysis and review of other metrics, we observed that the AWS DMS metric NetworkReceiveThroughput reflects the incoming traffic on the DMS replication instance for both customer database and DMS traffic. This metric helps identify network-related issues, if any, between the source database and the DMS replication instance. The network receive throughput peaked at about 30 MB/s (roughly 250 Mb/s) because the VPN connection between the source and AWS was shared with other applications.

The conclusion for this issue is that connectivity between the source and target databases is critical for successful data migration:

- Ensure sufficient bandwidth between on-premises (or other cloud) source databases and the AWS environment before the actual data migration. A VPN tunnel such as AWS Site-to-Site VPN or Oracle Cloud Infrastructure (OCI) Site-to-Site VPN can provide throughput of up to 1.25 Gbps, which is sufficient for migrating small tables or tables with light DML traffic.
- For large data migrations with heavy transactions per second on the tables, consider AWS Direct Connect. It provides a dedicated private connection with 1 Gbps, 10 Gbps, and higher bandwidth options.

Conclusion

This is Part I of a multi-part series on relational database migration challenges with AWS DMS and the solutions we implemented. Most of the challenges described in this series can occur during any database migration, and these solutions can serve as a reference.
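As a closing practical note, the CloudWatch metrics reviewed above can also be polled programmatically, which makes it easier to alert before a task falls too far behind. The following is a minimal boto3 sketch; the region, replication instance identifier, task identifier, and the 30-minute threshold are assumptions you would adapt to your environment (the task dimension expects the task's external resource ID).

Python

# Minimal sketch: poll the CDCLatencySource metric for a DMS task with boto3.
# Assumptions: boto3 is installed and credentials are configured; the dimension
# values below are placeholders for your replication instance and task.
from datetime import datetime, timedelta, timezone
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # region is an assumption

def latest_cdc_latency_source(replication_instance_id, task_external_id):
    # Returns the most recent 5-minute average of CDCLatencySource, in seconds.
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace="AWS/DMS",
        MetricName="CDCLatencySource",
        Dimensions=[
            {"Name": "ReplicationInstanceIdentifier", "Value": replication_instance_id},
            {"Name": "ReplicationTaskIdentifier", "Value": task_external_id},
        ],
        StartTime=now - timedelta(hours=1),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    points = sorted(resp["Datapoints"], key=lambda p: p["Timestamp"])
    return points[-1]["Average"] if points else 0.0

# Example: flag the task if it is more than 30 minutes behind the source.
if latest_cdc_latency_source("my-dms-instance", "my-task-external-id") > 1800:
    print("CDC latency above threshold: check network bandwidth or source load")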
Microsoft CEO Satya Nadella recently announced that GitHub Copilot is now free for all developers in VSCode. This is a game-changer in the software development industry. Github Copilot is an AI code assistant that helps developers finish their coding tasks easily and quickly. It also helps suggest code snippets and autocomplete functions. In this article, we will learn how to use GitHub Copilot using VSCode in a step-by-step manner for creating the first stateless flask microservice. This is a beginner-friendly guide showcasing how Copilot helps reduce the development time and simplify the process. Setting Up the Environment Locally As our primary focus will be on GitHub Copilot, I will write a high level on the software installation needed. If any installation issues are seen, it is expected that readers would have to solve them locally or comment in this article, where I can try to help. 1. Install Visual Studio Code on Mac or Windows from VSCode (In my examples, I used Mac). 2. Install GitHub Copilot extension in VSCode: Open VSCode and navigate to the Extensions view on the left, as per the below screenshot. Search for "copilot," and GitHub Copilot will appear. Click install. With this step, the Copilot extension is added to VSCode. 3. Activate the Copilot: If you do not have a GitHub account, please create one in GitHub.Back to VSCode, after installing Copilot, we can see in the Welcome tab that it will ask to sign up. Sign up using a GitHub account. Click "Chat with Copilot," and you will see the right side of VSCode, Copilot appears. Click "Chat with Copilot." We will see that the Copilot chat appears on the right-hand side of the VSCode palate. 4. Install Python in your system from Python based on Windows/Mac. Note that we are not installing Flask now; we will do it in a later step while installing the application to run. Writing the Microservice Using CoPilot 1. In VSCode, on the right side with the Copilot pallet, under "Ask Copilot," type: Create a Flask app. There are two ways we can ask Copilot. One is to create the Flask project folder with files and ask Copilot to add the code. Or, start from nothing and ask to create a Flask app. We notice that it will create a workspace for us along with all file creation, which is awesome, and the project gets created with the required files within a few seconds. Click -> Create Workspace -> give the location to save the project. The project will appear in VSCode. 2. We see that the project-created files will have routes.py, where a few default APIs are already generated. Now, we will create 2 APIs using Copilot. The first API is simple and used to greet a person. It takes the name of the person as input and out as "Hello, {name}." Open the routes.py file and add a comment as below: As soon as we hit enter, we see the code generated. Now press the tab, and we will see that the API code is generated. That's the advantage of using Copilot. Similarly, let's create another simple API that takes two integer values as input and returns the multiplication by using Copilot. This time we will try it in the right pallet of VSCode rather than in the routes.py file. Python # Create an endpoint to multiply two numbers. @main.route('/multiply') def multiply(): try: num1 = float(request.args.get('num1')) num2 = float(request.args.get('num2')) result = num1 * num2 return f'The result of {num1} * {num2} is {result}' except (TypeError, ValueError): return 'Invalid input. Please provide two numbers as query parameters.' 
However, I see that different code was generated when I asked Copilot to write the API inside the routes.py file. See below:

Python

# Create an endpoint to multiply two numbers.
@main.route('/multiply/<int:num1>/<int:num2>')
def multiply(num1, num2):
    return f'{num1} * {num2} = {num1 * num2}'

The reason is that Copilot generates code based on the previous context. When we were in the routes.py file and asked Copilot to generate the API code, it used the context that the API should take two inputs and return the output. But when we made the request in the right-hand chat panel, it generated code based on the previous question, with the context that it's a Flask app and the input will come from the request parameters. So we can safely conclude that Copilot generates its next output based on the previous context.

Now, both our APIs are ready, so let's deploy the app and test it. But we have not installed Flask yet, so let's do that.

1. Activate the virtual environment and install Flask.

Plain Text

source venv/bin/activate   # On Linux/Mac
venv\Scripts\activate      # On Windows
pip install flask

When we run the application, we see an issue at startup due to the generated code. Below is the error:

Plain Text

File "/Users/sibasispadhi/Documents/coding/my-flask-app/venv/lib/python3.12/site-packages/flask/cli.py", line 72, in find_best_app
    app = app_factory()
          ^^^^^^^^^^^^^
File "/Users/sibasispadhi/Documents/coding/my-flask-app/app/__init__.py", line 14, in create_app
    app.register_blueprint(routes.bp)
                           ^^^^^^^^^
AttributeError: module 'app.routes' has no attribute 'bp'
(venv) sibasispadhi@Sibasiss-Air my-flask-app %

The create_app function in our project's app/__init__.py file is calling app.register_blueprint(routes.bp), but the routes.py file doesn’t have bp (a Blueprint object) defined. Below are the changes made to fix the problem (the commented-out line is the autogenerated code).

Python

# Register blueprints
from . import routes
# app.register_blueprint(routes.bp)
app.register_blueprint(routes.main)

Re-running the application deploys it successfully, and we are ready to test the functionality. The APIs can be tested using Postman.

2. Testing through Postman gives the results successfully.

Conclusion

GitHub Copilot generates the project and the boilerplate code seamlessly, and it saves development time and effort. It's always advisable to review the generated code so that it matches developers' expectations. Whenever there is an error, we must debug it or ask Copilot for further suggestions to solve the problem. In this project, Copilot helped us create and run a stateless Flask microservice in no time. We faced some initial hiccups, which were solved after debugging, but overall, development was faster. I would suggest all readers start exploring Copilot today to enhance their day-to-day productivity. Stay tuned for my next set of articles on Copilot, where we will dive deeper into more real-world scenarios and see how it solves our day-to-day tasks.
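As an optional alternative to Postman, the endpoints can also be exercised from a short script. The sketch below uses Python's requests library and assumes the Flask dev server is running locally on the default port 5000 with the blueprint mounted at the root path; only whichever multiply variant you kept in routes.py will respond.

Python

# Minimal sketch: call the generated endpoints with the requests library.
# Assumptions: `pip install requests`; app running via `flask run` on port 5000.
import requests

BASE = "http://localhost:5000"  # adjust if your app runs elsewhere

# Query-parameter variant generated in the Copilot chat panel
resp = requests.get(f"{BASE}/multiply", params={"num1": 6, "num2": 7})
print(resp.status_code, resp.text)   # expected: 200 "The result of 6.0 * 7.0 is 42.0"

# Path-parameter variant generated inside routes.py
resp = requests.get(f"{BASE}/multiply/6/7")
print(resp.status_code, resp.text)   # expected: 200 "6 * 7 = 42"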
As the Trump administration revokes Executive Order 14110, the U.S. shifts toward a market-driven AI strategy, departing from the Biden administration’s regulatory framework. While proponents see this as a catalyst for innovation and economic growth, critics warn of increased risks, regulatory fragmentation, and strained transatlantic relations. With Europe reinforcing its AI Act and states like California exploring their own regulations, the future of AI governance in the U.S. remains uncertain. Will deregulation accelerate progress, or does it open the door to new challenges in ethics, security, and global cooperation? Just days after taking office, Donald Trump, the 47th President of the United States, issued a series of executive actions aimed at dismantling key initiatives from the Biden administration. Among them was the revocation of Executive Order (EO) 14110, a landmark policy that established a framework for AI governance and regulation. This decision marks a turning point in U.S. AI policy. For its supporters, it is a necessary reform; for its critics, it is a dangerous setback. While EO 14110 aimed to structure AI adoption by balancing innovation and oversight, its repeal raises critical questions about the future of AI in the United States and its global impact. Background on Executive Order 14110 Executive Order 14110 was issued on October 30, 2023, under the Biden administration. This major initiative aimed to regulate the development and deployment of artificial intelligence. Its goal was to balance innovation, security, and economic stability while ensuring that AI systems remained reliable, safe, and transparent. In the Biden administration’s vision, EO 14110 was designed to address key concerns such as algorithmic bias, misinformation, job displacement, and cybersecurity vulnerabilities. It was not intended to impose direct restrictions on the private sector but rather to establish security and ethical standards, particularly for AI used by federal agencies and in public sector contracts, while also influencing broader AI governance. From an international perspective, EO 14110 also aimed to strengthen the United States' role in global AI governance. It aligned with the European Union’s approach, particularly as the EU was developing its AI Act. The order was part of a broader transatlantic effort to establish ethical and security standards for AI. "Artificial Intelligence (AI) holds extraordinary potential for both promise and peril. Responsible AI use has the potential to help solve urgent challenges while making our world more prosperous, productive, innovative, and secure. At the same time, irresponsible use could exacerbate societal harms such as fraud, discrimination, bias, and disinformation; displace and disempower workers; stifle competition; and pose risks to national security." (EO 14110 - Section 1) EO 14110 as Part of a Broader AI Strategy: Continuity in Biden’s Policy It is important to understand that EO 14110 was not an isolated initiative. It was part of a broader strategy built on several existing frameworks and commitments. Blueprint for an AI Bill of Rights (2022). A foundational document outlining five key principles: safe and effective AI systems, protections against algorithmic discrimination, data privacy, transparency, and human alternatives. Voluntary AI Commitments (2023-2024). Major tech companies, including Google, OpenAI, and Microsoft, agreed to self-regulation measures focusing on AI transparency, security, and ethics. 
National Security AI Strategy (2024). The Biden administration made AI a priority in cybersecurity, military applications, and critical infrastructure protection. It is worth noting that even after the revocation of EO 14110, these initiatives remain in place, ensuring a degree of continuity in AI governance in the United States. Objectives and Scope of EO 14110 Executive Order 14110 pursued several strategic objectives aimed at regulating AI adoption while promoting innovation. It emphasized the security and reliability of AI systems by requiring robustness testing and risk assessments, particularly in sensitive areas such as cybersecurity and critical infrastructure. It also aimed to ensure fairness and combat bias by implementing protections against algorithmic discrimination and promoting ethical AI use in hiring, healthcare, and justice. EO 14110 included training, reskilling, and protection programs to help workers adapt to AI-driven changes. It also aimed to protect consumers by preventing fraudulent or harmful AI applications, ensuring safe and beneficial use. Finally, the executive order aimed to reinforce international cooperation, particularly with the European Union, to establish common AI governance standards. However, it’s important to note that it did not aim to regulate the entire private sector but rather to set strict ethical and security standards for AI systems used by federal agencies. Summary of EO 14110’s Key Principles To quickly get the essentials, here are the eight fundamental principles it was built on: Ensure AI is safe and secure.Promote responsible innovation and competition.Support workers affected by AI’s deployment.Advance equity and civil rights.Protect consumer interests in AI applications.Safeguard privacy and civil liberties.Enhance AI capabilities within the federal government.Promote U.S. leadership in global AI governance. Why Did the Trump Administration Revoke EO 14110? So, on January 20, 2025, the Trump administration announced the revocation of EO 14110, arguing that it restricted innovation by imposing excessive administrative constraints. The White House justified this decision as part of a broader push to deregulate the sector, boost the economy, and attract AI investment. The administration made clear its preference for a market-driven approach. According to Trump, private companies are better positioned to oversee AI development without federal intervention. Clearly, this shift marks a geopolitical turning point. The United States is moving away from a multilateral approach to assert its dominance in the AI sector. However, this revocation does not mean the end of AI regulation in the United States. Other federal initiatives, such as the NIST AI Risk Management Framework, remain in place. "Republicans support AI development rooted in free speech and human flourishing." (The 2024 Republican Party by Reuters) Consequences of the Revocation in the United States The repeal of EO 14110 has immediate effects and long-term implications. It reshapes the future of AI development in the United States. From the Trump administration’s perspective, this decision removes bureaucratic hurdles, accelerates innovation, and strengthens U.S. competitiveness in AI. Supporters argue that by reducing regulatory constraints, the repeal allows companies to move faster, lowers compliance costs, and attracts greater investment, particularly in automation and biotechnology. 
But on the other hand, without a federal framework, the risks associated with the development and use of AI technologies are increasing. Algorithmic bias, cybersecurity vulnerabilities, and the potential misuse of AI become harder to control without national oversight. Critics also warn of a weakening of worker and consumer protections, as the end of support programs could further deepen economic inequalities. In practical terms, regulation is becoming more fragmented. Without a federal framework, each state could, and likely will, develop its own AI laws, making compliance more complex for businesses operating nationwide. Some see this as an opportunity for regulatory experimentation, while others see it as a chance for opportunistic players to exploit loopholes or fear legal uncertainty and increased tensions with international partners. Impact on Europe The revocation of EO 14110 also affects global AI governance, particularly in Europe. Transatlantic relations are likely to become strained, as the growing divergence between U.S. and European approaches will make regulatory cooperation more challenging. European companies may tighten their compliance standards to maintain consumer trust, which could influence their strategic decisions. In fact, the European Union may face pressure to adjust its AI Act, although its regulatory framework remains largely independent from that of the United States. Conclusion The revocation of Executive Order 14110 is more than just a policy shift in the United States. It represents a strategic choice, favoring a deregulated model where innovation takes precedence over regulation. While this decision may help accelerate technological progress, it also leaves critical questions unanswered: Who will ensure the ethics, security, and transparency of AI in the United States? For Europe, this shift deepens the divide with the U.S. and strengthens its role as a "global regulator" through the AI Act. The European Union may find itself alone at the forefront of efforts to enforce strict AI regulations, risking a scenario where some companies favor the less restrictive U.S. market. More than a debate on regulation, this revocation raises a fundamental question: In the global AI race, should progress be pursued at all costs, or should every advancement be built on solid and ethical foundations? The choices made today will shape not only the future of the industry but also the role of democracies in the face of tech giants. One More Thing The revocation of EO 14110 highlights a broader debate: who really shapes AI policy, the government or private interests? While the U.S. moves toward deregulation, California’s AI safety bill (SB 1047) is taking the opposite approach, proposing strict oversight for advanced AI models. But as an investigation by Pirate Wires reveals, this push for regulation isn’t without controversy. Dan Hendrycks, a key advocate for AI safety, co-founded Gray Swan, a company developing compliance tools that could directly benefit from SB 1047’s mandates. This raises a crucial question: When policymakers and industry leaders are deeply intertwined, is AI regulation truly about safety, or about controlling the market? In the race to govern AI, transparency may be just as important as regulation itself.
Neon is now available on the Azure marketplace. The new integration between Neon and Azure allows you to manage your Neon subscription and billing through the Azure portal as if Neon were an Azure product. Azure serverless and Neon are a natural combination — Azure serverless frees you from managing your web server infrastructure. Neon does the same for databases, offering additional features like data branching and vector database extensions. That said, let's try out this new integration by building a URL shortener API with Neon, Azure serverless, and Node.js. Note: You should have access to a terminal, an editor like VS Code, and Node v22 or later installed. Setting Up the Infrastructure We are going to have to do things a little backward today. Instead of writing the code, we will first first set up our serverless function and database. Step 1. Open up the Azure web portal. If you don’t already have one, you will need to create a Microsoft account. Step 2. You will also have to create a subscription if you don’t have one already, which you can do in Azure. Step 3. Now, we can create a resource group to store our serverless function and database. Go to Azure's new resource group page and fill out the form like this: This is the Azure Resource Group creation page with the resource group set to "AzureNeonURLShortener" and the location set to West US 3.In general, use the location closest to you and your users, as the location will determine the default placement of serverless functions and what areas have the lowest latency. It isn’t vital in this example, but you can search through the dropdown if you want to use another. However, note that Neon doesn’t have locations in all of these regions yet, meaning you would have to place your database in a region further from your serverless function. Step 4. Click "Review & Create" at the bottom to access a configuration review page. Then click "Create" again. Step 5. Now, we can create a serverless function. Unfortunately, it includes another form. Go to the Azure Flex consumption serverless app creation page and complete the form. Use the resource group previously created, choose a unique serverless function name, place the function in your resource group region, and use Node v20. Step 6. The name you choose for your serverless app will be the subdomain Azure gives you to access your API, so choose wisely. After you finish filling everything out, click "Review and Create" and finally, "Create." Azure should redirect you to the new app's page. Now we can set up Neon. Open the new Neon Resource page on the Azure portal, and, you guessed it, fill out the form. How to Create a Neon Database on Azure Step 1. Create a new Neon resource page with "AzureURLNeonShortener" as the resource group, "URLShortenerDB" as the resource name, and "West US 3" as the location. If the area you chose isn’t available, choose the next closest region. Once you complete everything, click "Review & Create" and then "Create," as you did for previous resources. Step 2. You might have to wait a bit for the Neon database to instantiate. Once it does, open its configuration page and click "Go To Neon." Step 3. You will be redirected to a login page. Allow Neon to access your Azure information, and then you should find yourself on a project creation page. Fill out the form below: The project and database name aren't significant, but make sure to locate the database in Azure West US 3 (or whatever region you choose). 
This will prevent database queries from leaving the data center, decreasing latency. Step 4. Click "Create" at the bottom of the page, keeping the default autoscaling configuration. You should now be redirected to a Neon database page. This page has our connection string, which we will need to connect to our database from our code. Click "Copy snippet" to copy the connection string. Make sure you don’t lose this, as we will need it later, but for now, we need to structure our database. Step 5. Click “SQL Editor” on the side navigation, and paste the following SQL in: SQL CREATE TABLE IF NOT EXISTS urls(id char(12) PRIMARY KEY, url TEXT NOT NULL); Then click "Run." This will create the table we will use to store URLs. The table is pretty simple: The primary key ID is a 12 — character random string that we will use to refer to URLs, and the URL is a variable-length string that will store the URL itself. Step 6. If you look at the Table view on the side navigation, you should see a “urls” table. Finally, we need to get our connection string. Click on “Dashboard” on the side nav, find the connection string, and click “Copy snippet.” Now, we can start writing code. Building the API Step 1. First, we must install Azure’s serverless CLI, which will help us create a project and eventually test/publish it. Open a terminal and run: Plain Text npm install -g azure-functions-core-tools --unsafe-perm true Step 2. If you want to use other package managers like Yarn or pnpm, just replace npm with your preferred package manager. Now, we can start on our actual project. Open the folder you want the project to be in and run the following three commands: Plain Text func init --javascript func new --name submit --template "HTTP trigger" func new --name url --template "HTTP trigger" npm install nanoid @neondatabase/serverless Now, you should see a new Azure project in that folder. The first command creates the project, the two following commands create our serverless API routes, and the final command installs the Neon serverless driver for interfacing with our database and Nano ID for generating IDs. We could use a standard Postgres driver instead of the Neon driver, but Neon’s driver uses stateless HTTP queries to reduce latency for one-off queries. Because we are running a serverless function that might only process one request and send one query, one-off query latency is important. You will want to focus on the code in src/functions, as that is where our routes are. You should see two files there: submit.js and redirect.js. submit.js submit.js will store the code we use to submit URLs. First, open submit.js and replace its code with the following: TypeScript import { app } from "@azure/functions"; import { neon } from "@neondatabase/serverless"; import { nanoid } from "nanoid"; const sql = neon("[YOUR_POSTGRES_CONNECTION_STRING]"); app.http("submit", { methods: ["GET"], authLevel: "anonymous", route: "submit", handler: async (request, context) => { if (!request.query.get("url")) return { body: "No url provided", status: 400, }; if (!URL.canParse(request.query.get("url"))) return { body: "Error parsing url", status: 400, }; const id = nanoid(12); await sql`INSERT INTO urls(id,url) VALUES (${id},${request.query.get( "url" )})`; return new Response(`Shortened url created with id ${id}`); }, }); Let’s break this down step by step. First, we import the Azure functions API, Neon serverless driver, and Nano ID. We are using ESM (ES Modules) here instead of CommonJS. 
We will need to make a few changes later on to support this.

Next, we create the connection to our database. Replace [YOUR_POSTGRES_CONNECTION_STRING] with the string you copied from the Neon dashboard. For security reasons, you would likely want to use a service like Azure Key Vault to manage your keys in a production environment, but for now, just placing them in the script will do.

Now, we are at the actual route. The first few properties define when our route handler should be triggered: we want this route to be triggered by a GET request to submit.

Our handler is pretty simple. We first check if a URL has been passed through the url query parameter (e.g., /submit?url=https://google.com), then we check whether it is a valid URL via the new URL.canParse API. Next, we generate the ID with Nano ID. Because our IDs are 12 characters long, we have to pass 12 to the Nano ID generator. Finally, we insert a new row with the new ID and URL into our database. The Neon serverless driver automatically parameterizes queries, so we don’t need to worry about malicious users passing SQL statements into the URL.

redirect.js

redirect.js is where our actual URL redirects will happen. Replace its code with the following:

TypeScript

import { app } from "@azure/functions";
import { neon } from "@neondatabase/serverless";

const sql = neon("[YOUR_POSTGRES_CONNECTION_STRING]");

app.http("redirect", {
  methods: ["GET"],
  authLevel: "anonymous",
  route: "{id:length(12)}",
  handler: async (request, context) => {
    const url =
      await sql`SELECT * FROM urls WHERE urls.id=${request.params.id}`;
    if (!url[0]) return new Response("No redirect found", { status: 404 });
    return Response.redirect(url[0].url, 308);
  },
});

The first section of the script is the same as submit.js. Once again, replace [YOUR_POSTGRES_CONNECTION_STRING] with the string you copied from the Neon dashboard.

The route is where things get more interesting. We need to accept any path that could be a redirect ID, so we use a route parameter with a constraint of 12 characters. Note that this could overlap if you ever have another 12-character route. If it does, you can rename the redirect route to start with a Z or another alphanumerically greater character so that Azure serverless loads the redirect route later.

Finally, we have our actual handler code. All we need to do here is query for a URL matching the given ID and redirect to it if one exists. We use the 308 status code in our redirect to tell browsers and search engines to ignore the original shortened URL.

Config Files

There are two more changes we need to make. First, we don’t want a /api prefix on all our functions. To remove this, open host.json, which should be in your project directory, and add the following:

JSON

"extensions": {
  "http": {
    "routePrefix": ""
  }
}

This allows your routes to operate without any prefixes. The one other thing we need to do is switch the project to ES Modules. Open package.json and insert the following at the end of the file:

Plain Text

"type": "module"

That’s it!

Testing and Deploying

Now, you can try testing locally by running func start. You can navigate to http://localhost:7071/submit?url=https://example.com, then use the ID it gives you and navigate to http://localhost:7071/[YOUR_ID]. You should be redirected to example.com. Of course, we can’t just run this locally.
To deploy, we need to install the Azure CLI, which you can do with one of the following commands, depending on your operating system:

macOS (Homebrew)

Plain Text

brew install azure-cli

Windows (winget)

Plain Text

winget install -e --id Microsoft.AzureCLI

Linux

Plain Text

curl -L https://aka.ms/InstallAzureCli | bash

Now, restart the terminal, log in by running az login, and run the following in the project directory:

Plain Text

func azure functionapp publish [FunctionAppName]

Replace [FunctionAppName] with whatever you named your function earlier. You should now be able to access your API at [FunctionAppName].azurewebsites.net.

Conclusion

You should now have a fully functional URL shortener. You can access the code here and work on adding a front end. If you want to keep reading about Neon and Azure’s features, we recommend checking out Branching. Either way, I hope you learned something valuable from this guide.
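As a closing note, the deployed API can also be smoke-tested from a script rather than a browser. The sketch below uses Python's requests library; the function app name is a placeholder for the one you chose earlier, and it relies on the response text format returned by submit.js above.

Python

# Minimal sketch: smoke-test the deployed URL shortener.
# Assumptions: `pip install requests`; FUNCTION_APP is a placeholder.
import requests

FUNCTION_APP = "my-url-shortener"  # placeholder for your function app name
BASE = f"https://{FUNCTION_APP}.azurewebsites.net"

# Create a short URL.
created = requests.get(f"{BASE}/submit", params={"url": "https://example.com"})
print(created.status_code, created.text)  # e.g., "Shortened url created with id <12-char id>"

# Extract the 12-character id from the response text and follow the redirect.
short_id = created.text.rsplit(" ", 1)[-1]
resp = requests.get(f"{BASE}/{short_id}", allow_redirects=False)
print(resp.status_code, resp.headers.get("Location"))  # expected: 308 https://example.com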
Overview One of the key principles of writing a good data pipeline is ensuring accurate data is loaded into the target table. We have no control over the quality of the upstream data we read from, but we can have a few data quality (DQ) checks in our pipeline to ensure any bad data would be caught early on without letting it propagate downstream. DQ checks are critical in making sure the data that gets processed every day is reliable, and that downstream tables can query them safely. This will save a lot of time and resources, as we will be able to halt the data flow, giving us some time to do RCA and fix the issue rather than pass incorrect data. The biggest challenge with large data warehouses with multiple interdependent pipelines is that we would have no idea about the data issue if bad data gets introduced in one of the pipelines, and sometimes, it could take days, even before it's detected. Even though DQ check failures could cause some temporary delay in landing the data, it's much better than customers or users reporting data quality issues and then having to backfill all the impacted tables. Some of the common data quality issues that could occur are: Duplicate rows – a table at user grain (which means there can only be one row per user), having duplicates0 or null values – you expect certain critical columns not to have any null or 0 values, e.g., SSN, age, country columnsAbnormal row count – the overall row count of the table suddenly increases or drops compared to the historical valuesAbnormal metric value – a specific metric, say '# of daily user logins' suddenly spikes or drops compared to historical values Note: The operators we will be referencing below are part of the Dataswarm and Presto tech stack, which are a proprietary data pipeline building tool and an SQL query engine, respectively, developed at Facebook. Importance of Signal Tables It's a good practice to publish signal tables, which should serve as the source for downstream pipelines. These are essentially linked views that can be created on top of any table. Since they are views, they don’t take up any storage, so there is no reason not to build them. These should be created only after the DQ checks pass, and downstream pipelines should wait for these signal tables to be available rather than waiting for the source table directly, as these would have been vetted for any data anomalies. Building the Right DAG In the data lineage flow below, if bad data gets loaded into table1, then without DQ checks, they would get passed on to table2 and table3 as there is no way for pipelines2 and 3 to know of any data issues, as all they do is simply check if the table1 data has landed. But if DQ checks had been implemented, then it would fail the job/pipeline, and the table1_sigal wouldn’t have been created; thus, the downstream WaitForOperators would still be waiting, stopping the propagation of bad data. Types of DQ Failures to Enforce Hard failure. If these DQ checks fail, the job will fail and notify the oncall or table owner, so the signal table will not be created. These could potentially cause downstream pipelines to be delayed and could be an issue if they have tighter Service Level Agreements (SLAs). But for critical pipelines, this might be worth it, as sending bad data could have catastrophic ripple effects.Soft failure. If these fail, the oncall and table owner would be notified, but the job won't fail, so the signal table would still get published, and the data would get loaded and propagated downstream. 
For cases where the data quality loss is tolerable, this can be used.

Setting Up DQ Checks

We will go over some examples of how we can set up the different DQ checks and some simplified trigger logic behind each of the DQ operators. Some things to know beforehand:

- '<DATEID>' is a macro that will resolve to the date the Dataswarm pipeline is scheduled to run (e.g., when the job runs on Oct 1, 2020, it will resolve to ds = '2020-10-01').
- The output of presto_api will be an array of dictionaries, e.g., [{'ds': '2020-10-01', 'userID': 123, ‘country’: ‘US’}, {'ds': '2020-10-01', 'userID': 124, ‘country’: ‘CA’}, {...}], where each dictionary represents a row of the table being queried, with column names as keys and row values as values.

Duplicate Rows

We can simply aggregate by the key column (e.g., userID) specified by the user and check whether any duplicate rows are present by performing a simple GROUP BY with a HAVING clause and limiting to just one row. The presto_results variable should be empty ([]); if not, there are duplicates in the table.

Python

# output will be an array of dicts representing each row in the table
# e.g., [{'ds': '2020-10-01', 'userID': 123}, {...}]
presto_results = presto_api(
    namespace = 'namespace_name',
    sql = '''
        SELECT userID
        FROM table
        WHERE ds = '<DATEID>'
        GROUP BY 1
        HAVING SUM(1) > 1
        LIMIT 1
    '''
)

if len(presto_results) > 0:
    # NOTIFY oncall/owner
    # JOB KILLED
else:
    # JOB SUCCESS

0 or Null Values

We can check whether any of the specified columns have invalid values by leveraging the count_if Presto UDF. Here, if there are no invalid values, the output should be [{'userid_null_count': 0}].

Python

presto_results = presto_api(
    namespace = 'namespace_name',
    sql = '''
        SELECT
            COUNT_IF(
                userid IS NULL
                OR userid = 0
            ) AS userid_null_count
        FROM table
        WHERE ds = '<DATEID>'
    '''
)

if presto_results[0]['userid_null_count'] > 0:
    # NOTIFY oncall/owner
    # JOB KILLED
else:
    # JOB SUCCESS

Abnormal Row Count

To get a sense of the normal/expected daily row count for a table, we can take a simple average of the previous 7 days, and if today's value deviates too much from that, we can trigger the alert. The thresholds can be either:

- Static – a fixed upper and lower threshold. Every day, the operator checks whether today’s row count is above or below the thresholds.
- Dynamic – use a +x% and -x% threshold value (you can start with, say, 15%, and adjust as needed), and if today's value is greater than the 7d avg + x% or lower than the 7d avg - x%, then trigger the alert.
Python dq_insert_operator = PrestoInsertOperator( input_data = {"in": "source_table"}, output_data = {"out": "dq_check_output_table"}, select = """ SELECT SUM(1) AS row_count FROM source_table WHERE ds = '<DATEID>' """, ) dq_row_check_result = presto_api( namespace = 'namespace_name', sql = ''' SELECT ds, row_count FROM dq_check_output_table WHERE ds >= '<DATEID-7>' ORDER BY 1 ''' ) # we will loop through the dq_row_check_result object, which will have 8 values # where we will find the average between DATEID-7 and DATEID-1 and compare against DATEID x = .15 # threshold prev_7d_list = dq_row_check_result[0:7] prev_7d_sum = sum([prev_data['row_count'] for prev_data in prev_7d_list]) prev_7d_avg = prev_7d_sum/7 today_value = dq_row_check_result[-1]['row_count'] upper_threshold = prev_7d_avg * (1 + x) lower_threshold = prev_7d_avg * (1 - x) if today_value > upper_threshold or today_value < lower_threshold: # NOTIFY oncall/owner # JOB KILLED else: # JOB SUCCESS So, every day, we calculate the sum of the total row count and load it into a dq_check_output_table (a temporary intermediate table that is specially used for storing DQ aggregated results). Then, we query the last 7 days and today's data from that table and store the values in an object, which we then loop through to calculate the upper and lower thresholds and check if today's value is violating either of them. Abnormal Metric Value If there are specific metrics that you want to track to see if there are any anomalies, you can set them up similarly to the above 'abnormal row count' check. Python dq_insert_operator = PrestoInsertOperator( input_data={"in": "source_table"}, output_data={"out": "dq_check_output_table"}, select=""" SELECT APPROX_DISTINCT(userid) AS distinct_user_count, SUM(cost) AS total_cost, COUNT_IF(has_login = True) AS total_logins FROM source_table WHERE ds = '<DATEID>' """, ) dq_row_check_result = presto_api( namespace='namespace_name', sql=''' SELECT ds, distinct_user_count, total_cost, total_logins FROM table WHERE ds >= '<DATEID-7>' ORDER BY 1 ''' ) Here, we calculate the distinct_user_count, total_cost, and total_logins metric and load it into a dq_check_output_table table, which we will query to find the anomalies. Takeaways You can extend this to any kind of custom checks/alerts like month-over-month value changes, year-over-year changes, etc. You can also specify GROUP BY clauses, for example, track the metric value at the interface or country level over a period of time. You can set up a DQ check tracking dashboard, especially for important metrics, to see how they have been behaving over time. In the screenshot below, you can see that there have been DQ failures for two of the dates in the past, while for other days, it has been within the predefined range. This can also be used to get a sense of how stable the upstream data quality is. They can save a lot of time as developers would be able to catch issues early on and also figure out where in the lineage the issue is occurring.Sometimes, the alerts could be false positive (FP) (alerts generated not due to bad/incorrect data, but maybe due to seasonality/new product launch, there could be a genuine volume increase or decrease). We need to ensure such edge cases are handled correctly to avoid noisy alerts. There is nothing worse than oncall being bombarded with FP alerts, so we want to be mindful of the thresholds we set and tune them as needed periodically.
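To reuse the same 7-day-average logic for all of the metrics loaded above, the comparison can be factored into a small helper. The sketch below assumes dq_row_check_result is the 8-row output of the presto_api call (seven prior days plus today, ordered by ds ascending); the metric names and tolerances are examples to tune per table.

Python

# Minimal sketch: generalize the 7-day-average threshold check to several metrics.
METRIC_THRESHOLDS = {            # per-metric +/- tolerance, tune as needed
    "distinct_user_count": 0.15,
    "total_cost": 0.25,
    "total_logins": 0.20,
}

def find_violations(dq_rows, thresholds):
    # Returns {metric: (today_value, lower, upper)} for every metric outside its band.
    prev_days, today = dq_rows[:-1], dq_rows[-1]
    violations = {}
    for metric, tolerance in thresholds.items():
        avg = sum(row[metric] for row in prev_days) / len(prev_days)
        lower, upper = avg * (1 - tolerance), avg * (1 + tolerance)
        if not (lower <= today[metric] <= upper):
            violations[metric] = (today[metric], lower, upper)
    return violations

violations = find_violations(dq_row_check_result, METRIC_THRESHOLDS)
if violations:
    pass  # NOTIFY oncall/owner and fail the job, as in the abnormal row count check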
When it comes to managing infrastructure in the cloud, AWS provides several powerful tools that help automate the creation and management of resources. One of the most effective ways to handle deployments is through AWS CloudFormation. It allows you to define your infrastructure in a declarative way, making it easy to automate the provisioning of AWS services, including Elastic Beanstalk, serverless applications, EC2 instances, security groups, load balancers, and more. In this guide, we'll explore how to use AWS CloudFormation to deploy infrastructure programmatically. We'll also cover how to manually deploy resources via the AWS Management Console and how to integrate services like Elastic Beanstalk, serverless functions, EC2, IAM, and other AWS resources into your automated workflow. Using AWS CloudFormation for Infrastructure as Code AWS CloudFormation allows you to define your infrastructure using code. CloudFormation provides a unified framework to automate and version your infrastructure, whether you are setting up Elastic Beanstalk, EC2 instances, VPCs, IAM roles, Lambda functions, or serverless applications. CloudFormation templates are written in YAML or JSON format, and they define the resources you need to provision. With CloudFormation, you can automate everything from simple applications to complex, multi-service environments. Key Features of CloudFormation Declarative configuration. Describe the desired state of your infrastructure, and CloudFormation ensures that the current state matches it.Resource management. Automatically provisions and manages AWS resources such as EC2 instances, RDS databases, VPCs, Lambda functions, IAM roles, and more.Declarative stack updates. If you need to modify your infrastructure, simply update the CloudFormation template, and it will adjust your resources to the new desired state. Steps to Use CloudFormation for Various AWS Deployments Elastic Beanstalk Deployment With CloudFormation 1. Write a CloudFormation Template Create a YAML or JSON CloudFormation template to define your Elastic Beanstalk application and environment. This template can include resources like EC2 instances, security groups, scaling policies, and even the Elastic Beanstalk application itself. Example of CloudFormation Template (Elastic Beanstalk): YAML Resources: MyElasticBeanstalkApplication: Type: 'AWS::ElasticBeanstalk::Application' Properties: ApplicationName: "my-application" Description: "Elastic Beanstalk Application for my React and Spring Boot app" MyElasticBeanstalkEnvironment: Type: 'AWS::ElasticBeanstalk::Environment' Properties: EnvironmentName: "my-app-env" ApplicationName: !Ref MyElasticBeanstalkApplication SolutionStackName: "64bit Amazon Linux 2 v3.4.9 running Docker" OptionSettings: - Namespace: "aws:autoscaling:asg" OptionName: "MaxSize" Value: "3" - Namespace: "aws:autoscaling:asg" OptionName: "MinSize" Value: "2" - Namespace: "aws:ec2:vpc" OptionName: "VPCId" Value: "vpc-xxxxxxx" - Namespace: "aws:ec2:vpc" OptionName: "Subnets" Value: "subnet-xxxxxxx,subnet-yyyyyyy" 2. Deploy the CloudFormation Stack Use the AWS CLI or AWS Management Console to deploy the CloudFormation stack. Once deployed, CloudFormation will automatically create all the resources defined in the template. Deploy via AWS CLI: Shell aws cloudformation create-stack --stack-name MyElasticBeanstalkStack --template-body file://my-template.yml If you prefer to drive this step from Python instead of the CLI, see the boto3 sketch below. Serverless Deployment With AWS Lambda, API Gateway, and DynamoDB CloudFormation is also great for deploying serverless applications.
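As mentioned above, the same stack creation can also be driven from Python with boto3 before we move on to serverless templates. This is a minimal sketch, not part of the original walkthrough: it reuses my-template.yml and the MyElasticBeanstalkStack name from the CLI example, and the region is an assumption.
Python
import boto3

# Create a CloudFormation client (the region is an assumption for this sketch)
cloudformation = boto3.client("cloudformation", region_name="us-east-1")

# Read the same template used in the CLI example
with open("my-template.yml") as template_file:
    template_body = template_file.read()

# Create the stack and block until creation completes
cloudformation.create_stack(
    StackName="MyElasticBeanstalkStack",
    TemplateBody=template_body,
)
cloudformation.get_waiter("stack_create_complete").wait(
    StackName="MyElasticBeanstalkStack"
)
print("Stack created successfully")
The waiter polls the stack status for you, which is convenient in CI pipelines where the next step depends on the environment being ready.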
With services like AWS Lambda, API Gateway, DynamoDB, and S3, you can easily manage serverless workloads. 1. Create a Serverless CloudFormation Template This template will include a Lambda function, an API Gateway for accessing the function, and a DynamoDB table. Example of CloudFormation Template (Serverless): YAML Resources: MyLambdaFunction: Type: 'AWS::Lambda::Function' Properties: FunctionName: "MyServerlessFunction" Handler: "index.handler" Role: arn:aws:iam::123456789012:role/lambda-execution-role Code: S3Bucket: "my-serverless-code-bucket" S3Key: "function-code.zip" Runtime: nodejs14.x MyAPIGateway: Type: 'AWS::ApiGateway::RestApi' Properties: Name: "MyAPI" Description: "API Gateway for My Serverless Application" MyDynamoDBTable: Type: 'AWS::DynamoDB::Table' Properties: TableName: "MyTable" AttributeDefinitions: - AttributeName: "id" AttributeType: "S" KeySchema: - AttributeName: "id" KeyType: "HASH" ProvisionedThroughput: ReadCapacityUnits: 5 WriteCapacityUnits: 5 2. Deploy the Serverless Stack Deploy your serverless application using the AWS CLI or AWS Management Console. Shell aws cloudformation create-stack --stack-name MyServerlessStack --template-body file://serverless-template.yml VPC and EC2 Deployment CloudFormation can automate the creation of a Virtual Private Cloud (VPC), subnets, security groups, and EC2 instances for more traditional workloads. 1. CloudFormation Template for VPC and EC2 This template defines a simple EC2 instance within a VPC, placed in a subnet and protected by a security group that allows HTTP and SSH traffic. Example of CloudFormation Template (VPC and EC2): YAML Resources: MyVPC: Type: 'AWS::EC2::VPC' Properties: CidrBlock: "10.0.0.0/16" EnableDnsSupport: "true" EnableDnsHostnames: "true" MySubnet: Type: 'AWS::EC2::Subnet' Properties: VpcId: !Ref MyVPC CidrBlock: "10.0.1.0/24" MySecurityGroup: Type: 'AWS::EC2::SecurityGroup' Properties: GroupDescription: "Allow HTTP and SSH traffic" VpcId: !Ref MyVPC SecurityGroupIngress: - IpProtocol: "tcp" FromPort: "80" ToPort: "80" CidrIp: "0.0.0.0/0" - IpProtocol: "tcp" FromPort: "22" ToPort: "22" CidrIp: "0.0.0.0/0" MyEC2Instance: Type: 'AWS::EC2::Instance' Properties: InstanceType: "t2.micro" ImageId: "ami-xxxxxxxx" SecurityGroupIds: - !Ref MySecurityGroup SubnetId: !Ref MySubnet 2. Deploy the Stack Shell aws cloudformation create-stack --stack-name MyEC2Stack --template-body file://vpc-ec2-template.yml Advanced Features of CloudFormation AWS CloudFormation offers more than just simple resource provisioning. Here are some of the advanced features that make CloudFormation a powerful tool for infrastructure automation: Stack Sets. Create and manage stacks across multiple AWS accounts and regions, allowing for consistent deployment of infrastructure across your organization.Change Sets. Before applying changes to your CloudFormation stack, preview the changes with a change set to ensure the desired outcome (a short boto3 sketch of this workflow appears at the end of this article).Outputs. Output values from CloudFormation that you can use for other stacks or applications. For example, output the URL of an API Gateway or the IP address of an EC2 instance.Parameters. Pass in parameters to customize your stack without modifying the template itself, making it reusable in different environments.Mappings. Create key-value pairs for mapping configuration values, like AWS region-specific values, instance types, or other environment-specific parameters. Using CloudFormation With AWS Services Beyond Elastic Beanstalk CloudFormation isn't just limited to Elastic Beanstalk deployments — it's a flexible tool that can be used with a variety of AWS services, including: AWS Lambda.
Automate the deployment of serverless functions along with triggers like API Gateway, S3, or DynamoDB events.Amazon S3. Use CloudFormation to create S3 buckets and manage their configuration.AWS IAM. Automate IAM role and policy creation to control access to your resources.Amazon RDS. Define RDS databases (MySQL, PostgreSQL, etc.) with all associated configurations like VPC settings, subnets, and security groups.Amazon SQS, SNS. Manage queues and topics for your application architecture using CloudFormation.Amazon ECS and EKS. Automate the creation and deployment of containerized applications with services like ECS and EKS. Manually Deploying Infrastructure from the AWS Management Console While CloudFormation automates the process, sometimes manual intervention is necessary. The AWS Management Console allows you to deploy resources manually. 1. Elastic Beanstalk Application Go to the Elastic Beanstalk Console.Click Create Application, follow the steps to define the application name and platform (e.g., Docker, Node.js), and then manually configure the environment, scaling, and security options. 2. Serverless Applications (Lambda + API Gateway) Go to Lambda Console to create and deploy functions.Use API Gateway Console to create APIs for your Lambda functions. 3. EC2 Instances Manually launch EC2 instances from the EC2 Console and configure them with your chosen instance type, security groups, and key pairs. Conclusion AWS CloudFormation provides a consistent and repeatable way to manage infrastructure for Elastic Beanstalk applications, serverless architectures, and EC2-based applications. With its advanced features like Stack Sets, Change Sets, and Parameters, CloudFormation can scale to meet the needs of complex environments. For anyone managing large or dynamic AWS environments, CloudFormation is an essential tool for ensuring consistency, security, and automation across all your AWS deployments.
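As noted in the advanced features list above, Change Sets let you preview modifications before they are applied. The following is a hedged boto3 sketch of that workflow, not part of the original guide: it reuses the MyElasticBeanstalkStack and my-template.yml names from earlier examples, and the region and change set name are assumptions.
Python
import boto3

# Region is an assumption for this sketch
cloudformation = boto3.client("cloudformation", region_name="us-east-1")

with open("my-template.yml") as template_file:
    template_body = template_file.read()

# Create a change set against the existing stack instead of updating it directly
cloudformation.create_change_set(
    StackName="MyElasticBeanstalkStack",
    ChangeSetName="preview-changes",
    TemplateBody=template_body,
    ChangeSetType="UPDATE",
)
cloudformation.get_waiter("change_set_create_complete").wait(
    StackName="MyElasticBeanstalkStack", ChangeSetName="preview-changes"
)

# Inspect the proposed changes before deciding whether to execute or discard them
description = cloudformation.describe_change_set(
    StackName="MyElasticBeanstalkStack", ChangeSetName="preview-changes"
)
for change in description["Changes"]:
    resource_change = change["ResourceChange"]
    print(resource_change["Action"], resource_change["LogicalResourceId"])

# Apply the change set only if the preview looks right:
# cloudformation.execute_change_set(StackName="MyElasticBeanstalkStack", ChangeSetName="preview-changes")
Note that if the template contains no actual changes, CloudFormation marks the change set as failed and the waiter raises an error; handling that case is left out of this sketch.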
Keycloak is a powerful authentication and authorization solution that provides plenty of useful features, such as roles and subgroups, an advanced password policy, and single sign-on. It’s also very easy to integrate with other solutions. We’ve already shown you how to connect Keycloak to your Angular app, but there’s more you can do. For example, by integrating this technology with Cypress, you can enable the simulation of real-user login scenarios, including multi-factor authentication and social logins, ensuring that security protocols are correctly implemented and functioning as expected. Most importantly, you can also use Docker containers to provide a portable and consistent environment across different platforms (possibly with container image scanning, for increased security). This integration ensures easy deployment, scalability, and efficient dependency management, streamlining the process of securing applications and services. Additionally, Docker Compose can be used to orchestrate multiple containers, simplifying complex configurations and enhancing the overall management of Keycloak instances. This guide will show you precisely how to set all of this up. Let’s get started! Prerequisites The article is based on the contents of a GitHub repository consisting of several elements: Frontend application written in AngularKeycloak configurationE2E tests written in CypressDocker configuration for the whole stack The point of this tech stack is to allow users to work with Angular/Keycloak/Cypress locally and also in Docker containers. Keycloak Configuration We’ll start by setting up Keycloak, which is a crucial part of both configurations. The idea is to run it inside a Docker container and expose it at http://localhost:8080. Keycloak has predefined configurations, including users, realm, and client ID, so setting it up for this project requires minimum effort. Normal User Your normal user in the Keycloak panel should be configured using the following details: User: testPassword: sIjKqg73MTf9uTU Keycloak Administrator Here’s the default configuration for the admin user (of course, you probably shouldn’t use default settings for the admin account in real-world scenarios). User: adminPassword: admin Local Configuration This configuration allows you to work locally with an Angular application in dev mode along with E2E tests. It requires Keycloak to be run and available on http://localhost:8080. This is set in the Docker configuration, which is partially used here. To run the configuration locally, use the following commands in the command line. First, in the main project directory: Shell npm install In /e2e directory: Shell npm install In the main directory for frontend application development: Shell npm run start In /e2e directory: Shell npm run cy:run In the main project directory: Shell docker-compose up -d keycloak Docker Configuration Installing and configuring Docker is a relatively simple matter — the solution provides detailed documentation you can use if you run into any problems.
In the context of our project, the Docker configuration does several key things: Running Keycloak and importing the predefined realm along with usersBuilding and exposing the Angular application on http://localhost:4200 via nginx on a separate Docker containerRunning the e2e container to allow you to run tests via Cypress To run the dockerized configuration, type the following in the command line in the main project directory: Shell docker-compose up -d To run Cypress tests inside the container, use the following command: Shell docker container exec -ti e2e bash Then, inside the container, run: Shell npm run cy:run Test artifacts are connected to the host machine via a volume, so test reports, screenshots, and videos will be available immediately on the path /e2e/cypress/ in the following folders: reports, screenshots, and videos. Conclusion And that’s about it. As you can see, integrating Keycloak (or rather an Angular app that uses Keycloak), Docker, and Cypress is a relatively straightforward process. There are only a couple of steps you must take to get a consistent, containerized environment for easy deployment, scaling, and efficient dependency management — with the added benefit of real-user login scenario simulation thanks to Cypress for top-notch security.
DuckDB is a powerful in-process analytical database that can run entirely in memory and process data in parallel, which makes it a good choice for reading and transforming cloud storage data, in this case, AWS S3. I've had a lot of success using it, and I will walk you through the steps of implementing it, along with some learnings and best practices. Using DuckDB, the httpfs extension, and pyarrow, we can efficiently process Parquet files stored in S3 buckets. Let's dive in. Before starting the installation of DuckDB, make sure you have these prerequisites: Python 3.9 or higher installed Prior knowledge of setting up Python projects and virtual environments or conda environments Installing Dependencies First, let's establish the necessary environment: Shell # Install required packages for cloud integration pip install "duckdb>=0.8.0" pyarrow pandas boto3 requests The dependencies explained: duckdb>=0.8.0: The core database engine that provides SQL functionality and in-memory processingpyarrow: Handles Parquet file operations efficiently with columnar storage supportpandas: Enables powerful data manipulation and analysis capabilitiesboto3: AWS SDK for Python, providing interfaces to AWS servicesrequests: Manages HTTP communications for cloud interactions Configuring Secure Cloud Access Python import duckdb import os # Initialize DuckDB with cloud support conn = duckdb.connect(':memory:') conn.execute("INSTALL httpfs;") conn.execute("LOAD httpfs;") # Secure AWS configuration conn.execute(""" SET s3_region='your-region'; SET s3_access_key_id='your-access-key'; SET s3_secret_access_key='your-secret-key'; """) This initialization code does several important things: Creates a new DuckDB connection in memory using :memory:Installs and loads the HTTP filesystem extension (httpfs), which enables cloud storage accessConfigures AWS credentials with your specific region and access keysSets up a secure connection to AWS services Processing AWS S3 Parquet Files Let's examine a comprehensive example of processing Parquet files with sensitive data masking: Python import duckdb import pandas as pd # Create sample data to demonstrate parquet processing sample_data = pd.DataFrame({ 'name': ['John Smith', 'Jane Doe', 'Bob Wilson', 'Alice Brown'], 'email': ['john.smith@email.com', 'jane.doe@company.com', 'bob@email.net', 'alice.b@org.com'], 'phone': ['123-456-7890', '234-567-8901', '345-678-9012', '456-789-0123'], 'ssn': ['123-45-6789', '234-56-7890', '345-67-8901', '456-78-9012'], 'address': ['123 Main St', '456 Oak Ave', '789 Pine Rd', '321 Elm Dr'], 'salary': [75000, 85000, 65000, 95000] # Non-sensitive data }) This sample data creation helps us demonstrate data masking techniques.
We include various types of sensitive information commonly found in real-world datasets: Personal identifiers (name, SSN)Contact information (email, phone, address)Financial data (salary) Now, let's look at the processing function: Python def demonstrate_parquet_processing(): # Create a DuckDB connection conn = duckdb.connect(':memory:') # Save sample data as parquet sample_data.to_parquet('sample_data.parquet') # Define sensitive columns to mask sensitive_cols = ['email', 'phone', 'ssn'] # Process the parquet file with masking (raw string so regex backreferences and quantifiers pass through unchanged) query = r""" CREATE TABLE masked_data AS SELECT -- Mask name: keep first letter of first and last name regexp_replace(name, '([A-Z])[a-z]+ ([A-Z])[a-z]+', '\1*** \2***') as name, -- Mask email: hide everything before @ regexp_replace(email, '([a-zA-Z0-9._%+-]+)(@.*)', '****\2') as email, -- Mask phone: show only last 4 digits regexp_replace(phone, '[0-9]{3}-[0-9]{3}-', '***-***-') as phone, -- Mask SSN: show only last 4 digits regexp_replace(ssn, '[0-9]{3}-[0-9]{2}-', '***-**-') as ssn, -- Mask address: show only street type regexp_replace(address, '[0-9]+ [A-Za-z]+ ', '*** ') as address, -- Keep non-sensitive data as is salary FROM read_parquet('sample_data.parquet'); """ # Execute the masking query and return the masked rows for inspection conn.execute(query) return conn.execute("SELECT * FROM masked_data").df() Let's break down this processing function: We create a new DuckDB connectionConvert our sample DataFrame to a Parquet fileDefine which columns contain sensitive informationCreate a SQL query that applies different masking patterns: Names: Preserves initials (e.g., "John Smith" → "J*** S***")Emails: Hides local part while keeping domain (e.g., "john.smith@email.com" → "****@email.com")Phone numbers: Shows only the last four digitsSSNs: Displays only the last four digitsAddresses: Keeps only street typeSalary: Remains unmasked as non-sensitive data The output should look like: Plain Text Original Data: ============= name email phone ssn address salary 0 John Smith john.smith@email.com 123-456-7890 123-45-6789 123 Main St 75000 1 Jane Doe jane.doe@company.com 234-567-8901 234-56-7890 456 Oak Ave 85000 2 Bob Wilson bob@email.net 345-678-9012 345-67-8901 789 Pine Rd 65000 3 Alice Brown alice.b@org.com 456-789-0123 456-78-9012 321 Elm Dr 95000 Masked Data: =========== name email phone ssn address salary 0 J*** S*** ****@email.com ***-***-7890 ***-**-6789 *** St 75000 1 J*** D*** ****@company.com ***-***-8901 ***-**-7890 *** Ave 85000 2 B*** W*** ****@email.net ***-***-9012 ***-**-8901 *** Rd 65000 3 A*** B*** ****@org.com ***-***-0123 ***-**-9012 *** Dr 95000 Now, let's explore different masking patterns with explanations in the comments of the Python code snippets: Email Masking Variations Python # Show first letter only "john.smith@email.com" → "j***@email.com" # Show domain only "john.smith@email.com" → "****@email.com" # Show first and last letter "john.smith@email.com" → "j*********h@email.com" Phone Number Masking Python # Last 4 digits only "123-456-7890" → "***-***-7890" # First 3 digits only "123-456-7890" → "123-***-****" # Middle digits only "123-456-7890" → "***-456-****" Name Masking Python # Initials only "John Smith" → "J.S." # First letter of each word "John Smith" → "J*** S***" # Fixed length masking "John Smith" → "XXXX XXXXX" Efficient Partitioned Data Processing When dealing with large datasets, partitioning becomes crucial.
Here's how to handle partitioned data efficiently: Python def process_partitioned_data(base_path, partition_column, sensitive_columns): """ Process partitioned data efficiently Parameters: - base_path: Base path to partitioned data - partition_column: Column used for partitioning (e.g., 'date') - sensitive_columns: List of columns to mask """ conn = duckdb.connect(':memory:') try: # 1. List all partitions query = f""" WITH partitions AS ( SELECT DISTINCT {partition_column} FROM read_parquet('{base_path}/*/*.parquet') ) SELECT * FROM partitions; """ partitions = conn.execute(query).fetchall() # 2. Each partition can then be read and masked individually, reusing the masking patterns shown earlier return partitions finally: # Proper cleanup of the connection conn.close() This function demonstrates several important concepts: Dynamic partition discoveryMemory-efficient processingError handling with proper cleanupMasked data output generation The partition structure typically looks like: Partition Structure Plain Text sample_data/ ├── date=2024-01-01/ │ └── data.parquet ├── date=2024-01-02/ │ └── data.parquet └── date=2024-01-03/ └── data.parquet Sample Data Plain Text Original Data: date customer_id email phone amount 2024-01-01 1 user1@email.com 123-456-0001 500.00 2024-01-01 2 user2@email.com 123-456-0002 750.25 ... Masked Data: date customer_id email phone amount 2024-01-01 1 **** **** 500.00 2024-01-01 2 **** **** 750.25 Below are some benefits of partitioned processing: Reduced memory footprintParallel processing capabilityImproved performanceScalable data handling Performance Optimization Techniques 1. Configuring Parallel Processing Python # Optimize for performance conn.execute(""" SET partial_streaming=true; SET threads=4; SET memory_limit='4GB'; """) These settings: Enable partial streaming for better memory managementSet parallel processing threadsDefine memory limits to prevent overflow 2. Robust Error Handling Python import time def robust_s3_read(s3_path, max_retries=3): """ Implement reliable S3 data reading with retries. Parameters: - s3_path: Path to S3 data - max_retries: Maximum retry attempts """ # Assumes conn is the DuckDB connection configured earlier for attempt in range(max_retries): try: return conn.execute(f"SELECT * FROM read_parquet('{s3_path}')") except Exception as e: if attempt == max_retries - 1: raise time.sleep(2 ** attempt) # Exponential backoff This code block demonstrates how to implement retries with exponential backoff and re-raise the exception once all attempts are exhausted, so failures surface quickly and can be acted on. 3. Storage Optimization Python # Efficient data storage with compression conn.execute(""" COPY (SELECT * FROM masked_data) TO 's3://output-bucket/masked_data.parquet' (FORMAT 'parquet', COMPRESSION 'ZSTD'); """) This code block demonstrates applying ZSTD compression when writing the masked data back to S3 to optimize storage. Best Practices and Recommendations Security Best Practices Security is crucial when handling data, especially in cloud environments. Following these practices helps protect sensitive information and maintain compliance: IAM roles. Use AWS Identity and Access Management roles instead of direct access keys when possibleKey rotation. Implement regular rotation of access keysLeast privilege. Grant minimum necessary permissionsAccess monitoring. Regularly review and audit access patterns Why it's important: Security breaches can lead to data leaks, compliance violations, and financial losses. Proper security measures protect both your organization and your users' data. Performance Optimization Optimizing performance ensures efficient resource utilization and faster data processing: Partition sizing. Choose appropriate partition sizes based on data volume and processing patternsParallel processing. Utilize multiple threads for faster processingMemory management.
Monitor and optimize memory usageQuery optimization. Structure queries for maximum efficiency Why it's important: Efficient performance reduces processing time, saves computational resources, and improves overall system reliability. Error Handling Robust error handling ensures reliable data processing: Retry mechanisms. Implement exponential backoff for failed operationsComprehensive logging. Maintain detailed logs for debuggingStatus monitoring. Track processing progressEdge cases. Handle unexpected data scenarios Why it's important: Proper error handling prevents data loss, ensures processing completeness, and makes troubleshooting easier. Conclusion Cloud data processing with DuckDB and AWS S3 offers a powerful combination of performance and security. Let me know how your DuckDB implementation goes!
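To tie the pieces of this walkthrough together, here is a compact, hedged sketch of the full S3 round trip: configure httpfs, read the partitioned Parquet data directly from S3, mask a sensitive column, and write the result back with ZSTD compression. The bucket names and paths are placeholders, and hive_partitioning is assumed to expose the date partition column; adapt them to your own layout.
Python
import duckdb

conn = duckdb.connect(':memory:')
conn.execute("INSTALL httpfs; LOAD httpfs;")

# Placeholder credentials; prefer IAM roles and short-lived credentials where possible
conn.execute("""
    SET s3_region='your-region';
    SET s3_access_key_id='your-access-key';
    SET s3_secret_access_key='your-secret-key';
""")

# Read every partition, mask the email column, and write the result back to S3
conn.execute("""
    COPY (
        SELECT
            "date",
            customer_id,
            regexp_replace(email, '([a-zA-Z0-9._%+-]+)(@.*)', '****\\2') AS email,
            amount
        FROM read_parquet('s3://input-bucket/sample_data/*/*.parquet', hive_partitioning=true)
    )
    TO 's3://output-bucket/masked_data.parquet'
    (FORMAT 'parquet', COMPRESSION 'ZSTD');
""")
conn.close()
Because the masking and the write-back happen inside a single COPY statement, the data never has to be pulled into pandas on the client, which is what makes this pattern attractive for larger partitioned datasets.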