Maintenance

A developer's work is never truly finished once a feature or change is deployed. There is always a need for constant maintenance to ensure that a product or application continues to run as it should and is configured to scale. This Zone focuses on all your maintenance must-haves — from ensuring that your infrastructure is set up to manage various loads and improving software and data quality to tackling incident management, quality assurance, and more.

Latest Premium Content
Refcard #346
Microservices and Workflow Engines
Refcard #336
Value Stream Management Essentials
Refcard #332
Quality Assurance Patterns and Anti-Patterns

DZone's Featured Maintenance Resources

Incremental Jobs and Data Quality Are On a Collision Course


By Jack Vanlightly
If you keep an eye on the data space ecosystem like I do, then you’ll be aware of the rise of DuckDB and its message that big data is dead. The idea comes from two industry papers (and associated data sets), one from the Redshift team (paper and dataset) and one from Snowflake (paper and dataset). Each paper analyzed the queries run on their platforms, and some surprising conclusions were drawn — one being that most queries were run over quite small data. The conclusion (of DuckDB) was that big data was dead, and you could use simpler query engines rather than a data warehouse. It’s far more nuanced than that, but the data shows that most queries are run over smaller datasets. Why? On the one hand, many data sets are inherently small, corresponding to things like people, products, marketing campaigns, sales funnels, win/loss rates, etc. On the other hand, there are inherently large data sets (such as clickstreams, logistics events, IoT, sensor data, etc.) that are increasingly being processed incrementally.

Why the Trend Towards Incremental Processing?

Incremental processing has a number of advantages:

- It can be cheaper than recomputing the entire derived dataset again (especially if the source data is very big).
- Smaller precomputed datasets can be queried more often without huge costs.
- It can lower the time to insight. Rather than a batch job running on a schedule that balances cost vs. timeliness, an incremental job keeps the derived dataset up to date so that it’s only minutes or low hours behind the real world.
- More and more software systems act on the output of analytics jobs. When the output was a report, once a day was enough. When the output feeds into other systems that take actions based on the data, the arbitrary delays caused by periodic batch jobs make less sense.

Going incremental, while cheaper in many cases, doesn’t mean we’ll use less compute, though. The Jevons paradox is an economic concept: technological advancements that increase the efficiency with which a resource is used lead to a paradoxical increase in the overall consumption of that resource rather than a decrease. Greater resource efficiency leads people to believe that we won’t use as much of the resource, but the reality is that this often causes more consumption of the resource due to greater demand. Using this intuition of the Jevons paradox, we can expect the trend of incremental computation to lead to more computing resources being used in analytics rather than less. We can now:

- Run dashboards with faster refresh rates.
- Generate reports sooner.
- Utilize analytical data in more user-facing applications.
- Utilize analytical data to drive actions in other software systems.

As we make analytics more cost-efficient for lower-latency workloads, the demand for those workloads will undoubtedly increase (by finding new use cases that weren’t economically viable before). The rise of GenAI is another driver of demand (though definitely not one making analytics cheaper!).

Many data systems and data platforms already support incremental computation:

- Real-time OLAP: ClickHouse, Apache Pinot, and Apache Druid all provide incremental precomputed tables.
- Cloud DWH/lakehouse: Snowflake materialized views, Databricks DLT, dbt incremental jobs, Apache Spark jobs, the incremental capabilities of the open table formats, and incremental ingestion jobs.
- Stream processing: Apache Flink, Spark Structured Streaming, and Materialize (a streaming database that maintains materialized views over streams).
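To make the batch-versus-incremental distinction concrete, here is a minimal, hypothetical Python sketch (not from the article; the event shape and names are invented) contrasting a periodic job that rescans all raw events with an incremental job that folds each new event into the derived dataset as it arrives:

Python

from collections import defaultdict

raw_events = []                           # ever-growing source data
revenue_by_product = defaultdict(float)   # derived, precomputed dataset

def batch_recompute():
    """Periodic batch job: rescans every raw event on each run."""
    totals = defaultdict(float)
    for event in raw_events:
        totals[event["product_id"]] += event["amount"]
    return totals

def apply_incrementally(event):
    """Incremental job: touches only the new event, keeping the derived
    dataset minutes rather than hours behind the real world."""
    raw_events.append(event)
    revenue_by_product[event["product_id"]] += event["amount"]

apply_incrementally({"product_id": "sku-1", "amount": 19.99})
apply_incrementally({"product_id": "sku-2", "amount": 5.00})
assert revenue_by_product == batch_recompute()

The cost difference is the point: the batch job’s work grows with the size of the raw history, while the incremental job’s work grows only with the volume of new events.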
While the technology for incremental computation is already largely here, many organizations aren’t actually ready for a switch to incremental from periodic batch. The Collision Course Modern data engineering is emancipating ourselves from an uncontrolled flow of upstream changes that hinders our ability to deliver quality data. – Julien Le Dem The collision: Bad things happen when uncontrolled changes collide with incremental jobs that feed their output back into other software systems or pollute other derived data sets. Reacting to changes is a losing strategy – Jack Vanlightly Many, if not most, organizations are not equipped to realize this future where analytics data drives actions in other software systems and is exposed to users in user-facing applications. A world of incremental jobs raises the stakes on reliability, correctness, uptime (freshness), and general trustworthiness of data pipelines. The problem is that data pipelines are not reliable enough nor cost-effective enough (in terms of human resource costs) to meet this incremental computation trend. We need to rethink the traditional data warehouse architecture where raw data is ingested from across an organization and landed in a set of staging tables to be cleaned up serially and made ready for analysis. As we well know, that leads to constant break-fix work as data sources regularly change, breaking the data pipelines that turn the raw data into valuable insights. That may have been tolerable when analytics was about strategic decision support (like BI), where the difference of a few hours or a day might not be a disaster. But in an age where analytics is becoming relevant in operational systems and powering more and more real-time or low-minute workloads, it is clearly not a robust or effective approach. The ingest-raw-data->stage->clean->transform approach has a huge amount of inertia and a lot of tooling, but it is becoming less and less suitable as time passes. For analytics to be effective in a world of lower latency incremental processing and more operational use cases, it has to change. So, What Should We Do Instead? The barrier to improving data pipeline reliability and enabling more business-critical workloads mostly relates to how we organize teams and the data architectures we design. The technical aspects of the problem are well-known, and long-established engineering principles exist to tackle them. The thing we’re missing right now is that the very foundations that analytics is built on are not stable. The onus is on the data team to react quickly to changes in upstream applications and databases. This is clearly not going to work for analytics built on incremental jobs where expectations of timeliness are more easily compromised. Even for batch workloads, the constant break-fix work is a drain on resources and also leads to end users questioning the trustworthiness of reports and dashboards. The current approach of reacting to changes in raw data has come about largely because of Conway’s Law: how the different reporting structures have isolated data teams from the operational estate of applications and services. Without incentives for software and data teams to cooperate, data teams have, for years and years, been breaking one of the cardinal rules for how software systems should communicate. Namely, they reach out to grab the private internal state of applications and services. In the world of software engineering, this is an anti-pattern of epic proportions! 
It’s All About "Coupling" I could make a software architect choke on his or her coffee if I told them my service was directly reading the database of another service owned by a different team. Why is this such an anti-pattern? Why should it result in spilled coffee and dumbfounded shock? It’s all about coupling. This is a fundamental property of software systems that all software engineering organizations take heed of. When services depend on the private internal workings of other services, even small changes in one service's internal state can propagate unpredictably, leading to failures in distant systems and services. This is the principle of coupling, and we want low coupling. Low coupling means that we can change individual parts of a system without those changes propagating far and wide. The more coupling you have in a system, the more coordination and work are required to keep all parts of the system working. This is the situation data teams still find themselves in today. For this reason, software services expose public interfaces (such as a REST API, gRPC, GraphQL, a schematized queue, or a Kafka topic), carefully modeled, stable, and with careful evolution to avoid breaking changes. A system with many breaking changes has high coupling. In a high coupling world, every time I change my service, I force all dependent services to update as well. No, we either have to perform costly coordination between teams to update services (at the same time) or we get a nasty surprise in production. That is why in software engineering, we use contracts, and we have versioning schemes such as SemVer to govern contract changes. In fact, we have multiple ways of evolving public interfaces without propagating those changes further than they need to. It’s why services depend on contracts and not private internal state. Not only do teams build software that communicates via stable APIs, but the software teams collaborate to provide those APIs that the various teams require. This need for APIs and collaboration has only become larger over time. The average enterprise application or service used to be a bit of an island: it had its ten database tables and didn't really need much more. Increasingly, these applications are drawing on much richer sets of data and forming much more complex webs of dependencies. Given this web of dependencies between applications and services, (1) the number of consumers of each API has risen, and (2) the chance of some API change breaking a downstream service has also risen massively. Stable, versioned APIs between collaborating teams are the key. Data Products (Seriously) This is where data products come in. Like or loathe the term, it’s important. Rather than a data pipeline sucking out the private state of an application, it should consume a data product. Data products are very similar to the REST APIs on the software side. They aren’t totally the same, but they share many of the same concerns: Schemas. The shape of the data, both in terms of structure (the fields and their types) and the legal values (not null, credit card numbers with 16 numbers, etc).Careful evolution of schemas to prevent changes from propagating (we want low coupling). Avoiding breaking changes as much as humanly possible.Uptime, which for data products becomes “data freshness.” Is the data arriving on time? Is it late? Perhaps an SLO or even an SLA determines the data freshness goals. 
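As a small illustration of the schema concern, here is a hedged Python sketch of the kind of contract check a producing team might run at the boundary before publishing records to a data product. The contract, field names, and rules are hypothetical, and the jsonschema library is just one way to express them:

Python

from jsonschema import Draft7Validator

# Hypothetical v1 contract for an "orders" data product.
ORDER_CONTRACT_V1 = {
    "type": "object",
    "required": ["order_id", "customer_id", "amount", "currency", "created_at"],
    "properties": {
        "order_id": {"type": "string"},
        "customer_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
        "currency": {"type": "string", "pattern": "^[A-Z]{3}$"},
        "created_at": {"type": "string"},
    },
}

validator = Draft7Validator(ORDER_CONTRACT_V1)

def publish(record: dict) -> None:
    # Validate against the contract before the record leaves the producer,
    # so bad data is rejected at the boundary instead of leaking downstream.
    errors = [e.message for e in validator.iter_errors(record)]
    if errors:
        raise ValueError(f"Contract violation: {errors}")
    # ... hand the record to the Kafka producer / table writer here ...

publish({
    "order_id": "o-123",
    "customer_id": "c-9",
    "amount": 42.0,
    "currency": "EUR",
    "created_at": "2024-01-01T10:00:00Z",
})

Evolving the contract then becomes an explicit act (a v2 schema plus compatibility checks) rather than an accidental side effect of an application change.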
Concretely, data products are consumed as governed data-sharing primitives, such as Kafka topics for streaming data and Iceberg/Hudi tables for tabular data. While the public interface may be a topic or a table, the logic/infra that produces the topic or table may be varied. We really don’t want to just emit events that are mirrors of the private schema of the source database tables (due to the high coupling it causes). Just as REST APIs are not mirrors of the underlying database, the data product also requires some level of abstraction and internal transformation. Gunnar Morling wrote an excellent post on this topic, focused on CDC and how to avoid breaking encapsulation. These data products should be capable of real-time or close to real-time because downstream consumers may also be real-time or incremental. As incremental computation spreads, it becomes a web of incremental vertices with edges between them: a graph of incremental computation that is spread across the operational and analytical estates. While the vertices and edges are different from the web of software services, the underlying principles for building reliable and robust systems are the same — low coupling architectures based on stable, evolvable contracts. Because data flows across boundaries, data products should be based on open standards, just as software service contracts are built on HTTP and gRPC. They should come with tooling for schema evolution, access controls, encryption/data masking, data validation rules, etc. More than that, they should come with an expectation of stability and reliability — which comes about from mature engineering discipline and prioritizing these much-needed properties. These data products are owned by the data producers rather than the data consumers (who have no power to govern application databases). It’s not possible for a data team to own the data product whose source is another team’s application or database and expect it to be both sustainable and reliable. Again, I could make a software architect choke on their coffee, suggesting that my software team should build and maintain a REST API (we desperately need) that serves the data of another team’s database. Consumers don’t manage the APIs of source data; it’s the job of the data owner, aka the data producer. This is a hard truth for data analytics but one that is unquestioned in software engineering. The Challenge Ahead What I am describing is Shift Left applied to data analytics. The idea of shifting left is acknowledging that data analytics can’t be a silo where we dump raw data, clean it up, and transform it into something useful. It’s the way it has been done for so long with multi-hop architectures it’s really hard to consider something else. But look at how software engineers build a web of software services that consume each other's data (in real-time) – software teams are doing things very differently. The most challenging aspect of Shift Left is that it changes roles and responsibilities that are now ingrained in the enterprise. This is just how things have been done for a long time. That’s why I think Shift Left will be a gradual trend as it has to overcome this huge inertia. The role of data analytics systems has gone from reporting alone to now including or feeding running-the-business applications. 
Delaying the delivery of a report for a few hours was tolerable, but in operational systems, hours of downtime can mean huge amounts of lost revenue, so the importance of building reliable (low-coupling) systems has increased. What is holding back analytics right now is that it isn’t reliable enough, it isn’t fast enough, and it has the constant drain of reacting to change (with no control over the timing or shape of those changes). Organizations that shift responsibility for data to the left will build data analytics pipelines that source their data from reliable, stable sources. Rather than sucking in raw data from across the enterprise and dealing with change as it happens, we should build incremental analytics workloads that are robust in the face of changing applications and databases.

Ultimately, it’s about:

- Solving a people problem (getting data and software teams to work together).
- Applying sound engineering practices to create robust, low-coupling data architectures that can be fit for purpose for more business-critical workloads.

The trend of incremental computation is great, but it only raises the stakes.
Ulyp: Recording Java Execution Flow for Faster Debugging


By Andrey Cheboksarov
The article presents Ulyp, an open-source instrumentation agent that records method calls (including arguments and return values) of all third-party libraries in JVM apps. Software engineers can later upload a recording file to the UI desktop app in order to better understand the internals of libraries and even entire applications. The tool can help developers understand the internals of frameworks faster, gain deeper insights, find inefficiencies in software, and debug more effectively. In a few words, Ulyp allows you to run this code, which sets up a database source, a cache over the source, and then queries the cache:

Java

// a database source (backed by H2 database)
DatabaseJDBCSource source = new DatabaseJDBCSource();

// build a cache
LoadingCache<Integer, DatabaseEntity> cache = Caffeine.newBuilder()
    .maximumSize(10_000)
    .expireAfterWrite(Duration.ofMinutes(5))
    .refreshAfterWrite(Duration.ofMinutes(1))
    .build(source::findById);

DatabaseEntity fromCache = cache.get(5); // get from the cache

And extract the execution flow information:

Take a minute to get an understanding of what you see. That’s a call tree of all methods. We also captured object values and their identity hash codes. Read further if you want to know how Ulyp is implemented inside. The article also provides several examples of using the agent.

Challenges in Software Engineering

The scale of software solutions is nowhere near what it was years ago. Typical apps may have hundreds of instances running across multiple availability zones. The number of frameworks and libraries used as dependencies in a typical app is also higher than it was before, not to mention that those frameworks themselves can be gigantic. Working on large codebases with hundreds of thousands of lines of code is not an easy task. In many situations, such codebases have developed over a long period, and we might have access to only a few experts with a detailed understanding of the entire codebase. In enterprise applications, the absence or scarcity of developer documentation is a common issue. In such situations, onboarding a new engineer is more than challenging. An average engineer spends far more time reading code than writing it. Understanding how libraries and frameworks work inside and what they do is vital for successful Java software engineers, since it allows them to write more robust and performant code. Another problem is debugging a running instance of an application in an environment where a classic debugger might not be available. Usually, it’s possible to use logs and APM tracers, but these tools might not always suffice. One possible way to alleviate some of these problems is code execution recording. The idea is far from new, as there are already dozens of time-travel debuggers for different languages. It effectively eliminates the need for breakpoints in certain cases, as a software engineer can just observe the whole execution flow. It’s also feasible to record the execution in several apps simultaneously via remote control, which allows us to record what happened in a distributed environment.

Technical Design

Ulyp is an instrumentation agent written specifically for this task. Recording all function calls along with return values and arguments is possible thanks to JVM bytecode instrumentation. Bytecode instrumentation is a technique used to modify the bytecode of a Java application at runtime. It essentially means we can change the code of the Java app after it has started.
Currently, Ulyp uses a byte-buddy library, which does an immense job of handling all the work of instrumentation and makes it extremely easy for the entire Java community. One thing byte-buddy allows users to do is to define an advice containing code to wire into methods. Here is an example of such advice: Java public class MethodAdvice { @Advice.OnMethodEnter static void enter( @Advice.This(optional = true) Object callee, @Advice.AllArguments Object[] arguments) { ... agent code here } @Advice.OnMethodExit static void exit( @Advice.Thrown Throwable throwable, @Advice.Return Object returnValue) { ... agent code here } } Every bytecode instruction of the code inside methods is copied to instrumented methods. The agent is free to access references to the object being called (if the method is not static), arguments, return values as well as exceptions thrown out of the method. Our goal is to instrument all third-party library methods to capture their arguments and return values. After instrumentation is done, the agent can essentially intercept method calls. However, we should be really careful when intercepting code. If we do something heavy, we may slow down client app threads. If we do something dangerous (which may throw an exception), the client app may even break. That's exactly why recording most arguments and return values is done in the background thread. It works simply because most objects are either: Immutable; orWe record only their class name and identity hash code. Other objects like collections and arrays are recorded in client app threads. There are usually not so many such objects recorded, so it's not a big issue. Besides, recording collections and arrays is disabled by default and can only be enabled by a user. When capturing argument values, Ulyp uses special recorders. Recorders encode objects into bytes, and the UI app decodes these bytes to show object values to the user. At first glance, they look like common serializers. But unlike serializers, recorders do not capture the exact state of objects. For example, Ulyp only captures the first 200 symbols of String instances (this is configurable). Every thread gathers recording events in special thread-local buffers. Buffers are posted to the background thread, which encodes all events into bytes. Bytes are written to the file which is specified by user. All data is dumped to the file a user-specified via system properties. There are auxiliary threads that do the job of converting objects to binary format and writing to file. The resulting file can later be opened in UI app and entire flow can be analyzed. Enough of talk, let's dive into the examples of Ulyp usages. Example 1: Jackson We start from the simplest example of how we can use Ulyp by looking into Jackson, the most well-known library for JSON parsing in Java ecosystem. While this example does not provide any interesting insights, we still can see how a recording debugger can help us to look really quick into a third party library internals. The demo is quite simple: Java import com.fasterxml.jackson.core.JsonProcessingException; import com.fasterxml.jackson.databind.ObjectMapper; import java.util.List; public class JacksonDemo { static class Person { private String firstName; private String lastName; private int age; private List<String> hobbies; ... 
} public static void main(String[] args) throws JsonProcessingException { ObjectMapper objectMapper = new ObjectMapper(); String text = "{\"firstName\":\"Peter\",\"lastName\":\"Parker\",\"age\":20,\"hobbies\":[\"Photo\",\"Jumping\",\"Saving people\"]}"; System.out.println(objectMapper.readValue(text, Person.class)); System.out.println(objectMapper.readValue(text, Person.class)); } } We call the readValue twice, since the first call is heavier, as it includes lazy initialization logic inside the object mapper. We will see it shortly. If we want to use Ulyp, we just specify system properties as follows: Java --add-opens java.base/java.lang=ALL-UNNAMED --add-opens java.base/java.lang.invoke=ALL-UNNAMED -javaagent:~/ulyp-agent-1.0.1.jar -Dulyp.methods=**.ObjectMapper.readValue -Dulyp.file=~/jackson-ulyp-output.dat -Dulyp.record-constructors -Dulyp.record-collections=JDK -Dulyp.record-arrays We add --add-opens props for our code to work properly on Java 21. Next, we specify the path to agent itself. ulyp.methods property allows specifying when recording should start. In this scenario, we record ObjectMapper's method, which calls for parsing text into objects. Then, we set the output file where Ulyp should dump all recording data and props which configure Ulyp to record some collection elements (only standard library containers like ArrayList or HashMap) and arrays, as well as record constructors calls. The prop names are pretty much self-explanatory. Once the program finishes, we can upload the output file to the UI. When we do this, we see the following picture: There is a list of recorded methods on the left-hand side. We are able to observe the duration and number of nested calls for every recorded method. We can see the call tree on the right side. Every recorded method call is marked with black line which hints how many nested calls are inside, so that we can dive into the heaviest methods that contain more logic. If we choose the second method, we will see that the call tree has much fewer nested calls. In fact, it has only 300 calls, while the first method call has 5700! If we dive deep, we will soon get the idea why it's happening. In the first call tree, _findRootDeserializer has a lot of nested calls, while in the second call tree, it doesn't. We can easily guess that this is due to the deserializer instance being cached inside. If we dive deeper, we can observe how the framework does its job in subtle details. For example, we can spot that it processes JSON entry with key firstName and the value is Peter. StringDeserializer is used for parsing value from JSON text. We now have some understanding of how the tool works. However, this example doesn't show anything interesting in particular. Let's now move to something more interesting then. Example 2: Spring Proxy In the second example, we will look into how Spring implements proxies. Spring uses Java annotations to augment the logic of desired beans. The most common example is @Transactional annotation, which allows our method to be executed inside transaction. If one doesn't know how it works, it looks like magic to them since the only thing you do is place an annotation on the class. That's exactly what we have in our example where we have an empty method and the class is marked with the annotation: Java @Component @Transactional public class ExampleService { public void test() { System.out.println("hello"); } } But how exactly does Spring start transactions? Let's find out. 
We are going to setup a simple demo as follows: Java public class SpringProxyDemo { public static void main(String[] args) { ApplicationContext context = new AnnotationConfigApplicationContext(Configuration.class); ExampleService service = context.getBean(ExampleService.class); service.test(); } } So, we just get the bean from the context and then call the method. It should start a transaction, right? Let's find out. We are going to do exactly the same as before. We just run our code with system props that enable Ulyp. When we open the resulting file in UI, we see this: What we see is our service class name looks weird. What is this ExampleService$$EnhancerBySpringCGLIB$$af4abd82 classname? Turns out, that's how Spring actually implements proxying. So, when we call context.getBean(...), we actually get an instance of a different class. Let's dig deeper. If we expand the call tree, we can observe all major points that make the proxy work. First, DynamicAdvisedInterceptor is called, which determines a set of interceptors for the method. It returns an array with one element, which is an instance of TransactionInterceptor. You can guess, TransactionInterceptor is responsible for opening and commiting a transaction. That's exactly what we see it is doing in our case. It first determines a transaction manager. JpaTransactionManager is configured in our demo, so we are dealing with JPA transactions. We then can observe how a transaction is opened and committed. It's committed after proceedWithInvocation is called, which calls our service inside. We could dive deeper if we wanted. Just expand a method that creates a transaction, and you can easily navigate down to the Hibernate and H2 (database) levels! What About Kotlin? Java is not the only JVM-based language. The other popular one would be Kotlin. It would be nice to have support for it, right? Thankfully, bytecode instrumentation is available for any JVM-based language, and byte-buddy handles Kotlin instrumentation as well. To verify this, I decided to toy around with kotlin-ktor-exposed-starter. This example repo features some popular Kotlin libraries like exposed and ktor, so we have a chance to test how recording works. We jump straight to recorded methods of the WidgetService class, which loads widgets from the database. The call tree has a lot of calls to Kotlin standard library, which were invisible for us when we code (marked with red): Thankfully, Ulyp comes with property ulyp.exclude-packages which can disable instrumentation for certain packages. So, if we run an app with -Dulyp.exclude-packages=kotlin, we can observe that we no longer see these methods. Second, we can toggle Kotlin collection recording with -Dulyp.record-collections=JDK,KT. An additional option "KT" activates recording of collections which are part of Kotlin standard library. Overall, the picture is more nice and clear now: How Much Does It Cost? Performance Instrumentation overhead is quite severe, which can slow down app startup by several times. Recording overhead also slows down the execution. The overhead depends on the app type. For typical Java app, the slowdown is somewhat about x2-x5. For CPU-intensive apps, the overhead can be even larger. Overall, from my experience, it's not that scary, and you can trace and record even real-time apps provided that you run them locally or in the development environment. Memory Currently, Ulyp doesn't consume much memory in the heap. However, Ulyp can double-code cache usage. 
For gigantic apps, code cache tuning may be required. See the link above for more information on how to change the code cache size. If the software is launched locally, it’s not required anyway.

Conclusion

This was a simple demo of using Ulyp; other examples may be covered in separate articles. Ulyp doesn’t try to solve all existing problems, and it’s definitely not a silver bullet. The overhead of instrumentation can be quite high. You might not want to run it in a production environment, but dev/test environments are usually fine. If you can run your software locally or in a dev environment, it opens up the opportunity to see things from a completely different angle.

To sum it up, let’s highlight cases where Ulyp can help:

- Project onboarding: A software engineer is able to record and analyze the whole execution flow of an entire app.
- Debugging code: A software engineer can understand what a library is doing in mere minutes. This is especially beneficial when working with third-party libraries.
- Tracing a set of apps running somewhere in the cloud at once, i.e., debugging a distributed system. This is a tricky case that will be covered in a separate article.

There are also cases where using such a tool is not a good idea:

- Production environments: This one is straightforward. Just don’t.
- Performance-sensitive workloads: The overhead of recording can be quite high. A typical enterprise app is several times slower while being recorded, and with CPU-bound Java apps it is even worse. However, nothing really stops you from using the tool on a performance-sensitive app if your goal is debugging.

Thanks for reading.
Stress Testing for Resilience in Modern Infrastructure
By Ankush Madaan
5 Signs You’ve Built a Secretly Bad Architecture (And How to Fix It)
By John Vester
Leveraging AIOps for Observability Workflows: How to Improve the Scalability and Intelligence of Observability
By Pranav Kumar Chaudhary
Front-End Debugging Part 2: Console.log() to the Max

In my previous post, I talked about why Console.log() isn’t the most effective debugging tool. In this installment, we will do a bit of an about-face and discuss the ways in which Console.log() is fantastic. Let’s break down some essential concepts and practices that can make your debugging life much easier and more productive. Front-End Logging vs. Back-End Logging Front-end logging differs significantly from back-end logging, and understanding this distinction is crucial. Unlike back-end systems, where persistent logs are vital for monitoring and debugging, the fluid nature of front-end development introduces different challenges. When debugging backends, I’d often go for tracepoints, which are far superior in that setting. However, the frontend, with its constant need to refresh, reload, contexts switch, etc., is a very different beast. In the frontend, relying heavily on elaborate logging mechanisms can become cumbersome. While tracepoints remain superior to basic print statements, the continuous testing and browser reloading in front-end workflows lessen their advantage. Moreover, features like logging to a file or structured ingestion are rarely useful in the browser, diminishing the need for a comprehensive logging framework. However, using a logger is still considered best practice over the typical Console.log for long-term logging. For short-term logging Console.log has some tricks up its sleeve. Leveraging Console Log Levels One of the hidden gems of the browser console is its support for log levels, which is a significant step up from rudimentary print statements. The console provides five levels: log: Standard loggingdebug: Same as log but used for debugging purposesinfo: Informative messages, often rendered like log/debugwarn: Warnings that might need attentionerror: Errors that have occurred While log and debug can be indistinguishable, these levels allow for a more organized and filtered debugging experience. Browsers enable filtering the output based on these levels, mirroring the capabilities of server-side logging systems and allowing you to focus on relevant messages. Customizing Console Output With CSS Front-end development allows for creative solutions, and logging is no exception. Using CSS styles in the console can make logs more visually distinct. By utilizing %c in a console message, you can apply custom CSS: CSS console.customLog = function(msg) { console.log("%c" + msg,"color:black;background:pink;font-family:system-ui;font-size:4rem;-webkit-text-stroke: 1px black;font-weight:bold") } console.customLog("Dazzle") This approach is helpful when you need to make specific logs stand out or organize output visually. You can use multiple %c substitutions to apply various styles to different parts of a log message. Stack Tracing With console.trace() The console.trace() method can print a stack trace at a particular location, which can sometimes be helpful for understanding the flow of your code. However, due to JavaScript’s asynchronous behavior, stack traces aren’t always as straightforward as back-end debugging. Still, it can be quite valuable in specific scenarios, such as synchronous code segments or event handling. Assertions for Design-by-Contract Assertions in front-end code allow developers to enforce expectations and promote a “fail-fast” mentality. Using Console.assert(), you can test conditions: JavaScript console.assert(x > 0, 'x must be greater than zero'); In the browser, a failed assertion appears as an error, similar to console.error. 
An added benefit is that assertions can be stripped from production builds, removing any performance impact. This makes assertions a great tool for enforcing design contracts during development without compromising production efficiency.

Printing Tables for Clearer Data Visualization

When working with arrays or objects, displaying data as tables can significantly enhance readability. The console.table() method allows you to output structured data easily:

JavaScript

console.table(["Simple Array", "With a few elements", "in line"])

This method is especially handy when debugging arrays of objects, presenting a clear, tabular view of the data and making complex data structures much easier to understand.

Copying Objects to the Clipboard

Debugging often involves inspecting objects, and the copy(object) method allows you to copy an object’s content to the clipboard for external use. This feature is useful when you need to transfer data or analyze it outside the browser.

Inspecting With console.dir() and dirxml()

The console.dir() method provides a more detailed view of objects, showing their properties as you’d see in a debugger. This is particularly helpful for inspecting DOM elements or exploring API responses. Meanwhile, console.dirxml() allows you to view objects as XML, which can be useful when debugging HTML structures.

Counting Function Calls

Keeping track of how often a function is called or a code block is executed can be crucial. The console.count() method tracks the number of times it’s invoked, helping you verify that functions are called as expected:

JavaScript

function myFunction() {
  console.count('myFunction called');
}

You can reset the counter using console.countReset(). This simple tool can help you catch performance issues or confirm the correct execution flow.

Organizing Logs With Groups

To prevent log clutter, use console groups to organize related messages. console.group() starts a collapsible log section and console.groupEnd() closes it:

JavaScript

console.group('My Group');
console.log('Message 1');
console.log('Message 2');
console.groupEnd();

Grouping makes it easier to navigate complex logs and keeps your console clean.

Chrome-Specific Debugging Features

- Monitoring Functions: Chrome’s monitor() method logs every call to a function, showing the arguments and enabling a method-tracing experience.
- Monitoring Events: Using monitorEvents(), you can log events on an element. This is useful for debugging UI interactions. For example, monitorEvents(window, 'mouseout') logs only mouseout events.
- Querying Object Instances: queryObjects(Constructor) lists all objects created with a specific constructor, giving you insights into memory usage and object instantiation.

Final Word

Front-end debugging tools have come a long way. These tools provide a rich set of features that go far beyond simple console.log() statements. From log levels and CSS styling to assertions and event monitoring, mastering these techniques can transform your debugging workflow. If you read this post as part of my series, you will notice a big change in my attitude toward debugging when we reach the front end. Front-end debugging is very different from back-end debugging. When debugging the backend, I’m vehemently against code changes for debugging (e.g., print debugging), but on the frontend, this can be a reasonable hack. The change in environment justifies it: the lifecycle is short, the use case is single-user, and the risk is smaller.

By Shai Almog
What Is a Bug Bash?

What Is a Bug Bash? In software development, a Bug Bash is a procedure where all the developers, testers, program managers, usability researchers, designers, documentation folks, and even sometimes marketing people, put aside their regular day-to-day duties and “pound on the product” —  that is, each exercises the product in every way they can think of. Because each person will use the product in slightly different (or very different) ways, and the product is getting a great deal of use in a short amount of time, this approach may reveal bugs relatively quickly. [Wikipedia] Putting it in a simpler way, Bug Bashes are organized by the software teams where people collectively from all the relevant departments join hands to check the product and discover if it is fit for production release. The bugs, observations, and feedback received from the different participants are recorded and accordingly, a plan is created to fix them before the product is released to the end users. What Is the Objective Behind Organizing a Bug Bash? As the name and definition already mention, it is Bug Bash, so the major objective is to find bugs hidden in the application and resolve them before it reaches the end users of the product. However, The developers, business analysts, Project Managers, and quality analysts should all be on the same page. There should not be any blame game once a bug is found, as this is a collective team effort to make the application more stable and robust by providing faster feedback, so as to release a quality product in the market. Why Should a Bug Bash Be Conducted? Most importantly, Bug Bashes are a relatively cheaper way to find a large number of bugs in a short span of time.Bug bashes provide an opportunity for everyone in the organization to learn about the different parts of the product they are less familiar with.It improves cross-team collaborations, communication, and relationships.Helps to test the product with a wide variety and combination of devices/browsers/mobile versions/mobile OS’s which in general is very difficult to test in a short span of timePeople with different experiences in the team can collaborate and test the product effectively with different perspectives. Who Facilitates the Bug Bash? Ideally, the Tester or the Test Lead should facilitate the Bug Bash. When to Conduct a Bug Bash Bug Bashes are advised to be conducted before a major release or if there is any critical release that may impact the overall working of the product. The time to schedule may vary according to the collective decision made by the team. Normally, it should be conducted before a week of release or even sooner. A point to note here is that all the cards/tickets that are tagged for the release should be either "QA Done" or "Dev Done" before the Bug Bash is scheduled. It doesn't make any sense to have a Bug Bash if the feature that is going to be released is half-baked. How to Run a Bug Bash A Bug Bash session can be divided into the 3 different phases: Pre-Bug Bash sessionBug Bash sessionPost Bug Bash session 1. Pre-Bug Bash Define the facilitator for the Bug Bash. It would be ideal if 2 QAs could pair and lead this.The owners of the Bug Bash should set up a preparation meeting with the team, explain the agenda of Bug Bash to all the participants, and set up the pre-requisite, if any. 
If any team member requires any access related to the product/application, this could also possibly be figured out in the preparation call.It would be an added advantage if a representative from the client side could join the Bug Bash. It would help in terms of business requirements.Send out a calendar invite for the Bug Bash to all the participants and ask them to RSVP to it so you can plan out the event successfully.The following points need to be considered while sending the calendar invite: Mention the scope of the Bug Bash.The place where Bug Bash is scheduled to happen: Mention the meeting room details, or else, if it is a Zoom/Teams/Meet Call, update the link in it.Mention the details about the test environment and test data that could be used for testing.Attach the link to the Bug Bash sheet which has all details related to pre-requisites, OS/Browser/tools setup/description of features of the product.If it is a mobile app, do share the link from where the latest build should be downloaded for iOS as well as Android applications.Check if all the participants have access to the Bug Bash sheet as well the required links to download the artifacts (in case of mobile app)/links to the website under test. 2. Bug Bash Session It should be ideally a one-hour session, but could be increased to 90 minutes depending upon the requirement. It all depends on how well it works for you and your team. The facilitator should start by welcoming everyone to the session and explaining to them the scope and details of Bug Bash and ask them to start with it.Once initiated, the facilitator should monitor the participants' activities by checking if they can perform the testing without any blockers. If someone is blocked, he should help them in resolving the queries.Facilitators/coordinators should continuously monitor the Bug Bash sheet where issues are recorded.It should be thoroughly checked that participants are adding the bug details correctly, with proper test steps, screenshots, device/OS/browser details, and also their respective names; otherwise, it would be difficult to reproduce and recheck the same once Bug Bash is complete.Keep an eye on the time as well as once the decided time is reached. A call-out should be done by the facilitator if someone needs more time to perform tests. Accordingly, the session should be extended, if required. The facilitator should thank everyone for their participation and for giving their valuable time for the Bug Bash. 3. Post Bug Bash Session This is the most crucial and important session that needs to be set up. Most importantly, in this session, the business analyst, QAs, and the Product Owner should prioritize the issues reported. This session doesn’t require the whole team to be present. Business analysts, QAs, and Product Owners can meet and analyze the issues reported. They might also need to reproduce the issues and update the steps, if any, in the sheet. It should also be noted that all observations reported in the Bug Bash may not be a bug. Therefore, appropriate clarifications may be required to be taken from the reporter as to what is their perspective in reporting the respective observation. Once that is done and understood, the appropriate action to take would be to mark it as a Bug or Not a Bug. Once the priorities are defined for the bugs reported, tickets/cards should be created in the Sprint board labeling them as Bug Bash bugs. The ones with higher priority should be taken in the current Sprint and resolved at the earliest before the release. 
Accordingly, tickets/cards should be prepared for the lower priority issues as well and should be placed for the later Sprints. Bug Bash Template A sample “Bugbash_Template.xlsx” has been added inside the "Templates" folder of this GitHub repository which could help you in bug bashing! Conclusion To conclude, a Bug Bash is a great way to perform exploratory testing. It brings in people with different experiences and different teams. It also provides us with a variety of device/OS/browser coverage in a short time, which might help in uncovering the hidden issues. Having someone participating from the client side would help us in getting faster feedback before releasing the product to the end user. Happy Bug Bashing!!

By Faisal Khatri
Configuring Autoscaling for Various Machine Learning Model Types

AWS Sagemaker has simplified the deployment of machine learning models at scale. Configuring effective autoscaling policies is crucial for balancing performance and cost. This article aims to demonstrate how to set up various autoscaling policies using TypeScript CDK, focusing on request, memory, and CPU-based autoscaling for different ML model types. Model Types Based on Invocation Patterns At a high level, model deployment in SageMaker can be broken into three main categories based on invocation patterns: 1. Synchronous (Real-Time) Inference Synchronous inference is suitable when immediate response or feedback is required by end users, such as when a website interaction is required. This approach is particularly well-suited for applications that demand quick response times with minimal delay. Examples include fraud detection in financial transactions and dynamic pricing in ride-sharing. 2. Asynchronous Inference Asynchronous inference is ideal for handling queued requests when it is acceptable to process messages with a delay. This type of inference is preferred when the model is memory/CPU intensive and takes more than a few seconds to respond. For instance, video content moderation, analytics pipeline, and Natural Language Processing (NLP) for textbooks. 3. Batch Processing Batch processing is ideal when data needs to be processed in chunks (batches) or at scheduled intervals. Batch processing is mostly used for non-time-sensitive tasks when you need the output to be available at periodic intervals like daily or weekly. For example, periodic recommendation updates, where an online retailer generates personalized product recommendations for its customers weekly. Predictive maintenance, where daily jobs are run to predict machines that are likely to fail, is another good example. Types of Autoscaling in SageMaker With CDK Autoscaling in SageMaker can be tailored to optimize different aspects of performance based on the model’s workload: 1. Request-Based Autoscaling Use Case Best for real-time (synchronous) inference models that need low latency. Example Scaling up during peak shopping seasons for an e-commerce recommendation model to meet high traffic. 2. Memory-Based Autoscaling Use Case Beneficial for memory-intensive models, such as large NLP models. Example Increasing instance count when memory usage exceeds 80% for image processing models that require high resolution. 3. CPU-Based Autoscaling Use Case Ideal for CPU-bound models that require more processing power. Example Scaling for high-performance recommendation engines by adjusting instance count as CPU usage reaches 75%. 
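Before looking at the scaling policies themselves, it may help to see how the three invocation patterns described above map onto concrete SageMaker calls. The following is a hedged boto3 (Python) sketch rather than CDK; endpoint names, job names, and S3 paths are placeholders, and the asynchronous call assumes the endpoint was configured for async inference:

Python

import json
import boto3

runtime = boto3.client("sagemaker-runtime")
sm = boto3.client("sagemaker")

# 1. Synchronous (real-time) inference: request/response with low latency.
response = runtime.invoke_endpoint(
    EndpointName="YourEndpointName",
    ContentType="application/json",
    Body=json.dumps({"features": [1.2, 3.4, 5.6]}),
)
prediction = json.loads(response["Body"].read())

# 2. Asynchronous inference: the payload is staged in S3 and queued;
#    SageMaker writes the result to an output location when processing is done.
async_response = runtime.invoke_endpoint_async(
    EndpointName="YourAsyncEndpointName",
    ContentType="application/json",
    InputLocation="s3://your-bucket/requests/payload.json",
)
print(async_response["OutputLocation"])

# 3. Batch processing: a transform job scores a whole S3 prefix on a schedule.
sm.create_transform_job(
    TransformJobName="nightly-scoring",
    ModelName="YourModelName",
    TransformInput={
        "DataSource": {
            "S3DataSource": {"S3DataType": "S3Prefix",
                             "S3Uri": "s3://your-bucket/batch-input/"}
        },
        "ContentType": "application/json",
    },
    TransformOutput={"S3OutputPath": "s3://your-bucket/batch-output/"},
    TransformResources={"InstanceType": "ml.m5.large", "InstanceCount": 1},
)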
Configuring Autoscaling Policies in TypeScript CDK Below is an example configuration of different scaling policies using AWS CDK with TypeScript: TypeScript import * as cdk from 'aws-cdk-lib'; import * as sagemaker from 'aws-cdk-lib/aws-sagemaker'; import * as autoscaling from 'aws-cdk-lib/aws-applicationautoscaling'; import { Construct } from 'constructs'; export class SageMakerEndpointStack extends cdk.Stack { constructor(scope: Construct, id: string, props?: cdk.StackProps) { super(scope, id, props); const AUTO_SCALE_CONFIG = { MIN_CAPACITY: 1, MAX_CAPACITY: 3, TARGET_REQUESTS_PER_INSTANCE: 1000, CPU_TARGET_UTILIZATION: 70, MEMORY_TARGET_UTILIZATION: 80 }; // Create SageMaker Endpoint const endpointConfig = new sagemaker.CfnEndpointConfig(this, 'EndpointConfig', { productionVariants: [{ modelName: 'YourModelName', // Replace with your model name variantName: 'prod', initialInstanceCount: AUTO_SCALE_CONFIG.MIN_CAPACITY, instanceType: 'ml.c5.2xlarge' }] }); const endpoint = new sagemaker.CfnEndpoint(this, 'Endpoint', { endpointName: 'YourEndpointName', // Replace with your endpoint name endpointConfig: endpointConfig }); // Set up autoscaling const scalableTarget = endpoint.createScalableInstanceCount({ minCapacity: AUTO_SCALE_CONFIG.MIN_CAPACITY, maxCapacity: AUTO_SCALE_CONFIG.MAX_CAPACITY }); this.setupRequestBasedAutoscaling(scalableTarget); this.setupCpuBasedAutoscaling(scalableTarget, endpoint); this.setupMemoryBasedAutoscaling(scalableTarget, endpoint); this.setupStepAutoscaling(scalableTarget, endpoint); } private setupRequestBasedAutoscaling(scalableTarget: sagemaker.ScalableInstanceCount) { scalableTarget.scaleOnRequestCount('ScaleOnRequestCount', { targetRequestsPerInstance: AUTO_SCALE_CONFIG.TARGET_REQUESTS_PER_INSTANCE }); } private setupCpuBasedAutoscaling(scalableTarget: sagemaker.ScalableInstanceCount, endpoint: sagemaker.CfnEndpoint) { scalableTarget.scaleOnMetric('ScaleOnCpuUtilization', { metric: endpoint.metricCPUUtilization(), targetValue: AUTO_SCALE_CONFIG.CPU_TARGET_UTILIZATION }); } private setupMemoryBasedAutoscaling(scalableTarget: sagemaker.ScalableInstanceCount, endpoint: sagemaker.CfnEndpoint) { scalableTarget.scaleOnMetric('ScaleOnMemoryUtilization', { metric: endpoint.metricMemoryUtilization(), targetValue: AUTO_SCALE_CONFIG.MEMORY_TARGET_UTILIZATION }); } // Example configuration of step scaling. // Changes the number of instances to scale up and down based on CPU usage private setupStepAutoscaling(scalableTarget: sagemaker.ScalableInstanceCount, endpoint: sagemaker.CfnEndpoint) { scalableTarget.scaleOnMetric('StepScalingOnCpu', { metric: endpoint.metricCPUUtilization(), scalingSteps: [ { upper: 30, change: -1 }, { lower: 60, change: 0 }, { lower: 70, upper: 100, change: 1 }, { lower: 100, change: 2 } ], adjustmentType: autoscaling.AdjustmentType.CHANGE_IN_CAPACITY }); } } Note: CPU metrics can exceed 100% when instances have multiple cores, as they measure total CPU utilization. Balancing Autoscaling Policies by Model Type Autoscaling policies differ based on model requirements: Batch Processing Models Request- or CPU-based autoscaling is ideal here since you won't have to pay for resources when traffic is low or none. Synchronous Models In order to provide a swift response to spikes in real-time requests, request-based autoscaling is recommended. Asynchronous Models CPU-based scaling with longer cooldowns prevents over-scaling and maintains efficiency. Key Considerations for Effective Autoscaling 1. 
Cost Management Tune metric thresholds to optimize cost without sacrificing performance. 2. Latency Requirements For real-time models, prioritize low-latency scaling; batch and asynchronous models can handle slight delays. 3. Performance Monitoring Regularly assess model performance and adjust configurations to adapt to demand changes. Like in the example above, we can use more than one autoscaling policy to balance cost and performance, but that can lead to increased complexity in setup and management. Conclusion With AWS SageMaker's autoscaling options, you can effectively configure resource management for different types of ML models. By setting up request-based, memory-based, and CPU-based policies in CDK, you can optimize both performance and costs across diverse applications.

By Koushik Balaji Venkatesan
Primer on Distributed Parallel Processing With Ray Using KubeRay

In the early days of computing, applications handled tasks sequentially. As the scale grew with millions of users, this approach became impractical. Asynchronous processing allowed handling multiple tasks concurrently, but managing threads/processes on a single machine led to resource constraints and complexity. This is where distributed parallel processing comes in. By spreading the workload across multiple machines, each dedicated to a portion of the task, it offers a scalable and efficient solution. If you have a function to process a large batch of files, you can divide the workload across multiple machines to process files concurrently instead of handling them sequentially on one machine. Additionally, it improves performance by leveraging combined resources and provides scalability and fault tolerance. As the demands increase, you can add more machines to increase available resources. It is challenging to build and run distributed applications on scale, but there are several frameworks and tools to help you out. In this blog post, we’ll examine one such open-source distributed computing framework: Ray. We’ll also look at KubeRay, a Kubernetes operator that enables seamless Ray integration with Kubernetes clusters for distributed computing in cloud-native environments. But first, let’s understand where distributed parallelism helps. Where Does Distributed Parallel Processing Help? Any task that benefits from splitting its workload across multiple machines can utilize distributed parallel processing. This approach is particularly useful for scenarios such as web crawling, large-scale data analytics, machine learning model training, real-time stream processing, genomic data analysis, and video rendering. By distributing tasks across multiple nodes, distributed parallel processing significantly enhances performance, reduces processing time, and optimizes resource utilization, making it essential for applications that require high throughput and rapid data handling. When Distributed Parallel Processing Is Not Needed Small-scale applications: For small datasets or applications with minimal processing requirements, the overhead of managing a distributed system may not be justified.Strong data dependencies: If tasks are highly interdependent and cannot be easily parallelized, distributed processing may offer little benefit.Real-time constraints: Some real-time applications (e.g., finance and ticket booking websites) require extremely low latency, which might not be achievable with the added complexity of a distributed system.Limited resources: If the available infrastructure cannot support the overhead of a distributed system (e.g., insufficient network bandwidth, limited number of nodes), it may be better to optimize single-machine performance. How Ray Helps With Distributed Parallel Processing Ray is a distributed parallel processing framework that encapsulates all the benefits of distributed computing and solutions to the challenges we discussed, such as fault tolerance, scalability, context management, communication, and so on. It is a Pythonic framework, allowing the use of existing libraries and systems to work with it. With Ray’s help, a programmer doesn’t need to handle the pieces of the parallel processing compute layer. Ray will take care of scheduling and autoscaling based on the specified resource requirements. 
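As a minimal sketch of what "specified resource requirements" looks like in practice (the task and actor below are hypothetical examples, not from the article), Ray lets you declare per-task and per-actor CPU/GPU needs in the @ray.remote decorator, and the scheduler, or an autoscaling cluster, finds or adds nodes that can satisfy them:

Python

import ray

ray.init()  # or ray.init(address="auto") to connect to an existing cluster

# Hypothetical CPU-heavy task: scheduled only on a node with 2 free CPUs.
@ray.remote(num_cpus=2)
def preprocess(batch):
    return [x * 2 for x in batch]

# Hypothetical GPU-backed actor: each replica reserves one GPU.
@ray.remote(num_gpus=1)
class Embedder:
    def embed(self, batch):
        ...

# Launch eight tasks; Ray distributes them across the cluster in parallel.
futures = [preprocess.remote(list(range(1_000))) for _ in range(8)]
results = ray.get(futures)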
Ray provides a universal API of tasks, actors, and objects for building distributed applications. (Image Source)

Ray provides a set of libraries built on the core primitives, i.e., Tasks, Actors, Objects, Drivers, and Jobs. These provide a versatile API to help build distributed applications. Let's take a look at the core primitives, a.k.a. Ray Core.

Ray Core Primitives

Tasks: Ray tasks are arbitrary Python functions that are executed asynchronously on separate Python workers on a Ray cluster node. Users can specify their resource requirements in terms of CPUs, GPUs, and custom resources, which are used by the cluster scheduler to distribute tasks for parallelized execution.
Actors: What tasks are to functions, actors are to classes. An actor is a stateful worker, and the methods of an actor are scheduled on that specific worker and can access and mutate the state of that worker. Like tasks, actors support CPU, GPU, and custom resource requirements.
Objects: In Ray, tasks and actors create and compute objects. These remote objects can be stored anywhere in a Ray cluster. Object references are used to refer to them, and they are cached in Ray's distributed shared-memory object store.
Drivers: The program root, or the "main" program: this is the code that runs ray.init().
Jobs: The collection of tasks, objects, and actors originating (recursively) from the same driver, and their runtime environment.

For more information about the primitives, you can go through the Ray Core documentation.

Ray Core Key Methods

Below are some of the key methods within Ray Core that are commonly used:

ray.init() - Start the Ray runtime and connect to the Ray cluster.

import ray
ray.init()

@ray.remote - Decorator that specifies a Python function or class to be executed as a task (remote function) or actor (remote class) in a different process.

@ray.remote
def remote_function(x):
    return x * 2

.remote - Postfix to the remote functions and classes; remote operations are asynchronous.

result_ref = remote_function.remote(10)

ray.put() - Put an object in the in-memory object store; returns an object reference used to pass the object to any remote function or method call.

data = [1, 2, 3, 4, 5]
data_ref = ray.put(data)

ray.get() - Get a remote object (or objects) from the object store by specifying the object reference(s).

result = ray.get(result_ref)
original_data = ray.get(data_ref)

Here is an example of using most of the basic key methods:

import ray

ray.init()

@ray.remote
def calculate_square(x):
    return x * x

# Using .remote to create a task
future = calculate_square.remote(5)

# Get the result
result = ray.get(future)
print(f"The square of 5 is: {result}")

How Does Ray Work?

A Ray cluster is like a team of computers that share the work of running a program. It consists of a head node and multiple worker nodes. The head node manages the cluster state and scheduling, while worker nodes execute tasks and manage actors.

A Ray cluster

Ray Cluster Components

Global Control Store (GCS): The GCS manages the metadata and global state of the Ray cluster. It tracks tasks, actors, and resource availability, ensuring that all nodes have a consistent view of the system.
Scheduler: The scheduler distributes tasks and actors across available nodes. It ensures efficient resource utilization and load balancing by considering resource requirements and task dependencies.
Head node: The head node orchestrates the entire Ray cluster. It runs the GCS, handles task scheduling, and monitors the health of worker nodes.
Worker nodes: Worker nodes execute tasks and actors.
They perform the actual computations and store objects in their local memory.
Raylet: The raylet manages shared resources on each node and is shared among all concurrently running jobs.

You can check out the Ray v2 Architecture doc for more detailed information.

Working with existing Python applications doesn't require a lot of changes. The changes required would mainly be around the function or class that needs to be distributed. You can add a decorator and convert it into tasks or actors. Let's see an example of this.

Converting a Python Function Into a Ray Task

Python

# (Normal Python function)
def square(x):
    return x * x

# Usage
results = []
for i in range(4):
    result = square(i)
    results.append(result)

print(results)
# Output: [0, 1, 4, 9]

# (Ray implementation)
# Define the square task.
@ray.remote
def square(x):
    return x * x

# Launch four parallel square tasks.
futures = [square.remote(i) for i in range(4)]

# Retrieve results.
print(ray.get(futures))
# -> [0, 1, 4, 9]

Converting a Python Class Into a Ray Actor

Python

# (Regular Python class)
class Counter:
    def __init__(self):
        self.i = 0

    def get(self):
        return self.i

    def incr(self, value):
        self.i += value

# Create an instance of the Counter class
c = Counter()

# Call the incr method on the instance
for _ in range(10):
    c.incr(1)

# Get the final state of the counter
print(c.get())
# Output: 10

# (Ray implementation as an actor)
# Define the Counter actor.
@ray.remote
class Counter:
    def __init__(self):
        self.i = 0

    def get(self):
        return self.i

    def incr(self, value):
        self.i += value

# Create a Counter actor.
c = Counter.remote()

# Submit calls to the actor. These calls run asynchronously
# but in submission order on the remote actor process.
for _ in range(10):
    c.incr.remote(1)

# Retrieve final actor state.
print(ray.get(c.get.remote()))
# -> 10

Storing Information in Ray Objects

Python

import numpy as np

# (Regular Python function)
# Define a function that sums the values in a matrix
def sum_matrix(matrix):
    return np.sum(matrix)

# Call the function with a literal argument value
print(sum_matrix(np.ones((100, 100))))
# Output: 10000.0

# Create a large array
matrix = np.ones((1000, 1000))

# Call the function with the large array
print(sum_matrix(matrix))
# Output: 1000000.0

# (Ray implementation of the function)
# Define a task that sums the values in a matrix.
@ray.remote
def sum_matrix(matrix):
    return np.sum(matrix)

# Call the task with a literal argument value.
print(ray.get(sum_matrix.remote(np.ones((100, 100)))))
# -> 10000.0

# Put a large array into the object store.
matrix_ref = ray.put(np.ones((1000, 1000)))

# Call the task with the object reference as an argument.
print(ray.get(sum_matrix.remote(matrix_ref)))
# -> 1000000.0

To learn more about these concepts, head over to the Ray Core Key Concepts docs.

Ray vs. the Traditional Approach to Distributed Parallel Processing

Below is a comparative analysis of the traditional (without Ray) approach vs. Ray on Kubernetes for enabling distributed parallel processing.
Aspect | Traditional Approach | Ray on Kubernetes
Deployment | Manual setup and configuration | Automated with KubeRay Operator
Scaling | Manual scaling | Automatic scaling with Ray Autoscaler and Kubernetes
Fault Tolerance | Custom fault tolerance mechanisms | Built-in fault tolerance with Kubernetes and Ray
Resource Management | Manual resource allocation | Automated resource allocation and management
Load Balancing | Custom load-balancing solutions | Built-in load balancing with Kubernetes
Dependency Management | Manual dependency installation | Consistent environment with Docker containers
Cluster Coordination | Complex and manual | Simplified with Kubernetes service discovery and coordination
Development Overhead | High, with custom solutions needed | Reduced, with Ray and Kubernetes handling many aspects
Flexibility | Limited adaptability to changing workloads | High flexibility with dynamic scaling and resource allocation

Kubernetes provides an ideal platform for running distributed applications like Ray due to its robust orchestration capabilities. Below are the key points that make running Ray on Kubernetes valuable:

Resource management
Scalability
Orchestration
Integration with the ecosystem
Easy deployment and management

The KubeRay Operator makes it possible to run Ray on Kubernetes.

What Is KubeRay?

The KubeRay Operator simplifies managing Ray clusters on Kubernetes by automating tasks such as deployment, scaling, and maintenance. It uses Kubernetes Custom Resource Definitions (CRDs) to manage Ray-specific resources.

KubeRay CRDs

It has three distinct CRDs (image source):

RayCluster: This CRD helps manage the RayCluster lifecycle and takes care of autoscaling based on the configuration defined.
RayJob: Useful when there is a one-time job you want to run instead of keeping a standby RayCluster running all the time. It creates a RayCluster and submits the job when ready. Once the job is done, it deletes the RayCluster. This helps in automatically recycling the RayCluster.
RayService: This also creates a RayCluster but deploys a RayServe application on it. This CRD makes it possible to do in-place updates to the application, providing zero-downtime upgrades and updates to ensure the high availability of the application.

Use Cases of KubeRay

Deploying an On-Demand Model Using RayService

RayService allows you to deploy models on demand in a Kubernetes environment. This can be particularly useful for applications like image generation or text extraction, where models are deployed only when needed. Here is an example of Stable Diffusion. Once it is applied in Kubernetes, it will create a RayCluster and also run a RayService, which will serve the model until you delete this resource. It allows users to take control of resources.

Training a Model on a GPU Cluster Using RayJob

RayService serves a different requirement, where it keeps the model or application deployed until it is deleted manually. In contrast, RayJob allows one-time jobs for use cases like training a model, preprocessing data, or running inference for a fixed number of given prompts.

Run an Inference Server on Kubernetes Using RayService or RayJob

Generally, we run our applications in Deployments, which maintain rolling updates without downtime. Similarly, in KubeRay, this can be achieved using RayService, which deploys the model or application and handles rolling updates. However, there could be cases where you just want to do batch inference instead of running the inference servers or applications for a long time.
This is where you can leverage RayJob, which is similar to the Kubernetes Job resource. Image Classification Batch Inference with the Hugging Face Vision Transformer is an example of a RayJob that does batch inferencing.

These are the use cases of KubeRay, enabling you to do more with your Kubernetes cluster. With the help of KubeRay, you can run mixed workloads on the same Kubernetes cluster and offload GPU-based workload scheduling to Ray.

Conclusion

Distributed parallel processing offers a scalable solution for handling large-scale, resource-intensive tasks. Ray simplifies the complexities of building distributed applications, while KubeRay integrates Ray with Kubernetes for seamless deployment and scaling. This combination enhances performance, scalability, and fault tolerance, making it ideal for web crawling, data analytics, and machine learning tasks. By leveraging Ray and KubeRay, you can efficiently manage distributed computing and meet the demands of today's data-driven world with ease. Moreover, as compute resources shift from CPU- to GPU-based, efficient and scalable cloud infrastructure becomes important for all sorts of applications, whether AI or large-scale data processing.

If you found this post informative and engaging, I'd love to hear your thoughts, so do start a conversation on LinkedIn.

By Sudhanshu Prajapati
Apache Iceberg: The Open Table Format for Lakehouses and Data Streaming

Every data-driven organization has operational and analytical workloads. A best-of-breed approach emerges with various data platforms, including data streaming, data lake, data warehouse and lakehouse solutions, and cloud services. An open table format framework like Apache Iceberg is essential in the enterprise architecture to ensure reliable data management and sharing, seamless schema evolution, efficient handling of large-scale datasets, and cost-efficient storage while providing strong support for ACID transactions and time travel queries. This article explores market trends; adoption of table format frameworks like Iceberg, Hudi, Paimon, Delta Lake, and XTable; and the product strategy of some of the leading vendors of data platforms such as Snowflake, Databricks (Apache Spark), Confluent (Apache Kafka/Flink), Amazon Athena, and Google BigQuery.

What Is an Open Table Format for a Data Platform?

An open table format helps in maintaining data integrity, optimizing query performance, and ensuring a clear understanding of the data stored within the platform. The open table format for data platforms typically includes a well-defined structure with specific components that ensure data is organized, accessible, and easily queryable. A typical table format contains a table name, column names, data types, primary and foreign keys, indexes, and constraints.

This is not a new concept. Your favorite decades-old database — like Oracle, IBM DB2 (even on the mainframe), or PostgreSQL — uses the same principles. However, the requirements and challenges changed a bit for cloud data warehouses, data lakes, and lakehouses regarding scalability, performance, and query capabilities.

Benefits of a "Lakehouse Table Format" Like Apache Iceberg

Every part of an organization becomes data-driven. The consequence is extensive data sets, data sharing with data products across business units, and new requirements for processing data in near real-time. Apache Iceberg provides many benefits for enterprise architecture:

Single storage: Data is stored once (coming from various data sources), which reduces cost and complexity.
Interoperability: Access without integration efforts from any analytical engine.
All data: Unify operational and analytical workloads (transactional systems, big data logs/IoT/clickstream, mobile APIs, third-party B2B interfaces, etc.).
Vendor independence: Work with any favorite analytics engine (no matter if it is near real-time, batch, or API-based).

Apache Hudi and Delta Lake provide the same characteristics. However, Delta Lake is mainly driven by Databricks as a single vendor.

Table Format and Catalog Interface

It is important to understand that discussions about Apache Iceberg or similar table format frameworks include two concepts: table format and catalog interface. As an end user of the technology, you need both! The Apache Iceberg project implements the format but only provides a specification (not an implementation) for the catalog:

The table format defines how data is organized, stored, and managed within a table.
The catalog interface manages the metadata for tables and provides an abstraction layer for accessing tables in a data lake.

The Apache Iceberg documentation explores these concepts in much more detail, based on this diagram (source: Apache Iceberg documentation).

Organizations use various implementations for Iceberg's catalog interface. Each integrates with different metadata stores and services.
Key implementations include:

Hadoop catalog: Uses the Hadoop Distributed File System (HDFS) or other compatible file systems to store metadata. Suitable for environments already using Hadoop.
Hive catalog: Integrates with the Apache Hive Metastore to manage table metadata. Ideal for users leveraging Hive for their metadata management.
AWS Glue catalog: Uses the AWS Glue Data Catalog for metadata storage. Designed for users operating within the AWS ecosystem.
REST catalog: Provides a RESTful interface for catalog operations via HTTP. Enables integration with custom or third-party metadata services.
Nessie catalog: Uses Project Nessie, which provides a Git-like experience for managing data.

The momentum and growing adoption of Apache Iceberg motivate many data platform vendors to implement their own Iceberg catalog. I discuss a few strategies in the section below about data platform and cloud vendor strategies, including Snowflake's Polaris, Databricks' Unity, and Confluent's Tableflow.

First-Class Iceberg Support vs. Iceberg Connector

Please note that supporting Apache Iceberg (or Hudi/Delta Lake) means much more than just providing a connector and integration with the table format via API. Vendors and cloud services differentiate with advanced features like automatic mapping between data formats, critical SLAs, time travel, intuitive user interfaces, and so on.

Let's look at an example: integration between Apache Kafka and Iceberg. Various Kafka Connect connectors have already been implemented. However, here are the benefits of a first-class integration with Iceberg (e.g., Confluent's Tableflow) compared to just using a Kafka Connect connector:

No connector config
No consumption through the connector
Built-in maintenance (compaction, garbage collection, snapshot management)
Automatic schema evolution
External catalog service synchronization
Simpler operations (in a fully managed SaaS solution, it is serverless with no need for any scaling or operations by the end user)

Similar benefits apply to other data platforms that provide first-class integration instead of simple connectors.

Open Table Format for a Data Lake/Lakehouse Using Apache Iceberg, Apache Hudi, and Delta Lake

The general goal of table format frameworks such as Apache Iceberg, Apache Hudi, and Delta Lake is to enhance the functionality and reliability of data lakes by addressing common challenges associated with managing large-scale data. These frameworks help to:

Improve data management: Facilitate easier handling of data ingestion, storage, and retrieval in data lakes. Enable efficient data organization and storage, supporting better performance and scalability.
Ensure data consistency: Provide mechanisms for ACID transactions, ensuring that data remains consistent and reliable even during concurrent read and write operations. Support snapshot isolation, allowing users to view a consistent state of the data at any point in time.
Support schema evolution: Allow for changes in the data schema (such as adding, renaming, or removing columns) without disrupting existing data or requiring complex migrations.
Optimize query performance: Implement advanced indexing and partitioning strategies to improve the speed and efficiency of data queries. Enable efficient metadata management to handle large datasets and complex queries effectively.
Enhance data governance: Provide tools for better tracking and managing of data lineage, versioning, and auditing, which are crucial for maintaining data quality and compliance.
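Before moving on, here is a minimal, illustrative PySpark sketch of what several of these capabilities look like in practice (a Hadoop-type catalog, transactional writes, schema evolution, and snapshot inspection for time travel). The catalog name, warehouse path, and table name are arbitrary, and the Iceberg Spark runtime package matching your Spark version is assumed to be on the classpath.

Python

from pyspark.sql import SparkSession

# Assumes the Iceberg Spark runtime JAR for your Spark version is available
# (e.g., via --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:<version>).
spark = (
    SparkSession.builder
    .appName("iceberg-demo")
    # Register a Hadoop-type Iceberg catalog named "local" backed by a warehouse path.
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.local.type", "hadoop")
    .config("spark.sql.catalog.local.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create an Iceberg table and write to it transactionally.
spark.sql("CREATE TABLE IF NOT EXISTS local.db.events (id BIGINT, action STRING) USING iceberg")
spark.sql("INSERT INTO local.db.events VALUES (1, 'login'), (2, 'purchase')")

# Schema evolution without rewriting existing data files.
spark.sql("ALTER TABLE local.db.events ADD COLUMNS (country STRING)")

# Inspect snapshots; a specific snapshot-id can then be used for time travel reads.
spark.sql("SELECT snapshot_id, committed_at FROM local.db.events.snapshots").show()
# spark.read.option("snapshot-id", <snapshot_id>).table("local.db.events").show()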
By addressing these goals, table format frameworks like Apache Iceberg, Apache Hudi, and Delta Lake help organizations build more robust, scalable, and reliable data lakes and lakehouses. Data engineers, data scientists, and business analysts leverage analytics, AI/ML, or reporting/visualization tools on top of the table format to manage and analyze large volumes of data.

Comparison of Apache Iceberg, Hudi, Paimon, and Delta Lake

I won't compare the table format frameworks Apache Iceberg, Apache Hudi, Apache Paimon, and Delta Lake here. Many experts have written about this already. Each table format framework has unique strengths and benefits, but updates are required every month because of the fast evolution and innovation, adding new improvements and capabilities within these frameworks. Here is a summary of what I see in various blog posts about the four options:

Apache Iceberg: Excels in schema and partition evolution, efficient metadata management, and broad compatibility with various data processing engines.
Apache Hudi: Best suited for real-time data ingestion and upserts, with strong change data capture capabilities and data versioning.
Apache Paimon: A lake format that enables building a real-time lakehouse architecture with Flink and Spark for both streaming and batch operations.
Delta Lake: Provides robust ACID transactions, schema enforcement, and time travel features, making it ideal for maintaining data quality and integrity.

A key decision point might be that Delta Lake is not driven by a broad community like Iceberg and Hudi, but mainly by Databricks as the single vendor behind it.

Apache XTable as an Interoperable Cross-Table Framework Supporting Iceberg, Hudi, and Delta Lake

Users have lots of choices. XTable, formerly known as OneTable, is yet another incubating table framework under the Apache open-source license to seamlessly interoperate cross-table between Apache Hudi, Delta Lake, and Apache Iceberg.

Apache XTable:

Provides cross-table omnidirectional interoperability between lakehouse table formats.
Is not a new or separate format. Apache XTable provides abstractions and tools for the translation of lakehouse table format metadata.

Maybe Apache XTable is the answer to providing options for specific data platforms and cloud vendors while still providing simple integration and interoperability. But be careful: a wrapper on top of different technologies is not a silver bullet. We saw this years ago when Apache Beam emerged. Apache Beam is an open-source, unified model and set of language-specific SDKs for defining and executing data ingestion and data processing workflows. It supports a variety of stream processing engines, such as Flink, Spark, and Samza. The primary driver behind Apache Beam is Google, which allows migrating workflows to Google Cloud Dataflow. However, the limitations are huge, as such a wrapper needs to find the least common denominator of supported features. And most frameworks' key benefit is the 20% that does not fit into such a wrapper. For these reasons, for instance, Kafka Streams intentionally does not support Apache Beam because it would have required too many design limitations.

Market Adoption of Table Format Frameworks

First of all, we are still in the early stages. In terms of the Gartner Hype Cycle, we are still at the innovation trigger, heading toward the peak of inflated expectations. Most organizations are still evaluating, but not yet adopting these table formats in production across the organization.
Flashback: The Container Wars of Kubernetes vs. Mesosphere vs. Cloud Foundry

The debate around Apache Iceberg reminds me of the container wars a few years ago. The term "container wars" refers to the competition and rivalry among different containerization technologies and platforms in the realm of software development and IT infrastructure. The three competing technologies were Kubernetes, Mesosphere, and Cloud Foundry.

Here is how it played out: Cloud Foundry and Mesosphere were early, but Kubernetes still won the battle. Why? I never understood all the technical details and differences. In the end, if the three frameworks are pretty similar, it comes down to:

Community adoption
Right timing of feature releases
Good marketing
Luck
And a few other factors

But it is good for the software industry to have one leading open-source framework to build solutions and business models on instead of three competing ones.

Present: The Table Format Wars of Apache Iceberg vs. Hudi vs. Delta Lake

Obviously, Google Trends is no statistical evidence or sophisticated research, but I have used it a lot in the past as an intuitive, simple, free tool to analyze market trends. Therefore, I also used this tool to see if Google searches overlap with my personal experience of the market adoption of Apache Iceberg, Hudi, and Delta Lake (Apache XTable is still too small to be added).

We obviously see a similar pattern to the container wars a few years ago. I have no idea where this is going. Whether one technology wins, or the frameworks differentiate enough to prove that there is no silver bullet, the future will show us.

My personal opinion? I think Apache Iceberg will win the race. Why? I cannot argue with any technical reasons. I just see many customers across all industries talk about it more and more, and more and more vendors start supporting it. But we will see. I actually do not care who wins. However, similar to the container wars, I think it is good to have a single standard and vendors differentiating with features around it, like it is with Kubernetes.

With this in mind, let's explore the current strategy of the leading data platforms and cloud providers regarding table format support in their platforms and cloud services.

Data Platform and Cloud Vendor Strategies for Apache Iceberg

I won't speculate in this section. The evolution of the table format frameworks moves quickly, and vendor strategies change quickly. Please refer to the vendors' websites for the latest information. But here is the status quo of the data platform and cloud vendor strategies regarding the support and integration of Apache Iceberg.
Snowflake:

Has supported Apache Iceberg for quite some time already
Adds better integrations and new features regularly
Internal and external storage options (with trade-offs) like Snowflake's storage or Amazon S3
Announced Polaris, an open-source catalog implementation for Iceberg, with a commitment to support community-driven, vendor-agnostic bi-directional integration

Databricks:

Focuses on Delta Lake as the table format and (now open-sourced) Unity as the catalog
Acquired Tabular, the leading company behind Apache Iceberg
Unclear future strategy of supporting an open Iceberg interface (in both directions) or only feeding data into its lakehouse platform and technologies like Delta Lake and Unity Catalog

Confluent:

Embeds Apache Iceberg as a first-class citizen into its data streaming platform (the product is called Tableflow)
Converts a Kafka topic and related schema metadata (i.e., the data contract) into an Iceberg table
Bi-directional integration between operational and analytical workloads
Analytics with embedded serverless Flink and its unified batch and streaming API, or data sharing with third-party analytics engines like Snowflake, Databricks, or Amazon Athena

More data platforms and open-source analytics engines:

The list of technologies and cloud services supporting Iceberg grows every month
A few examples: Apache Spark, Apache Flink, ClickHouse, Dremio, Starburst using Trino (formerly PrestoSQL), Cloudera using Impala, Imply using Apache Druid, Fivetran

Cloud service providers (AWS, Azure, Google Cloud, Alibaba) have different strategies and integrations, but all of them are increasing Iceberg support across their services these days, for instance:

Object storage: Amazon S3, Azure Data Lake Storage (ADLS), Google Cloud Storage
Catalogs: Cloud-specific like the AWS Glue Catalog, or vendor-agnostic like Project Nessie or the Hive Catalog
Analytics: Amazon Athena, Azure Synapse Analytics, Microsoft Fabric, Google BigQuery

Shift Left Architecture With Kafka, Flink, and Iceberg to Unify Operational and Analytical Workloads

The shift left architecture moves data processing closer to the data source, leveraging real-time data streaming technologies like Apache Kafka and Flink to process data in motion directly after it is ingested. This approach reduces latency and improves data consistency and data quality.

Unlike ETL and ELT, which involve batch processing with the data stored at rest, the shift left architecture enables real-time data capture and transformation. It aligns with the zero-ETL concept by making data immediately usable. But in contrast to zero-ETL, shifting data processing to the left side of the enterprise architecture avoids a complex, hard-to-maintain spaghetti architecture with many point-to-point connections. The shift left architecture also reduces the need for reverse ETL by ensuring data is actionable in real time for both operational and analytical systems. Overall, this architecture enhances data freshness, reduces costs, and speeds up the time-to-market for data-driven applications. Learn more about this concept in my blog post about "The Shift Left Architecture."

Apache Iceberg as Open Table Format and Catalog for Seamless Data Sharing Across Analytics Engines

An open table format and catalog introduces enormous benefits into the enterprise architecture:

Interoperability
Freedom of choice of the analytics engines
Faster time-to-market
Reduced cost

Apache Iceberg seems to become the de facto standard across vendors and cloud providers.
However, it is still at an early stage, and competing and wrapper technologies like Apache Hudi, Apache Paimon, Delta Lake, and Apache XTable are trying to gain momentum, too.

Iceberg and other open table formats are not just a huge win for single storage and integration with multiple analytics/data/AI/ML platforms such as Snowflake, Databricks, Google BigQuery, et al., but also for the unification of operational and analytical workloads using data streaming with technologies such as Apache Kafka and Flink. The shift left architecture is a significant benefit to reduce efforts, improve data quality and consistency, and enable real-time applications and insights instead of batch.

Finally, if you still wonder what the differences are between data streaming and lakehouses (and how they complement each other), check out this ten-minute video.

What is your table format strategy? Which technologies and cloud services do you connect? Let's connect on LinkedIn and discuss it!

By Kai Wähner
Licenses With Daily Time Fencing

Despite the useful features offered by software, software pricing and packaging sometimes repel consumers and demotivate them from even taking the first step of evaluation. We rarely see software or hardware used for the full 24 hours of a day, yet as a consumer, I am paying for all 24 hours. At the same time, as a cloud software vendor, I know my customer is not using the cloud application for 24 hours, but I am still paying the infrastructure provider for 24 hours. On the 23rd of July, 2024, we brainstormed about the problem and identified a solution. A license with daily time fencing can help consumers by offering them a cheaper license and can also help ISVs with infrastructure demand forecasting and implementing eco-design.

Introduction

There are many scenarios where a license with daily time fencing can help.

Scenario 1: Industries to Implement Eco-Design

Our societies are evolving with more awareness of the impact of climate change, and countries across the globe are looking for a carbon-neutral economy. This results in the demand for carbon-credit-linked machine usage. To support this, machine vendors need a mechanism that allows industries to use a machine for a specified duration of a day. The machine vendor will issue a license with daily time limits to the industry. It will be computed based on how much GHG (greenhouse gas) the machine produces per hour and how much carbon credit the industry has. Over time, it can be made dynamic by the industry feeding carbon-credit information to machine vendors. This enables machine vendors to automatically issue a new license that allows industries to use the machine for more hours in a day.

Scenario 2: BPO Working in Multiple Shifts

BPOs across the globe provide 24-hour support to business users. But all centers don't have the same number of employees. Suppose there are three centers, and each works an 8-hour slot.

BPO | Time (in UTC) | Employees
India | 00:00 to 08:00 | 100
Philippines | 08:00 to 16:00 | 200
Brazil | 16:00 to 00:00 | 50

In the above scenario, the BPO traditionally purchases a 200-seat license with 24-hour daily consumption. But with daily time-fenced licenses, the ISV can offer three different licenses:

L1 (India) – 100 seats with daily time limits (00:00 to 08:00)
L2 (Philippines) – 200 seats with daily time limits (08:00 to 16:00)
L3 (Brazil) – 50 seats with daily time limits (16:00 to 00:00)

Let's compute the cost, assuming the 8-hour license per seat costs $5.

Traditional license cost: (24 / 8) x (200 x 5) = $3,000
New license cost: (100 + 200 + 50) x 5 = $1,750

In addition to this cost saving for the consumer, the ISV gets better transparency with 350 distinct users instead of 200 users.

Scenario 3: Maintenance/Support License

Software consumers can purchase 24-hour support or business-hours support (9 AM to 5 PM). 24-hour support is more expensive than 8 business hours. The ISV can implement a support module in their application based on the license.

Scenario 4: Work-Life Balance License

Work-life balance is an inescapable goal of an organization, and with cloud software available 24 hours a day, it is getting difficult for organizations to enforce it. This is also a sheer waste of resources, as infrastructure is live at 100% capacity. Daily time-fenced licenses can help organizations strike a work-life balance for employees and, at the same time, optimize the use of office resources.

Solution in a Nutshell

Three new fields can be introduced in a license that supports daily time fencing.
DailyStartTime
DailyEndTime
DailyTimeConsumptionLimit

Case 1: License with a fixed time window in a day and no limit on daily consumption:

DailyStartTime: 09:00:00
DailyEndTime: 17:00:00
DailyTimeConsumptionLimit: 24hrs

Case 2: License with no fixed time window in a day but a limit on daily consumption:

DailyStartTime: 00:00:00
DailyEndTime: 23:59:59
DailyTimeConsumptionLimit: 3hrs

Case 3: License with a fixed time window in a day and a limit on consumption as well:

DailyStartTime: 09:00:00
DailyEndTime: 17:00:00
DailyTimeConsumptionLimit: 3hrs

Note: In all examples above, the license is valid for a full year (e.g., LicenseStartDate: 01-Jan-2024, LicenseEndDate: 31-Dec-2024), and the new properties above only influence daily consumption.

Flow Chart

Conclusion

Licensing strategies enable ISVs to expand their customer base by offering cost-effective solutions to customers. Daily time-fenced licenses help consumers select a license that truly represents their usage (less than 24 hours), and they help ISVs forecast their infrastructure needs.
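As a rough, illustrative sketch (not part of the original proposal), this is how an application might enforce the three fields at runtime; the field names mirror the ones introduced above, and the in-memory usage counter stands in for whatever persistent per-day tracking a real license service would use.

Python

from datetime import datetime, time, timedelta

# Illustrative license record using the three new fields described above.
LICENSE = {
    "DailyStartTime": time(9, 0, 0),
    "DailyEndTime": time(17, 0, 0),
    "DailyTimeConsumptionLimit": timedelta(hours=3),
}

# In-memory stand-in for persisted per-day usage tracking.
usage_today = timedelta(0)

def can_use_now(now: datetime) -> bool:
    """Return True if usage is allowed at 'now' under the daily fence."""
    within_window = LICENSE["DailyStartTime"] <= now.time() <= LICENSE["DailyEndTime"]
    under_daily_limit = usage_today < LICENSE["DailyTimeConsumptionLimit"]
    return within_window and under_daily_limit

def record_usage(minutes: int) -> None:
    """Add consumed time to today's tally (reset daily in a real system)."""
    global usage_today
    usage_today += timedelta(minutes=minutes)

if can_use_now(datetime.now()):
    record_usage(30)  # e.g., a 30-minute session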

By Arvind Bharti
KubeVirt Implementation: Who Needs It and Why?

The adoption of cloud-native architectures and containerization is transforming the way we develop, deploy, and manage applications. Containers offer speed, agility, and scalability, fueling a significant shift in IT strategies. However, the reality for many organizations is that virtual machines (VMs) continue to play a critical role, especially when it comes to legacy or stateful applications. Even leading financial institutions like Goldman Sachs recognize the value of VMs alongside containerized workloads and are exploring ways to manage them efficiently.

This creates a potential divide: the benefits of containerization on one side and the enduring need for VMs on the other. KubeVirt bridges this gap by extending the power of Kubernetes to virtual machine management, giving you the ability to unify your infrastructure while enabling a smoother transition to cloud-native technologies. In this article, we explore why KubeVirt is a compelling solution for organizations seeking to streamline IT operations and gain flexibility in a hybrid infrastructure environment.

What Exactly Is KubeVirt?

KubeVirt is an open-source project that transforms Kubernetes into a powerful platform capable of managing both containers and virtual machines. Put simply, KubeVirt turns Kubernetes into a single control plane for your entire infrastructure. Here's how it works:

KubeVirt as an extension: KubeVirt adds custom resource definitions (CRDs) to Kubernetes, introducing a new object type representing virtual machines.
Virtual machines as "pods": Using KubeVirt, each VM runs within a specialized pod, which tightly integrates VMs into the Kubernetes ecosystem.
Simplified VM management with Kubernetes: You can now leverage the same Kubernetes tools (kubectl and the Kubernetes API) and best practices to create, start, stop, migrate, and monitor VMs alongside your containerized workloads.

Think of KubeVirt as enabling Kubernetes to speak the language of virtualization, opening a world of possibilities for your infrastructure management.

The Business Impact of KubeVirt

KubeVirt delivers tangible benefits that go beyond technical elegance. By adopting KubeVirt, your organization stands to gain the following:

Seamless workload management: Break down the walls between your VM-based applications and your containerized workloads. Manage everything from a single platform using the same tools and processes, significantly simplifying operations and reducing complexity.
Enhanced resource efficiency: KubeVirt empowers you to run both traditional VMs and containers on the same underlying cluster hardware. Optimize resource utilization, improve infrastructure density, and potentially realize significant cost savings.
Accelerated modernization: Legacy applications tied to VMs don't have to be a roadblock to innovation. KubeVirt provides a gradual pathway to modernizing applications at your own pace. You can containerize and migrate components over time, all within the same Kubernetes environment, minimizing disruption.
Future-proof infrastructure: By investing in KubeVirt, you align with cloud-native principles and position Kubernetes as the backbone of your technology stack. This fosters flexibility and agility, enabling you to adapt readily to evolving business requirements.

Why Should Your Organization Care?

KubeVirt delivers compelling value, especially in these areas:

IT teams and DevOps: KubeVirt simplifies operations by providing a unified control plane for all your workloads.
It lets you streamline workflows, reduce tooling overhead, and improve overall team efficiency.
Executives: Gain operational flexibility, achieve cost reductions, and follow a streamlined path toward infrastructure modernization. KubeVirt aligns technology investments with long-term business success.
Mixed workloads: If you're managing both legacy VM-based applications and modern containerized deployments, KubeVirt is essential. It lets you avoid vendor lock-in, minimize complexity, and maintain full control over your infrastructure choices.

Here are some specific pain points that KubeVirt addresses:

Frustrated with managing separate environments for VMs and containers? KubeVirt brings them together, making management far easier.
Seeking flexibility without compromising on existing investments? KubeVirt lets you leverage your VM infrastructure while modernizing.
Want to improve cost efficiency and resource usage? KubeVirt helps consolidate workloads for better utilization.
Struggling with complex migrations of legacy apps? Modernize incrementally and control your pace with KubeVirt.

Getting Started: Deployment and Implementation

Deploying KubeVirt requires a well-prepared Kubernetes environment. This section provides a detailed guide to help you set up and implement KubeVirt in your infrastructure.

Prerequisites

Before you begin, ensure the following requirements are met:

1. Kubernetes Cluster

You need a Kubernetes cluster (or a derivative such as OpenShift) based on one of the latest three Kubernetes releases available at the time of the KubeVirt release.

2. Kubernetes API Server Configuration

The Kubernetes API server must be configured with --allow-privileged=true to run KubeVirt's privileged DaemonSet.

3. Kubectl Utility

Ensure you have the kubectl client utility installed and configured to interact with your cluster.

4. Container Runtime Support

KubeVirt is supported on the following container runtimes:

containerd
crio (with runv)

Other container runtimes should work as well, but the ones mentioned are the primary targets.

5. Hardware Virtualization

Hardware with virtualization support is recommended. You can use virt-host-validate to ensure your hosts are capable of running virtualization workloads:

Shell

$ virt-host-validate qemu

Network and Security Considerations

Network configuration: Plan how your VMs will connect and interact with external networks and the rest of your Kubernetes environment.

AppArmor integration: On systems with AppArmor enabled, you might need to modify the AppArmor profiles to allow the execution of KubeVirt-privileged containers. For example:

Shell

# vim /etc/apparmor.d/usr.sbin.libvirtd
  /usr/libexec/qemu-kvm PUx,
# apparmor_parser -r /etc/apparmor.d/usr.sbin.libvirtd

KubeVirt Installation

To install KubeVirt, follow these steps:

1. Install the KubeVirt Operator

The KubeVirt operator simplifies the installation and management of KubeVirt components.
Run the following commands to deploy the latest KubeVirt release:

Shell

# Point at the latest release
$ export RELEASE=$(curl https://storage.googleapis.com/kubevirt-prow/release/kubevirt/kubevirt/stable.txt)

# Deploy the KubeVirt operator
$ kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-operator.yaml

# Create the KubeVirt CR (instance deployment request), which triggers the actual installation
$ kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-cr.yaml

# Wait until all KubeVirt components are up
$ kubectl -n kubevirt wait kv kubevirt --for condition=Available

2. Configuration for Non-Hardware Virtualization

If hardware virtualization is not available, enable software emulation by setting useEmulation to true in the KubeVirt CR:

YAML

$ kubectl edit -n kubevirt kubevirt kubevirt

# Add the following to the kubevirt.yaml file
spec:
  configuration:
    developerConfiguration:
      useEmulation: true

Implementation Best Practices

To get the most out of KubeVirt, follow these best practices:

1. Conduct a Workload Assessment

Prioritize VMs that are suitable for containerization. Start with less mission-critical applications to gain practical experience.

2. Assess Networking and Storage

Plan how to bridge VM networking with your existing Kubernetes networking and integrate storage solutions for persistent data using Container Storage Interface (CSI) plugins.

3. Emphasize Monitoring and Management

Use Kubernetes monitoring tools or explore KubeVirt-specific solutions to gain visibility into VM performance alongside your containers.

4. Live Migration

Enable and configure live migration to move running VMs to other compute nodes without downtime. This involves setting feature gates and configuring migration-specific parameters in the KubeVirt CR:

YAML

apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    developerConfiguration:
      featureGates:
        - LiveMigration
    migrations:
      parallelMigrationsPerCluster: 5
      parallelOutboundMigrationsPerNode: 2
      bandwidthPerMigration: 64Mi
      completionTimeoutPerGiB: 800
      progressTimeout: 150

Example Installation on OpenShift (OKD)

If you're using OKD, additional steps include configuring Security Context Constraints (SCC):

Shell

$ oc adm policy add-scc-to-user privileged -n kubevirt -z kubevirt-operator

Example Installation on k3OS

For k3OS, ensure you load the required modules on all nodes before deploying KubeVirt:

YAML

k3os:
  modules:
    - kvm
    - vhost_net

Restart the nodes with this configuration and then deploy KubeVirt as described above.

Installation of Daily Developer Builds

For the latest developer builds, run:

Shell

$ LATEST=$(curl -L https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/latest)
$ kubectl apply -f https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/${LATEST}/kubevirt-operator.yaml
$ kubectl apply -f https://storage.googleapis.com/kubevirt-prow/devel/nightly/release/kubevirt/kubevirt/${LATEST}/kubevirt-cr.yaml

By following these steps and best practices, you can ensure a smooth and successful KubeVirt implementation, providing a unified infrastructure management solution that leverages both virtual machines and containers.
Conclusion

KubeVirt offers a compelling path for organizations seeking to reap the benefits of cloud-native technologies while maximizing the value of existing virtual machine investments. It boosts operational efficiency, fosters flexibility, and accelerates your modernization journey.
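To illustrate the "VMs as Kubernetes objects" idea one step further, here is a rough Python sketch that creates a KubeVirt VirtualMachine custom resource through the official Kubernetes client. The manifest is deliberately simplified, the names and image are illustrative, and in practice you would usually apply the equivalent YAML with kubectl.

Python

from kubernetes import client, config

config.load_kube_config()  # use the current kubeconfig context

# Simplified, illustrative VirtualMachine manifest (KubeVirt CRD).
vm_manifest = {
    "apiVersion": "kubevirt.io/v1",
    "kind": "VirtualMachine",
    "metadata": {"name": "demo-vm"},
    "spec": {
        "running": True,
        "template": {
            "spec": {
                "domain": {
                    "devices": {"disks": [{"name": "containerdisk", "disk": {"bus": "virtio"}}]},
                    "resources": {"requests": {"memory": "1Gi"}},
                },
                "volumes": [
                    {
                        "name": "containerdisk",
                        "containerDisk": {"image": "quay.io/kubevirt/cirros-container-disk-demo"},
                    }
                ],
            }
        },
    },
}

# KubeVirt VMs are just another custom resource, so the generic CustomObjectsApi works.
api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="kubevirt.io",
    version="v1",
    namespace="default",
    plural="virtualmachines",
    body=vm_manifest,
)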

By Raza Shaikh
Driving RAG-Based AI Infrastructure

Large language models (LLMs) have transformed AI with their ability to process and generate human-like text. However, their static pre-trained knowledge presents challenges for dynamic, real-time tasks requiring current information or domain-specific expertise. Retrieval-augmented generation (RAG) addresses these limitations by integrating LLMs with external data sources. When paired with AI agents that orchestrate workflows, RAG-based infrastructure becomes a powerful tool for real-time decision-making, analytics, and automation.

System Architecture

The architecture of a RAG-based AI system includes several core components:

User Interaction Layer: This is the interface where users input queries. It can range from chatbots to APIs. The input is processed for downstream components. For example, in an enterprise setting, a user might request the latest compliance updates.
Query Preprocessing and Embedding Generation: The input is tokenized and converted into a vectorized format using models like OpenAI's Ada or Hugging Face Transformers. These embeddings capture semantic meaning, making it easier to match with relevant data.
Vector Database for Retrieval: A vector database like Pinecone or FAISS stores pre-indexed embeddings of documents. It retrieves the most relevant information by comparing query embeddings with stored embeddings. For example, a legal assistant retrieves specific GDPR clauses based on user queries.
LLM for Contextualization: Retrieved data is fed into an LLM, which synthesizes the information to generate responses. Models such as GPT-4 or Claude can create summaries, detailed explanations, or execute logic-based tasks.
Agent Orchestration Layer: AI agents act as managers that sequence tasks and integrate with APIs, databases, or tools. For example, a financial agent might retrieve transaction data, analyze patterns, and trigger alerts for anomalies.
Feedback and Optimization: The system collects feedback on responses and incorporates it into learning loops, improving relevance over time. Techniques such as Reinforcement Learning from Human Feedback (RLHF) and fine-tuning help refine the system.

Proposed Architecture Trade-Offs

Pros

Dynamic knowledge updates: By retrieving data from live sources, RAG ensures responses are current and accurate. For example, medical systems retrieve updated clinical guidelines for diagnostics.
Scalability: Modular components allow scaling with workload by adding resources to vector databases or deploying additional LLM instances.
Task automation: Orchestrated agents streamline multi-step workflows like data validation, content generation, and decision-making.
Cost savings: External retrieval reduces the need for frequent LLM retraining, lowering compute costs.

Cons

Latency: Integration of multiple components like vector databases and APIs can lead to response delays, especially with high query volumes.
Complexity: Maintaining and debugging such a system requires expertise in LLMs, retrieval systems, and distributed workflows.
Dependence on data quality: Low-quality or outdated indexed data leads to suboptimal results.
Security risks: Handling sensitive data across APIs and external sources poses compliance challenges, particularly in regulated industries.

Case Studies

1. Fraud Detection in Banking

A RAG-based system retrieves known fraud patterns from a vector database and analyzes real-time transactions for anomalies. If a match is detected, an AI agent escalates the case for review, enhancing financial security.
2. Legal Document Analysis

Legal assistants leverage LLMs with RAG to extract key clauses and flag potential risks in contracts. Indexed legal databases enable quick retrieval of precedent cases or regulatory guidelines, reducing manual review time.

3. Personalized Learning

In education, AI agents generate personalized lesson plans by retrieving resources from academic databases based on a student's performance. The LLM contextualizes this information, offering customized recommendations for improvement.

Conclusion

RAG-based AI infrastructure powered by LLMs and AI agents bridges the gap between static pre-trained knowledge and dynamic, real-time requirements. While the system's complexity and data dependencies present challenges, its ability to integrate live data and automate workflows makes it invaluable in applications like finance, healthcare, and education. With advancements in frameworks like LangChain and Pinecone, the adoption of RAG-based systems is poised to grow, delivering smarter, context-aware solutions.
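As a minimal, self-contained sketch of the retrieve-then-generate flow described above: embed_text and generate_answer are hypothetical stand-ins for whatever embedding model and LLM you actually use, and the in-memory list stands in for a vector database such as Pinecone or FAISS.

Python

import numpy as np

# Hypothetical stand-ins: replace with a real embedding model and LLM client.
def embed_text(text: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.random(384)  # fake 384-dim embedding for illustration

def generate_answer(prompt: str) -> str:
    return f"[LLM answer based on a prompt of {len(prompt)} chars]"

# Toy in-memory "vector database" of pre-indexed documents.
documents = [
    "GDPR Article 17 covers the right to erasure.",
    "Known fraud pattern: many small card payments followed by a large transfer.",
]
index = [(doc, embed_text(doc)) for doc in documents]

def retrieve(query: str, k: int = 1):
    # Rank documents by cosine similarity to the query embedding.
    q = embed_text(query)
    scored = [(doc, float(np.dot(q, emb) / (np.linalg.norm(q) * np.linalg.norm(emb))))
              for doc, emb in index]
    return [doc for doc, _ in sorted(scored, key=lambda x: x[1], reverse=True)[:k]]

# Retrieve-then-generate: ground the LLM prompt in the retrieved context.
query = "What does GDPR say about deleting personal data?"
context = "\n".join(retrieve(query))
print(generate_answer(f"Context:\n{context}\n\nQuestion: {query}"))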

By Apurva Kumar
AWS Performance Tuning: Why EC2 Autoscaling Isn’t a Silver Bullet

AWS EC2 Autoscaling is frequently regarded as the ideal solution for managing fluctuating workloads. It offers automatic adjustments of computing resources in response to demand, theoretically removing the necessity for manual involvement. Nevertheless, depending exclusively on EC2 Autoscaling can result in inefficiencies, overspending, and performance issues. Although Autoscaling is an effective tool, it does not serve as a one-size-fits-all remedy. Here's a comprehensive exploration of why Autoscaling isn't a guaranteed fix and suggestions for engineers to improve its performance and cost-effectiveness.

The Allure of EC2 Autoscaling

Autoscaling groups (ASGs) dynamically modify the number of EC2 instances to align with your application's workload. This feature is ideal for unpredictable traffic scenarios, like a retail site during a Black Friday rush or a media service broadcasting a live event. The advantages are evident:

Dynamic scaling: Instantly adds or removes instances according to policies or demand.
Cost management: Shields against over-provisioning in low-traffic times.
High availability: Guarantees that applications stay responsive during peak load.

Nonetheless, these benefits come with certain limitations.

The Pitfalls of Blind Reliance on Autoscaling

1. Cold Start Delays

Autoscaling relies on spinning up new EC2 instances when demand increases. This process involves:

Booting up a virtual machine.
Installing or configuring necessary software.
Connecting the instance to the application ecosystem.

In many cases, this can take several minutes — an eternity during traffic spikes. For example:

An e-commerce platform experiencing a flash sale might see lost sales and frustrated customers while waiting for new instances to come online.
A real-time analytics system could drop critical data points due to insufficient compute power during a sudden surge.

Solution: Pre-warm instances during expected peaks or use predictive scaling based on historical patterns.

2. Inadequate Load Balancing

Even with Autoscaling in place, improperly configured load balancers can lead to uneven traffic distribution. For instance:

A health-check misconfiguration might repeatedly route traffic to instances that are already overloaded.
Sticky sessions can lock users to specific instances, negating the benefits of new resources added by Autoscaling.

Solution: Pair Autoscaling with robust load balancer configurations, such as application-based routing and failover mechanisms.

3. Reactive Nature of Autoscaling

Autoscaling policies are inherently reactive — they respond to metrics such as CPU utilization, memory usage, or request counts. By the time the system recognizes the need for additional instances, the spike has already impacted performance.

Example: A fintech app processing high-frequency transactions saw delays when new instances took 5 minutes to provision. This lag led to compliance violations during market surges.

Solution: Implement predictive scaling using AWS Auto Scaling Plans or leverage AWS Lambda for instantaneous scaling needs where possible.

4. Costs Can Spiral Out of Control

Autoscaling can inadvertently cause significant cost overruns:

Aggressive scaling policies may provision more resources than necessary, especially during transient spikes.
Overlooked instance termination policies might leave idle resources running longer than intended.

Example: A SaaS platform experienced a 300% increase in cloud costs due to Autoscaling misconfigurations during a product launch.
Instances remained active long after the peak traffic subsided.

Solution: Use AWS Cost Explorer to monitor spending and configure instance termination policies carefully. Consider Reserved or Spot Instances for predictable workloads.

Enhancing Autoscaling for Real-World Efficiency

To overcome these challenges, Autoscaling must be part of a broader strategy:

1. Leverage Spot and Reserved Instances

Use a mix of Spot, Reserved, and On-Demand Instances. For example, Reserved Instances can handle baseline traffic, while Spot Instances handle bursts, reducing costs.

2. Combine With Serverless Architectures

Serverless services like AWS Lambda can absorb sudden, unpredictable traffic bursts without the delay of provisioning EC2 instances. For instance, a news website might use Lambda to serve spikes in article views after breaking news.

3. Implement Predictive Scaling

AWS's predictive scaling uses machine learning to forecast traffic patterns. A travel booking site, for example, could pre-scale instances before the surge in bookings during holiday seasons.

4. Optimize Application Performance

Sometimes the root cause of scaling inefficiencies lies in the application itself:

Inefficient code.
Database bottlenecks.
Overuse of I/O operations.

Invest in application profiling tools like Amazon CloudWatch and AWS X-Ray to identify and resolve these issues.

The Verdict

EC2 Autoscaling is an essential component of modern cloud infrastructure, but it's not a perfect solution. Cold start delays, reactive scaling, and cost overruns underscore the need for a more holistic approach to performance tuning. By combining Autoscaling with predictive strategies, serverless architectures, and rigorous application optimization, organizations can achieve the scalability and cost-efficiency they seek.

Autoscaling is an impressive tool, but like any tool, it's most effective when wielded thoughtfully. For engineers, the challenge is not whether to use Autoscaling but how to use it in harmony with the rest of the AWS ecosystem.
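As a small illustration of the pre-warming idea discussed above, here is a boto3 sketch that schedules capacity changes on an Auto Scaling group ahead of an expected peak; the group name, sizes, and cron schedules are made-up examples.

Python

import boto3

autoscaling = boto3.client("autoscaling")

# Raise capacity every weekday at 08:00 UTC, ahead of the morning traffic peak.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-asg",          # illustrative ASG name
    ScheduledActionName="prewarm-morning-peak",
    Recurrence="0 8 * * 1-5",                # cron expression, evaluated in UTC
    MinSize=4,
    MaxSize=12,
    DesiredCapacity=6,
)

# Scale back down after the peak window to avoid paying for idle capacity.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="web-asg",
    ScheduledActionName="scale-in-evening",
    Recurrence="0 20 * * 1-5",
    MinSize=1,
    MaxSize=12,
    DesiredCapacity=2,
)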

By John Akkarakaran Jose
The Art of Prompt Engineering in Incident Response

In the rapidly evolving field of Incident Response (IR), prompt engineering has become an essential skill that leverages AI to streamline processes, enhance response times, and provide deeper insights into threats. By creating precise and targeted prompts, IR teams can effectively utilize AI to triage alerts, assess threats, and even simulate incident scenarios, bringing significant value to cybersecurity operations. This article explores the foundations, benefits, and best practices for mastering prompt engineering in Incident Response, shedding light on how this practice is reshaping the field.

What Is Prompt Engineering in Incident Response?

Prompt engineering in the context of IR is the art and science of crafting highly specific, structured instructions for AI systems to guide them through various stages of incident management, from detection and assessment to remediation and post-incident analysis. Unlike conventional IR processes that rely on human input alone, prompt engineering allows IR teams to harness AI's analytical power to accelerate workflows and provide more data-driven responses to threats.

The goal of prompt engineering in IR is to ensure clarity and precision, enabling AI to focus on relevant aspects of an incident, filter out unnecessary information, and support the decision-making processes of IR professionals. With well-designed prompts, AI can sift through large volumes of data and present only the most critical insights, making it a powerful tool for handling the high volume and velocity of threats that security teams face daily.

Benefits of Prompt Engineering in IR

Prompt engineering provides numerous advantages that make it especially useful for IR teams operating under time constraints and high pressure. Here's a look at some of its core benefits:

Enhanced Speed and Efficiency

With tailored prompts, AI systems can automate tasks such as analyzing network traffic, triaging alerts, or identifying key indicators of compromise (IOCs). This automation frees up IR teams to focus on complex and high-priority incidents that require human judgment and expertise.

Improved Accuracy and Consistency

Prompt engineering reduces human error by enabling consistent responses across similar incidents. Standardized prompts ensure that incidents are handled uniformly, which is critical for maintaining the integrity of response protocols and meeting compliance standards.

Scalability

As organizations face an increasing number of threats, prompt engineering allows IR teams to scale their operations. By automating the initial phases of incident handling, prompt engineering makes it possible to manage a higher volume of alerts without sacrificing quality.

Informed Decision-Making

AI-driven insights can assist IR teams in making faster, more informed decisions. For example, AI can rapidly analyze logs or network traffic to pinpoint unusual patterns, giving security professionals a comprehensive view of the threat landscape.

Components of Effective Prompt Engineering in Incident Response

Creating an effective prompt for incident response requires a deep understanding of both the AI model's capabilities and the specific needs of the incident. Here are several essential components to consider:

Contextual Relevance

It's essential to provide context in prompts so that the AI system understands the scope and focus of the incident.
For example, instead of a vague instruction like "identify threats," a prompt should specify "identify all external IP addresses involved in brute-force attempts within the last 24 hours."

Operational Constraints

Including specific constraints helps narrow down the AI's analysis to the most relevant data. A prompt might specify constraints like timeframes, log types, or data sources; e.g., "analyze anomalies in login attempts between midnight and 6 a.m."

Iterative Refinement

Prompt engineering is rarely perfect on the first attempt. Using feedback loops to refine prompts based on the accuracy and relevance of AI responses can significantly improve results. This iterative approach allows for continuous optimization, ensuring the prompts remain aligned with the incident context.

Risk Prioritization

IR teams often need to address high-risk incidents first. Prompts that instruct the AI to prioritize certain conditions, such as "highlight critical alerts involving unauthorized data access," help ensure that the most significant threats are identified and addressed promptly.

Strategies for Crafting Effective Prompts in Incident Response

The quality of a prompt directly affects the AI's output, so it's crucial to approach prompt engineering strategically. Here are some proven strategies:

Providing an Identity to the Prompt

AI models produce better and more consistent results when the prompt assigns them an identity or role to assume while analyzing the data. For example: "Assume you are an investigator."

Being Specific, but Not Overly Restrictive

While specificity is essential, overly restrictive prompts can limit the AI's ability to detect relevant insights. For instance, instead of simply requesting "list errors in server logs," a more effective prompt would be: "identify significant error codes related to failed logins in auth logs." This approach gives the AI clear guidance without unnecessary restrictions.

Using Layered Prompts for Complex Incidents

For incidents involving multiple phases or indicators, it can be effective to use layered prompts. Start with a general analysis, and then refine subsequent prompts based on initial findings. For example, an initial prompt could be "identify any IP addresses with repeated failed login attempts," followed by a second prompt focusing on specific details, such as the geographic location of those IPs or a lookup of those IPs against a threat intelligence platform.

Leveraging Hypothetical Scenarios

By using scenario-based prompts, IR teams can simulate incident conditions to anticipate potential outcomes. For example, a prompt like "analyze potential escalation paths if malware is detected on this server" can provide insights that inform preemptive response planning.

Refining Prompts With Pre-Defined Criteria

Defining specific criteria within the prompt ensures the AI focuses on critical elements of the incident. A prompt might ask, "focus on recent IP addresses associated with failed login attempts outside business hours," helping the AI prioritize meaningful patterns over irrelevant data.

Chain-of-Thought Prompting

This technique asks the AI to reason through the task step by step, laying out a structured argument before reaching a conclusion, which encourages the model to consider every detail relevant to the request. For example: "Analyze this email for any phishing or spam content. Describe your reasoning in steps."
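Before moving on to concrete scenarios, here is a minimal Python sketch, not part of the original article, showing how the strategies above (assigning an identity, constraining scope, prioritizing risk, layering prompts, and requesting step-by-step reasoning) might be combined into a reusable prompt builder. The build_ir_prompt helper and the example tasks are hypothetical; how the resulting text is sent to a model depends on whichever AI platform your team has approved.

```python
# A minimal sketch of assembling an IR prompt from the strategies above.
# The helper and example tasks are illustrative; sending the resulting text
# to a model depends on your approved AI platform.
from __future__ import annotations


def build_ir_prompt(identity: str, task: str, constraints: list[str],
                    prioritize: str | None = None,
                    chain_of_thought: bool = True) -> str:
    """Combine an assumed identity, task, constraints, and reasoning instructions."""
    lines = [f"Assume you are {identity}.", task]
    if constraints:
        lines.append("Constraints:")
        lines.extend(f"- {c}" for c in constraints)
    if prioritize:
        lines.append(f"Prioritize: {prioritize}.")
    if chain_of_thought:
        lines.append("Describe your reasoning in steps.")
    return "\n".join(lines)


# Layer 1: a broad sweep of the authentication logs.
first_pass = build_ir_prompt(
    identity="an investigator",
    task="Identify any IP addresses with repeated failed login attempts in the attached auth logs.",
    constraints=["Only consider events between midnight and 6 a.m.",
                 "Limit the analysis to the last 24 hours."],
    prioritize="critical alerts involving unauthorized data access",
)

# Layer 2: a refined follow-up based on whatever the first pass returns.
second_pass = build_ir_prompt(
    identity="an investigator",
    task="For the IP addresses flagged above, summarize their geographic location "
         "and whether they appear in our threat intelligence sources.",
    constraints=[],
)

print(first_pass)
print(second_pass)
```

Templating prompts this way also helps with the consistency and auditing concerns discussed later in this article, since every analyst issues structurally identical requests.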
Examples of Prompt Engineering in IR Scenarios

To illustrate how prompt engineering works in practice, consider the following examples:

Scenario 1: Identifying Suspicious User Behavior

Prompt: "Analyze the login patterns over the last 48 hours for User 'pwned' in this SSH audit log. Identify unusual IP addresses and multiple failed attempts for this user."

Outcome: The output is a step-by-step account of how the logs were analyzed, the queries the LLM used, the suspicious IPs it flagged, and the brute-force attempts it observed.

Sample Partial Output: Explanation of the Log Patterns

  • Repeated Failed Attempts: The log entries show multiple failed login attempts from three distinct IPs (192.168.xx.xx, 203.0.xx.xx, 198.51.xx.xx) within short time intervals, simulating a brute-force attack pattern.
  • Accepted Logins: Intermittent successful login attempts, particularly from the same IP addresses used in failed attempts, could indicate that the attacker finally succeeded in guessing the password after multiple failed attempts.
  • Unusual IP Addresses: The IP addresses used can be flagged as unusual if they're not typically associated with user "pwned."

Scenario 2: Detecting Phishing Patterns

Prompt: "Examine email headers, URLs, and sender domains in the last five reported phishing attempts. Identify recurring patterns or compromise indicators."

Outcome: By isolating phishing indicators, AI can assist IR teams in preemptively recognizing and mitigating similar attacks.

Sample Prompt and Results

Prompt: "Assume you are a security engineer. Analyze this email for any phishing or spam content. Describe what was analyzed."

Key Analysis Points:

1. Sender Verification:
  • The email claims to be from LastPass, a well-known password manager, with the sender's address listed as replies@m.lastpass.com.
  • It includes DKIM (DomainKeys Identified Mail) and SPF (Sender Policy Framework) authentication, both of which have passed, suggesting that the email is genuinely coming from an authorized LastPass sender domain and hasn't been tampered with. This reduces, though doesn't eliminate, the likelihood of phishing.

2. Link Analysis:
  • Multiple links in the email direct users to what appear to be LastPass-related domains, but they are routed through the domain lnk.lastpass.com, which is a link-tracking service commonly used in legitimate marketing emails.
  • However, it is essential to hover over links (without clicking) to verify that they resolve to recognizable and legitimate LastPass URLs. If they divert to unknown domains, that would be a red flag for phishing.

3. Content and Intent:
  • The email promotes a 20% discount for upgrading LastPass services, a common incentive tactic. It also emphasizes cybersecurity tips, which aligns with LastPass's brand and messaging, especially around Cybersecurity Awareness Month.
  • The call-to-action buttons ("See pricing now," "Check your score now") are positioned to encourage user engagement, typical in legitimate marketing. Still, they could also be used in phishing emails to gather credentials or download malicious files.

4. Unsubscribe Option and Fine Print:
  • Legitimate marketing emails typically include unsubscribe links. This email contains a "List-Unsubscribe" option, which is standard in compliant marketing emails, reducing the chance of spam but requiring further validation.

5. Technical Headers and Anti-Spam Signals:
  • Anti-spam indicators, including ARC (Authenticated Received Chain) headers, show compliance with security practices, supporting legitimacy.
However, these signals can sometimes be replicated by well-crafted phishing campaigns.

Conclusion: Given the legitimate signs (SPF, DKIM, known LastPass link domain, unsubscribe options), this email likely originates from LastPass and is a standard promotional communication rather than spam or phishing. However, users should still verify links manually and avoid clicking if unsure.

Challenges and Solutions in Prompt Engineering for IR

Despite its potential, prompt engineering in IR also presents challenges that require careful consideration:

  • Overfitting prompts: Overly narrow prompts can limit AI's ability to generalize insights to new or unexpected incidents. IR teams should consider using adaptable templates that can be adjusted for various incident types while still maintaining a level of specificity.
  • Maintaining context awareness: AI models can sometimes lose context over extended interactions, producing outputs that veer off-topic. To address this, IR teams can structure prompts to periodically summarize key findings, ensuring AI remains focused on the incident's primary context.
  • Balancing automation with human expertise: While prompt engineering can automate many IR tasks, it's critical to maintain human oversight. Effective prompts should guide AI to supplement analysts' expertise rather than replace it, ensuring that incident response decisions are always well-informed.
  • Getting consistent results: One significant downside of using prompts in IR is the lack of consistency in results. This inconsistency can stem from several underlying factors, each of which impacts the reliability and trustworthiness of AI-driven incident response tasks.

Things to Note

As AI assumes a more central role in IR, prompt engineering will need to incorporate ethical safeguards to ensure responsible AI deployment, particularly for sensitive cases that involve privacy or regulatory compliance. Security engineers should always think about what data is being passed to the AI system and avoid compromising any critical information.

Key Risks and Challenges

The use of prompt engineering in incident response also introduces several risks:

  • Malicious prompt injections: Adversaries could insert malicious prompts into the AI systems used for incident response, causing those systems to produce flawed analyses or take harmful actions. This vulnerability is similar to SQL injection and must be addressed through rigorous input validation.
  • Data exposure: Poorly constructed prompts might inadvertently cause AI systems to reveal sensitive information about an organization's security posture or incident details.
  • Over-reliance on AI: There's a risk that security teams may become overly dependent on AI-generated responses, potentially missing nuanced aspects of an incident that require human expertise.
  • Accuracy and bias: AI models can produce inaccurate or biased results if not properly trained or if working with incomplete data, which could lead to misguided incident response actions.
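Several of these risks can be reduced before a prompt ever reaches the model. The sketch below is an illustration rather than part of the original article: it redacts obviously sensitive values (credentials, email addresses) from text destined for a prompt and flags phrasing commonly associated with prompt injection. The patterns and the screen_and_redact helper are hypothetical starting points, not a complete defense, and what should be redacted depends on what the model legitimately needs for the analysis.

```python
# A rough sketch of pre-processing text before it is embedded in an IR prompt:
# redact obviously sensitive values and flag likely prompt-injection phrasing.
# The regexes and phrase list below are illustrative, not an exhaustive defense.
from __future__ import annotations

import re

REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[REDACTED_EMAIL]"),                        # email addresses
    (re.compile(r"(?i)(password|secret|api[_-]?key)\s*[:=]\s*\S+"), r"\1=[REDACTED]"),    # credential pairs
]

INJECTION_PHRASES = [
    "ignore previous instructions",
    "disregard the above",
    "you are now",
]


def screen_and_redact(text: str) -> tuple[str, list[str]]:
    """Return redacted text plus any suspected injection phrases found in it."""
    findings = [phrase for phrase in INJECTION_PHRASES if phrase in text.lower()]
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text, findings


log_excerpt = "Failed password for pwned from 203.0.113.7 port 22; notify admin@example.com"
clean_text, suspected = screen_and_redact(log_excerpt)
if suspected:
    # Route the input to a human analyst instead of sending it to the model.
    print("Possible prompt injection detected:", suspected)
print(clean_text)
```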
Mitigation Strategies

To address these risks, organizations should consider the following approaches:

  • Input validation: Implement strict input sanitization and validation for all prompts used in incident response systems.
  • Layered defense: Employ a multi-faceted approach combining input validation, anomaly detection, and output verification to protect against prompt injection and other AI-related vulnerabilities.
  • Human oversight: Maintain human review and approval for critical incident response decisions, using AI as a support tool rather than a replacement for expert judgment.
  • Regular auditing: Conduct frequent audits of AI models and prompts used in incident response to identify potential biases or inaccuracies.
  • Secure environment: For handling sensitive internal information, use controlled environments like Azure OpenAI or Vertex AI rather than public AI services.
  • Continuous training: Regularly update and fine-tune AI models with the latest threat intelligence and incident response best practices.

Conclusion

The art of prompt engineering in Incident Response is more than just a technical skill: it is a strategic capability that empowers IR teams to harness AI for faster, more accurate, and more consistent responses to cybersecurity threats. Through precision-crafted prompts and continuous refinement, prompt engineering can streamline workflows, improve decision-making, and ultimately enhance an organization's resilience against a wide range of threats.

As the field continues to evolve, mastering prompt engineering will be essential for building a responsive, efficient, and resilient IR landscape. By embracing this practice, IR professionals can make better use of AI tools, transforming incident response into a more proactive, agile, and data-driven discipline.

By Dimple Gajra

Top Maintenance Experts


Shai Almog

OSS Hacker, Developer Advocate and Entrepreneur,
Codename One

Software developer with ~30 years of professional experience in a multitude of platforms/languages. JavaOne rockstar/highly rated speaker, author, blogger and open source hacker. Shai has extensive experience in the full stack of backend, desktop and mobile. This includes going all the way into the internals of VM implementation, debuggers etc. Shai started working with Java in 96 (the first public beta) and later on moved to VM porting/authoring/internals and development tools. Shai is the co-founder of Codename One, an Open Source project allowing Java developers to build native applications for all mobile platforms in Java. He's the coauthor of the open source LWUIT project from Sun Microsystems and has developed/worked on countless other projects both open source and closed source. Shai is also a developer advocate at Lightrun.

The Latest Maintenance Topics

Infrastructure as Code (IaC) Beyond the Basics
IaC has matured beyond basic scripting to offer scalable, secure cloud ops with reusable modules, testing, policy-as-code, and built-in cost optimization.
May 16, 2025
by Neha Surendranath
· 858 Views
The Full-Stack Developer's Blind Spot: Why Data Cleansing Shouldn't Be an Afterthought
Full-stack developers often focus on clean code but neglect clean data, leading to performance issues, security vulnerabilities, and frustrated users.
May 16, 2025
by Farah Kim
· 705 Views
Data Quality: A Novel Perspective for 2025
New testing techniques, smarter anomaly detection, and multi-cloud strategies are improving data reliability. Advanced tools are revolutionizing data quality management.
May 16, 2025
by Srinivas Murri
· 608 Views
Optimize Deployment Pipelines for Speed, Security and Seamless Automation
Automated deployment strategies boost speed, security, and reliability, ensuring faster, risk-free releases with zero downtime.
May 2, 2025
by Bal Reddy Cherlapally
· 3,168 Views · 1 Like
How Platform Engineering Is Impacting Infrastructure Automation
Platform engineering is introducing reusable, standardized internal platforms that streamline workflows and reduce manual overhead.
May 1, 2025
by Mariusz Michalowski
· 1,664 Views
On-Call That Doesn’t Suck: A Guide for Data Engineers
Data pipelines don’t fail silently; they make a lot of noise. The question is, are you listening to the signal or drowning in noise?
April 29, 2025
by Tulika Bhatt
· 1,926 Views · 1 Like
How to Build the Right Infrastructure for AI in Your Private Cloud
Build scalable infrastructure with GPUs for AI workloads, manage data pipelines efficiently, and ensure security and compliance.
April 25, 2025
by Siva Kiran Nandipati
· 5,044 Views · 1 Like
CRDTs Explained: How Conflict-Free Replicated Data Types Work
Explore conflict-free replicated data types: data structures designed to ensure that data on different replicas will eventually converge into a consistent state.
April 24, 2025
by Bartłomiej Żyliński DZone Core
· 4,929 Views · 2 Likes
Securing Your Infrastructure and Services During the Distribution Phase
Scan images and manifests, and sign the artifacts to ensure integrity and trust while distributing and deploying your services.
April 21, 2025
by Siri Varma Vegiraju DZone Core
· 6,799 Views · 1 Like
Why I Built the Ultimate Text Comparison Tool (And Why You Should Try It)
Learn how my comprehensive text comparison tool combines exact, fuzzy, and phonetic matching to solve your messiest data reconciliation challenges in minutes.
April 17, 2025
by Mokhtar Ebrahim
· 3,687 Views · 3 Likes
Optimus Alpha Analyzes Team Data
Learn how Optimus Alpha crafts data-driven retrospective formats to boost agile value creation and bridge technical debt with team autonomy.
April 14, 2025
by Stefan Wolpers DZone Core
· 2,668 Views · 1 Like
Shifting Left: A Culture Change Plan for Early Bug Detection
Shift-left empowers developers to catch bugs early, easing QA overload, accelerating releases, cutting costs, and significantly raising overall software quality.
April 8, 2025
by Mukund Wagh
· 3,382 Views · 2 Likes
Why Clean Data Is the Foundation of Successful AI Systems
Poor data quality costs enterprises $406M annually. Learn in this article some key challenges and best practices for ensuring data quality in AI systems.
April 8, 2025
by Vaishali Mishra
· 3,712 Views · 1 Like
Implementing Infrastructure as Code (IaC) for Data Center Management
Learn the benefits of infrastructure as code (IaC) for data center management and the best strategies for implementing it.
April 7, 2025
by Zac Amos
· 3,412 Views · 2 Likes
Creating a Web Project: Refactoring
See how to approach refactoring as a strategic investment in your codebase. Learn best practices, when not to refactor, and how to use automated tools and metrics to guide your efforts.
April 1, 2025
by Filipp Shcherbanich DZone Core
· 3,092 Views · 4 Likes
The Art of Postmortem
Top tech companies have a meticulous post-mortem process for analyzing outages. In this article, we shed light on the art of writing a good post-mortem report.
March 28, 2025
by Aditya Visweswaran
· 3,167 Views · 3 Likes
The Future of DevOps
Learn about Infrastructure as Code (IaC) predictions for 2025, from AI-driven drift management to cost optimization, platform engineering, and multi-framework trends.
March 27, 2025
by Omry Hay
· 3,149 Views · 3 Likes
Ensuring Data Quality With Great Expectations and Databricks
Ensure data quality in pipelines with Great Expectations. Learn to integrate with Databricks, validate data, and automate checks for reliable datasets.
March 26, 2025
by Sairamakrishna BuchiReddy Karri
· 2,858 Views · 1 Like
Top Terraform and OpenTofu Tools to Use in 2025
Explore the top Terraform and OpenTofu tools for 2025 to simplify infrastructure management, improve collaboration, boost security, and optimize workflows.
March 17, 2025
by Alexander Sharov
· 3,330 Views · 3 Likes
Non-Project Backlog Management for Software Engineering Teams
This article examines implementation guidelines for managing non-project backlogs like technical debt, bugs, and incomplete documentation.
March 6, 2025
by Nikhil Kapoor
· 4,533 Views · 1 Like
