In my previous article, I took a closer look at the Java ExecutorService interface and its implementations, with some focus on the Fork/Join framework and the ThreadPerTaskExecutor. Today, I would like to take a step forward and check how well they behave when put under pressure. In short, I am going to run benchmarks - a lot of benchmarks. All the code below, and more, is available in a dedicated GitHub repository.

Logic Under Benchmark

The logic that forms the base for the benchmarks falls into two basic categories:

- Based on the classic stream
- Based on the Fork/Join approach

Classic Stream Logic

```java
public static Map<Ip, Integer> groupByIncomingIp(Stream<String> requests, LocalDateTime upperTimeBound, LocalDateTime lowerTimeBound) {
    return requests
            .map(line -> line.split(","))
            .filter(words -> words.length == 3)
            .map(words -> new Request(words[1], LocalDateTime.parse(words[2])))
            .filter(request -> request.timestamp().isBefore(upperTimeBound) && request.timestamp().isAfter(lowerTimeBound))
            .map(i -> new Ip(i.ip()))
            .collect(groupingBy(i -> i, summingInt(i -> 1)));
}
```

The purpose of this piece of code is to transform a list of strings, do some filtering and grouping, and return a map. Supplied strings are in the following format:

1,192.168.1.1,2023-10-29T17:33:33.647641574

Each line represents the event of an IP address trying to access a particular server. The output maps an IP address to the number of access attempts within a particular period, expressed by lower and upper time boundaries.

Fork/Join Logic

```java
@Override
public Map<Ip, Integer> compute() {
    if (data.size() >= THRESHOLD) {
        Map<Ip, Integer> output = new HashMap<>();
        ForkJoinTask
                .invokeAll(createSubTasks())
                .forEach(task -> task
                        .join()
                        .forEach((k, v) -> updateOutput(k, v, output))
                );
        return output;
    }
    return process();
}

private void updateOutput(Ip k, Integer v, Map<Ip, Integer> output) {
    Integer currentValue = output.get(k);
    if (currentValue == null) {
        output.put(k, v);
    } else {
        output.replace(k, currentValue + v);
    }
}

private List<ForkJoinDefinition> createSubTasks() {
    int size = data.size();
    int middle = size / 2;
    return List.of(
            new ForkJoinDefinition(new ArrayList<>(data.subList(0, middle)), now),
            new ForkJoinDefinition(new ArrayList<>(data.subList(middle, size)), now)
    );
}

private Map<Ip, Integer> process() {
    return groupByIncomingIp(data.stream(), upperTimeBound, lowerTimeBound);
}
```

The only impactful difference here is that I split the dataset into smaller batches until a certain threshold is reached (20 by default). After the splitting, I perform the computations. The computations themselves are the same as in the classic stream approach described above - I am using the groupByIncomingIp method.

JMH Setup

All the benchmarks are written with the Java Microbenchmark Harness (JMH for short), version 1.37. The benchmarks share the same setup: five warm-up iterations and twenty measurement iterations. There are two different modes: average time and throughput. In average time mode, JMH measures the average execution time of the code under benchmark, expressed in milliseconds per operation. In throughput mode, JMH measures the number of operations - full executions of the code under benchmark - per unit of time, here milliseconds, so the result is expressed in operations per millisecond.
In JMH syntax, this looks as follows:

```java
@Warmup(iterations = 5, time = 10, timeUnit = SECONDS)
@Measurement(iterations = 20, time = 10, timeUnit = SECONDS)
@BenchmarkMode({Mode.AverageTime, Mode.Throughput})
@OutputTimeUnit(MILLISECONDS)
@Fork(1)
@Threads(1)
```

Furthermore, each benchmark has its own unique State with Benchmark scope, containing all the data and variables needed by that particular benchmark.

Benchmark State

Classic Stream

The base benchmark state for the classic stream can be viewed below.

```java
@State(Scope.Benchmark)
public class BenchmarkState {

    @Param({"0"})
    public int size;

    public List<String> input;
    public ClassicDefinition definitions;

    public ForkJoinPool forkJoinPool_4;
    public ForkJoinPool forkJoinPool_8;
    public ForkJoinPool forkJoinPool_16;
    public ForkJoinPool forkJoinPool_32;

    private final LocalDateTime now = LocalDateTime.now();

    @Setup(Level.Trial)
    public void trialUp() {
        input = new TestDataGen(now).generate(size);
        definitions = new ClassicDefinition(now);
        System.out.println(input.size());
    }

    @Setup(Level.Iteration)
    public void up() {
        forkJoinPool_4 = new ForkJoinPool(4);
        forkJoinPool_8 = new ForkJoinPool(8);
        forkJoinPool_16 = new ForkJoinPool(16);
        forkJoinPool_32 = new ForkJoinPool(32);
    }

    @TearDown(Level.Iteration)
    public void down() {
        forkJoinPool_4.shutdown();
        forkJoinPool_8.shutdown();
        forkJoinPool_16.shutdown();
        forkJoinPool_32.shutdown();
    }
}
```

First, I set up all the variables needed to perform the benchmarks. Apart from the size parameter, which plays a special role here, the thread pools will be used only inside the benchmarks. The size parameter is quite an interesting JMH mechanism: it allows the parametrization of a variable used during the benchmark. You will see how I take advantage of this later, when we move on to running the benchmarks. For now, I use this parameter to generate the input dataset, which remains unchanged throughout the whole benchmark - to achieve better repeatability of results.

The second part is the up method, which works similarly to @BeforeEach from the JUnit library. It is triggered before each benchmark iteration and resets all the variables used in the benchmark, so every iteration starts with a clean state.

The last part is the down method, which works similarly to @AfterEach from the JUnit library. It is triggered after each benchmark iteration and shuts down all the thread pools used in that iteration - mostly to avoid possible memory leaks.

Fork/Join

The state for the Fork/Join version looks as below.

```java
@State(Scope.Benchmark)
public class ForkJoinState {

    @Param({"0"})
    public int size;

    public List<String> input;

    public ForkJoinPool forkJoinPool_4;
    public ForkJoinPool forkJoinPool_8;
    public ForkJoinPool forkJoinPool_16;
    public ForkJoinPool forkJoinPool_32;

    public final LocalDateTime now = LocalDateTime.now();

    @Setup(Level.Trial)
    public void trialUp() {
        input = new TestDataGen(now).generate(size);
        System.out.println(input.size());
    }

    @Setup(Level.Iteration)
    public void up() {
        forkJoinPool_4 = new ForkJoinPool(4);
        forkJoinPool_8 = new ForkJoinPool(8);
        forkJoinPool_16 = new ForkJoinPool(16);
        forkJoinPool_32 = new ForkJoinPool(32);
    }

    @TearDown(Level.Iteration)
    public void down() {
        forkJoinPool_4.shutdown();
        forkJoinPool_8.shutdown();
        forkJoinPool_16.shutdown();
        forkJoinPool_32.shutdown();
    }
}
```

There is no big difference between the setup for the classic stream and the one for Fork/Join.
The only difference comes from placing the definitions inside the benchmarks themselves, not in the state as in the classic approach. This change comes from how RecursiveTask works - task executions are memoized and stored - and thus could impact the overall benchmark results.

Benchmark Input

The basic input for the benchmarks is a list of strings in the following format:

1,192.168.1.1,2023-10-29T17:33:33.647641574

Or in a more generalized description:

{ordering-number},{ip-like-string},{timestamp}

There are five different input sizes: 100, 1,000, 10,000, 100,000, and 1,000,000. There is some deeper meaning behind these sizes, as I believe such a range can illustrate how well a solution scales and potentially expose performance bottlenecks. Additionally, the overall setup of the benchmark is very flexible, so adding a new size should not be difficult for anyone interested in doing so.

Benchmark Setup

Classic Stream

There is only a single class related to the classic stream benchmark; different sizes are handled at the State level.

```java
public class ClassicStreamBenchmark extends BaseBenchmarkConfig {

    @Benchmark
    public void bench_sequential(SingleStreamState state, Blackhole bh) {
        Map<String, Integer> map = state.definitions.sequentialStream(state.input);
        bh.consume(map);
    }

    @Benchmark
    public void bench_defaultParallelStream(SingleStreamState state, Blackhole bh) {
        Map<String, Integer> map = state.definitions.defaultParallelStream(state.input);
        bh.consume(map);
    }

    @Benchmark
    public void bench_parallelStreamWithCustomForkJoinPool_4(SingleStreamState state, Blackhole bh) {
        Map<String, Integer> map = state.definitions.parallelStreamWithCustomForkJoinPool(state.forkJoinPool_4, state.input);
        bh.consume(map);
    }

    @Benchmark
    public void bench_parallelStreamWithCustomForkJoinPool_8(SingleStreamState state, Blackhole bh) {
        Map<String, Integer> map = state.definitions.parallelStreamWithCustomForkJoinPool(state.forkJoinPool_8, state.input);
        bh.consume(map);
    }

    @Benchmark
    public void bench_parallelStreamWithCustomForkJoinPool_16(SingleStreamState state, Blackhole bh) {
        Map<String, Integer> map = state.definitions.parallelStreamWithCustomForkJoinPool(state.forkJoinPool_16, state.input);
        bh.consume(map);
    }

    @Benchmark
    public void bench_parallelStreamWithCustomForkJoinPool_32(SingleStreamState state, Blackhole bh) {
        Map<String, Integer> map = state.definitions.parallelStreamWithCustomForkJoinPool(state.forkJoinPool_32, state.input);
        bh.consume(map);
    }
}
```

There are six different benchmark setups of the same logic:

- bench_sequential: Simple benchmark with just a single sequential stream
- bench_defaultParallelStream: Benchmark with the default Java parallel stream via the parallelStream() method - in practice, the common ForkJoinPool with a parallelism of 19 (at least on my machine)
- bench_parallelStreamWithCustomForkJoinPool_4: Custom ForkJoinPool with parallelism level equal to 4
- bench_parallelStreamWithCustomForkJoinPool_8: Custom ForkJoinPool with parallelism level equal to 8
- bench_parallelStreamWithCustomForkJoinPool_16: Custom ForkJoinPool with parallelism level equal to 16
- bench_parallelStreamWithCustomForkJoinPool_32: Custom ForkJoinPool with parallelism level equal to 32

For the classic stream logic, I have 6 different setups and 5 different input sizes, resulting in a total of 30 unique benchmark combinations.
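The parallelStreamWithCustomForkJoinPool definition itself is not shown in this article. As a hedged illustration only, the usual idiom for binding a parallel stream to a custom pool - running the terminal operation inside a task submitted to that pool - might look roughly like this; the method signature and the time-bound fields are assumptions based on groupByIncomingIp, and the actual code in the repository may differ:

```java
// Sketch only: submitting the terminal operation to a custom ForkJoinPool makes
// that pool (instead of the common pool) execute the parallel stream stages.
public Map<Ip, Integer> parallelStreamWithCustomForkJoinPool(final ForkJoinPool pool,
                                                             final List<String> input) {
    try {
        return pool.submit(
                () -> groupByIncomingIp(input.parallelStream(), upperTimeBound, lowerTimeBound)
        ).get();
    } catch (InterruptedException | ExecutionException e) {
        throw new IllegalStateException("Parallel stream computation failed", e);
    }
}
```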
Fork/Join

```java
public class ForkJoinBenchmark extends BaseBenchmarkConfig {

    @Benchmark
    public void bench(ForkJoinState state, Blackhole bh) {
        Map<Ip, Integer> map = new ForkJoinDefinition(state.input, state.now).compute();
        bh.consume(map);
    }

    @Benchmark
    public void bench_customForkJoinPool_4(ForkJoinState state, Blackhole bh) {
        ForkJoinDefinition forkJoinDefinition = new ForkJoinDefinition(state.input, state.now);
        Map<Ip, Integer> map = state.forkJoinPool_4.invoke(forkJoinDefinition);
        bh.consume(map);
    }

    @Benchmark
    public void bench_customForkJoinPool_8(ForkJoinState state, Blackhole bh) {
        ForkJoinDefinition forkJoinDefinition = new ForkJoinDefinition(state.input, state.now);
        Map<Ip, Integer> map = state.forkJoinPool_8.invoke(forkJoinDefinition);
        bh.consume(map);
    }

    @Benchmark
    public void bench_customForkJoinPool_16(ForkJoinState state, Blackhole bh) {
        ForkJoinDefinition forkJoinDefinition = new ForkJoinDefinition(state.input, state.now);
        Map<Ip, Integer> map = state.forkJoinPool_16.invoke(forkJoinDefinition);
        bh.consume(map);
    }

    @Benchmark
    public void bench_customForkJoinPool_32(ForkJoinState state, Blackhole bh) {
        ForkJoinDefinition forkJoinDefinition = new ForkJoinDefinition(state.input, state.now);
        Map<Ip, Integer> map = state.forkJoinPool_32.invoke(forkJoinDefinition);
        bh.consume(map);
    }
}
```

There are five different benchmark setups of the same logic:

- bench: Baseline benchmark that calls compute() on the ForkJoinDefinition directly, without an explicitly configured pool
- bench_customForkJoinPool_4: Custom ForkJoinPool with parallelism level equal to 4
- bench_customForkJoinPool_8: Custom ForkJoinPool with parallelism level equal to 8
- bench_customForkJoinPool_16: Custom ForkJoinPool with parallelism level equal to 16
- bench_customForkJoinPool_32: Custom ForkJoinPool with parallelism level equal to 32

For the Fork/Join logic, I have 5 different setups and 5 different input sizes, resulting in a total of 25 unique benchmark combinations.

What is more, in both cases I am also using the Blackhole concept from JMH to "cheat" the compiler's dead-code elimination. There is more about Blackholes and their use cases here.

Benchmark Environment

Machine 1

The tests were conducted on my Dell XPS with the following parameters:

- OS: Ubuntu 20.04.6 LTS
- CPU: i9-12900HK × 20
- Memory: 64 GB
- JVM: OpenJDK version "21" 2023-09-19, OpenJDK Runtime Environment (build 21+35-2513), OpenJDK 64-Bit Server VM (build 21+35-2513, mixed mode, sharing)

Machine 2

The tests were conducted on my Lenovo Y700 with the following parameters:

- OS: Ubuntu 20.04.6 LTS
- CPU: i7-6700HQ × 8
- Memory: 32 GB
- JVM: OpenJDK version "21" 2023-09-19, OpenJDK Runtime Environment (build 21+35-2513), OpenJDK 64-Bit Server VM (build 21+35-2513, mixed mode, sharing)

On both machines, all side or insignificant applications were closed. I tried to keep the runtime system as clean as possible so as not to generate any unwanted performance overhead. However, on a pure Ubuntu server or when run inside a container, the overall performance may differ.
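The article mentions taking advantage of the size @Param when running the benchmarks, but the runner invocation is not shown here. A minimal sketch of how such a parametrized run could be launched via the JMH Runner API follows; the benchmark class names come from the code above, while the runner class itself is an assumption rather than the author's actual setup:

```java
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunner {

    public static void main(String[] args) throws RunnerException {
        // Each value passed to the "size" parameter produces a separate benchmark run,
        // which is how one set of benchmark methods covers all five input sizes.
        Options options = new OptionsBuilder()
                .include(ClassicStreamBenchmark.class.getSimpleName())
                .include(ForkJoinBenchmark.class.getSimpleName())
                .param("size", "100", "1000", "10000", "100000", "1000000")
                .build();
        new Runner(options).run();
    }
}
```

The same effect can be achieved from the command line with JMH's -p option, e.g., -p size=100,1000,10000,100000,1000000.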
Benchmark Report

The results of running the benchmarks are stored in .csv files in the reports directory of the GitHub repository. Furthermore, to ease the download of reports, there is a separate .zip file named reports.zip that contains all the .csv files with data.

Report directories are structured on a per-size basis, with three special reports covering all input sizes:

- report_classic: All input sizes for the classic stream
- report_forkjoin: All input sizes for the Fork/Join approach
- report_whole: All input sizes for both classic and Fork/Join

Each report directory contains 3 separate files:

- averagetime.csv: Results for average time mode benchmarks
- throughput.csv: Results for throughput mode benchmarks
- total.csv: Combined results for both modes

For these reports, I have two formats: averagetime.csv and throughput.csv share one format, and total.csv has a separate one. Let's call them the modes and total formats.

The modes report contains eight columns:

- Label: Name of the benchmark
- Input Size: Benchmark input size
- Threads: Number of threads used in the benchmark, from the set 1, 4, 7, 8, 16, 19, 32
- Mode: Benchmark mode, either average time or throughput
- Cnt: The number of benchmark iterations; should always be equal to 20
- Score: Actual result of the benchmark
- Score Mean Error: Benchmark measurement error
- Units: Units of the benchmark, either ms/op (for average time) or ops/ms (for throughput)

The total report contains 10 columns:

- Label: Name of the benchmark
- Input Size: Benchmark input size
- Threads: Number of threads used in the benchmark, from the set 1, 4, 7, 8, 16, 19, 32
- Cnt: The number of benchmark iterations; should always be equal to 20
- AvgTimeScore: Actual result of the benchmark in average time mode
- AvgTimeMeanError: Benchmark measurement error for average time mode
- AvgUnits: Units of the benchmark for average time mode, in ms/op
- ThroughputScore: Actual result of the benchmark in throughput mode
- ThroughputMeanError: Benchmark measurement error for throughput mode
- ThroughputUnits: Units of the benchmark for throughput mode, in ops/ms

Results Analysis

Assumptions

Baseline

I will present general results and insights based on the size of 10,000, so I will be using the .csv files from the report_10000 directory. There are two main reasons behind choosing this particular data size: the execution time is high enough to show differences between the setups, and data sizes 100 and 1,000 are, in my opinion, too small to expose performance bottlenecks. Thus, I think an in-depth analysis of this particular data size is the most impactful. Of course, other sizes will also get a results overview, but it will not be as thorough as this one unless I encounter some anomalies in comparison to the behavior for size 10,000.

A Word on the Fork/Join Native Approach

With the current code under benchmark, there will be performance overhead associated with the Fork/Join benchmarks. As the Fork/Join benchmark logic heavily relies on splitting the input dataset, there must be a moment when all of the results are combined into a single cohesive output. This step is not present in the other benchmarks, so correctly understanding its impact on overall performance is crucial. Please keep this in mind.

Analysis

Machine 1 (20 Cores)

As you can see above, the best overall result for an input volume of 10 thousand on Machine 1 belongs to the versions with defaultParallelStream. For the classic-stream-based benchmarks, bench_defaultParallelStream returns by far the best result. Even when we take a possible measurement error into consideration, it still comes out on top. The setups for ForkJoinPool with parallelism 32 and 16 return worse results. On one hand, this is surprising - for parallelism 32, I would expect a better score than for the default pool (parallelism 19).
However, parallelism 16 yields worse results than both parallelism 19 and 32. With 20 CPU threads on Machine 1, parallelism 32 is not enough to show the performance degradation caused by an overabundance of threads. You will, however, be able to notice such behavior on Machine 2. I would assume that to show such behavior on Machine 1, the parallelism would have to be set to 64 or more.

What is curious here is that the pattern of bench_defaultParallelStream coming out on top does not seem to hold for the higher input sizes of 100 thousand and one million. There, the best performance belongs to bench_parallelStreamWithCustomForkJoinPool_16, which may indicate that, in the end, a reasonably smaller parallelism may be a good idea.

The Fork/Join-based implementation is noticeably slower than the default parallel stream implementation, with around 10% worse performance. This pattern also occurs for other sizes. It confirms my assumption from above that joining the smaller parts of a split dataset has a noticeable impact.

Of course, the worst score belongs to the single-threaded approach, and it is around 5 times slower than the best result. Such a situation is expected, as the single-threaded benchmark is a kind of baseline for me. I want to check how far its execution time can be pushed, and a 5-times-better average execution time in the best-case scenario seems like a good score.

As for the score mean error, it is very, very small. In the worst case (the highest error), it is within 1.5% of its respective score (the result for ClassicStreamBenchmark.bench_parallelStreamWithCustomForkJoinPool_4). In other cases, it varies from 0.1% to 0.7% of the overall score. There seems to be no difference in result positions for sizes above 10 thousand.

Machine 2 (8 Cores)

As in the case of Machine 1, the first place also belongs to bench_defaultParallelStream. Again, even when we consider a possible measurement error, it still comes out on top - nothing especially interesting. What is interesting, however, is that the pattern of the first 3 positions for Machine 2 changes quite a lot for higher input sizes. For inputs 100 to 10,000, we have somewhat similar behavior, with bench_defaultParallelStream occupying the first position and bench_parallelStreamWithCustomForkJoinPool_8 following shortly after. On the other hand, for inputs 100,000 and 1,000,000, the first position belongs to bench_parallelStreamWithCustomForkJoinPool_8, followed by bench_parallelStreamWithCustomForkJoinPool_32, while bench_defaultParallelStream moves down to 4th and 3rd positions.

Another curious thing about Machine 2 is that for smaller input sizes, parallelism 32 is quite far from the top. Such performance degradation may be caused by the overabundance of threads compared to the 8 CPU threads available on the machine. Nevertheless, for inputs 100,000 and 1,000,000, the ForkJoinPool with parallelism 32 takes the second position, which may indicate that over longer time spans such an overabundance of threads is not a problem. Other aspects that are very similar to the behavior of Machine 1 are skipped here and mentioned below.

Common Points

There are a few observations valid for both machines:

My ForkJoinNative ("naive") benchmarks yield results that are noticeably worse - around 10% on both machines - than those delivered by the default versions of a parallel stream, or even the ones with a custom ForkJoinPool. Of course, one of the reasons is that they are not optimized in any way; there are probably some low-hanging performance fruits here.
Thus, I strongly recommend getting familiar with the Fork/Join framework before moving its implementations to production.

The time difference between positions one to three is very, very small - less than a millisecond. Thus, it may be hard to achieve any type of repeatability for these benchmarks; with such a small difference, it is easy for the results distribution to differ between benchmark runs.

The mean error of the scores is also very, very small - up to 2% of the overall score in the worse cases, and mostly less than 1%. Such a low error may indicate two things. First, the benchmarks are reliable, because the results are concentrated around some point; if there were anomalies along the way, the error would be higher. Second, JMH is good at making measurements.

There is no significant difference in results between throughput and average time modes. If a benchmark performed well in average time mode, it also performed well in throughput mode.

Above you can see all the differences and similarities I found in the report files. If you find anything else that seems interesting, do not hesitate to mention it in the comments section below.

Summary

Before we finally part ways for today, I would like to mention one more very important thing: JAVA IS NOT SLOW. Processing the list with one million elements, including all potential JMH overhead, on a single thread takes 560 milliseconds (Machine 1) and 1142 milliseconds (Machine 2). There are no special optimizations or magic included, just the pure default JVM. The best total time for processing one million elements on Machine 1 was 88 milliseconds, for ClassicStreamBenchmark.bench_parallelStreamWithCustomForkJoinPool_16. In the case of Machine 2, it was 321 milliseconds, for ClassicStreamBenchmark.bench_parallelStreamWithCustomForkJoinPool_8. Although both results may not be as good as C/C++-based solutions, the relative simplicity and descriptiveness of the approach make it very interesting, in my opinion. Overall, it is quite a nice addition to Java's one billion rows challenge.

All the reports and benchmark code are in the GitHub repository (linked in the introduction of this article), so you can easily verify my results and compare them to the benchmark behavior on your machine. Furthermore, to ease the download of reports, there is a separate .zip file named reports.zip that contains all the .csv files with data. And remember: Java is not slow.

Thank you for your time.

Review by: Krzysztof Ciesielski, Łukasz Rola
Stream processing has existed for decades. However, it really took off in the 2020s, thanks to the adoption of open-source frameworks like Apache Kafka and Apache Flink. Fully managed cloud services make it easy to configure and deploy stream processing in a cloud-native way - even without the need to write any code. This blog post explores the past, present, and future of stream processing. The discussion includes various technologies and cloud services, low-code/no-code trade-offs, outlooks on the support of machine learning and GenAI, streaming databases, and the integration between data streaming and data lakes with Apache Iceberg.

In December 2023, the research company Forrester confirmed that data streaming is a new software category, and not just yet another integration or data platform, when it published "The Forrester Wave™: Streaming Data Platforms, Q4 2023" (get free access to the report here). The leaders are Microsoft, Google, and Confluent, followed by Oracle, Amazon, Cloudera, and a few others. A great time, then, to review the past, present, and future of stream processing as a key component of a data streaming architecture.

The Past of Stream Processing: The Move From Batch to Real-Time

The evolution of stream processing began as industries sought more timely insights from their data. Initially, batch processing was the norm: data was collected over a period, stored, and processed at intervals. This method, while effective for historical analysis, proved inefficient for real-time decision-making. In parallel to batch processing, message queues were created to provide real-time communication for transactional data. Message brokers like IBM MQ or TIBCO EMS were a common way to decouple applications: applications send and receive data in an event-driven architecture without worrying about whether the recipient is ready, how to handle backpressure, and so on. The stream processing journey had begun.

Stream Processing Is a Journey Over Decades...

... and we are still at a very early stage in most enterprises. Here is an excellent timeline from TimePlus about the journey of stream processing across open-source frameworks, proprietary platforms, and SaaS cloud services:

Source: TimePlus

The stream processing journey started decades ago with research and the first purpose-built proprietary products for specific use cases like stock trading. Open-source stream processing frameworks emerged during the big data and Hadoop era to make at least the ingestion layer a bit more real-time. Today, most enterprises are at least starting to understand the value of stream processing for analytical and transactional use cases across industries. The cloud is a fundamental change, as you can start streaming and processing data with the click of a button, leveraging fully managed SaaS and simple UIs (if you don't want to operate infrastructure or write low-level source code).

TIBCO StreamBase, Software AG Apama, IBM Streams

The advent of message queue technologies like IBM MQ and TIBCO EMS moved many critical applications to real-time message brokers. Real-time messaging enables the consumption of data in real time to store it in a database, a mainframe, or an application for further processing. However, only the true stream processing capabilities included in tools like TIBCO StreamBase, Software AG Apama, or IBM (InfoSphere) Streams marked a significant shift towards real-time data processing. These products enable businesses to react to information as it arrives by processing and correlating the data in motion.
Visual coding in tools like StreamBase or Apama represented an innovative approach to developing stream processing solutions. These tools provide a graphical interface that allows developers and analysts to design, build, and test applications by connecting components and logic blocks visually rather than writing code manually. Under the hood, the code generation was based on a streaming SQL language. Here is a screenshot of the TIBCO StreamBase IDE for visual drag-and-drop of streaming pipelines:

TIBCO StreamBase IDE

Some drawbacks of these early stream processing solutions include high cost, vendor lock-in, no flexibility regarding tools or APIs, and missing communities. These platforms are monolithic and were built long before cloud-native elasticity and scalability became a requirement in most RFIs and RFPs when evaluating vendors.

Open Source Event Streaming With Apache Kafka

The truly significant change for stream processing came with the introduction of Apache Kafka, a distributed streaming platform that allows for high-throughput, fault-tolerant handling of real-time data feeds. Kafka, alongside other technologies like Apache Flink, revolutionized the landscape by providing the tools necessary to move seamlessly from batch to real-time stream processing. The adoption of open-source technologies changed all industries: openness, flexibility, and community-driven development made it easier to influence features and accelerate innovation. Over 100,000 organizations use Apache Kafka. The massive adoption came from a unique combination of capabilities - messaging, storage, data integration, and stream processing, all in one scalable and distributed infrastructure. Various open-source stream processing engines emerged: Kafka Streams was added to the Apache Kafka project, and other examples include Apache Storm, Spark Streaming, and Apache Flink.

The Present of Stream Processing: Architectural Evolution and Mass Adoption

The fundamental shift to processing data in motion has enabled the development of data products and data mesh. Decentralizing data ownership and management with domain-driven design and technology-independent microservices promotes a more collaborative and flexible approach to data architecture. Each business unit can choose its own technology, API, cloud service, and communication paradigm, whether real-time, batch, or request-response.

From Lambda Architecture to Kappa Architecture

Today, stream processing is at the heart of modern data architecture, thanks in part to the emergence of the Kappa architecture. This model simplifies the traditional Lambda architecture by using a single stream processing system to handle both real-time and historical data analysis, reducing complexity and increasing efficiency.

Lambda architecture with separate real-time and batch layers:

Kappa architecture with a single pipeline for real-time and batch processing:

For more details about the pros and cons of Kappa vs. Lambda, check out my article "Kappa Architecture is Mainstream Replacing Lambda". It explores case studies from Uber, Twitter, Disney, and Shopify.

Kafka Streams and Apache Flink Become Mainstream

Apache Kafka has become synonymous with building scalable and fault-tolerant streaming data pipelines. Because Kafka enables true decoupling of domains and applications, it is integral to microservices and data mesh architectures. Plenty of stream processing frameworks, products, and cloud services have emerged in the past years.
This includes open-source frameworks like Kafka Streams, Apache Storm, Samza, Flume, Apex, Flink, and Spark Streaming, and cloud services like Amazon Kinesis, Google Cloud Dataflow, and Azure Stream Analytics. The "Data Streaming Landscape 2024" gives an overview of relevant technologies and vendors. Apache Flink seems to be becoming the de facto standard for many enterprises (and vendors). Its adoption curve looks like Kafka's did four years ago:

Source: Confluent

This does not mean other frameworks and solutions are bad. For instance, Kafka Streams is complementary to Apache Flink, as it suits different use cases. No matter which technology enterprises choose, the mass adoption of stream processing is in progress right now. This includes modernizing existing batch processes AND building innovative new business models that only work in real time. As a concrete example, think about ride-hailing apps like Uber, Lyft, FREENOW, and Grab. They are only possible because events are processed and correlated in real time. Otherwise, everyone would still prefer a traditional taxi.

Stateless and Stateful Stream Processing

In data streaming, stateless and stateful stream processing are two approaches that define how data is handled and processed over time. The choice between stateless and stateful processing depends on the specific requirements of the application, including the nature of the data, the complexity of the processing needed, and the performance and scalability requirements.

Stateless Stream Processing

Stateless stream processing refers to handling each data point or event independently of the others. In this model, the processing of an event does not depend on the outcomes of previous events and does not require keeping track of state between events. Each event is processed based on the information it contains, without the need for historical context or future data points. This approach is simpler and can be highly efficient for tasks that don't require knowledge beyond the current event being processed. The implementation could be a stream processor (like Kafka Streams or Flink), functionality in a connector (like Kafka Connect Single Message Transforms), or WebAssembly (WASM) embedded into a streaming platform.

Stateful Stream Processing

Stateful stream processing involves keeping track of information (state) across multiple events to perform computations that depend on data beyond the current event. This model allows for more complex operations like windowing (aggregating events over a specific time frame), joining streams of data based on keys, and tracking sequences of events or patterns over time. Stateful processing is essential for scenarios where the outcome depends on accumulated knowledge or trends derived from a series of data points, not just on a single input. The implementation is much more complex and challenging than stateless stream processing, and a dedicated stream processing implementation is required. Dedicated distributed engines (like Apache Flink) handle stateful computations, memory usage, and scalability better than Kafka-native tools like Kafka Streams or KSQL (because the latter are bound to Kafka topics).
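Not part of the original article, but to make the distinction tangible, here is a minimal Kafka Streams sketch under assumed topic names and types (a hypothetical payments stream keyed by customer ID): the first pipeline is stateless, while the windowed count in the second pipeline forces the engine to keep a state store.

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

import java.time.Duration;
import java.util.Properties;

public class StatelessVsStatefulDemo {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        KStream<String, Long> payments = builder.stream(
                "payments", Consumed.with(Serdes.String(), Serdes.Long()));

        // Stateless: each event is handled on its own; filtering and reshaping
        // need no memory of previous events.
        payments
                .filter((customerId, amount) -> amount != null && amount > 0)
                .mapValues(amount -> amount / 100) // e.g., cents to whole units
                .to("validated-payments", Produced.with(Serdes.String(), Serdes.Long()));

        // Stateful: counting events per customer in 5-minute windows requires the
        // engine to maintain a windowed state store across events.
        payments
                .groupByKey(Grouped.with(Serdes.String(), Serdes.Long()))
                .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
                .count()
                .toStream((window, count) -> window.key() + "@" + window.window().startTime())
                .to("payments-per-customer-5m", Produced.with(Serdes.String(), Serdes.Long()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stateless-vs-stateful-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}
```

The same contrast applies to Flink or any other engine: the stateless branch could run almost anywhere, while the stateful branch needs an engine that manages state, memory, and fault tolerance for you.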
Low Code, No Code, AND a Lot of Code!

No-code and low-code tools are software platforms that enable users to develop applications quickly and with minimal coding knowledge. These tools provide graphical user interfaces with drag-and-drop capabilities, allowing users to assemble and configure applications visually rather than writing extensive lines of code.

Common features and benefits of visual coding:

- Rapid development: Both types of platforms significantly reduce development time, enabling faster delivery of applications.
- User-friendly interface: The graphical interface and drag-and-drop functionality make it easy for users to design, build, and iterate on applications.
- Cost reduction: By enabling quicker development with fewer resources, these platforms can lower the cost of software creation and maintenance.
- Accessibility: They make application development accessible to a broader range of people, reducing the dependency on skilled developers for every project.

So far, the theory.

Disadvantages of Visual Coding Tools

StreamBase, Apama, et al. actually had great visual coding offerings. However, no-code/low-code tools have many drawbacks and disadvantages, too:

- Limited customization and flexibility: While these platforms can speed up development for standard applications, they may lack the flexibility needed for highly customized solutions. Developers might find it challenging to implement specific functionality that isn't supported out of the box.
- Dependency on vendors: Using no-code/low-code platforms often means relying on third-party vendors for the platform's stability, updates, and security. This dependency can lead to problems if the vendor cannot maintain the platform or goes out of business. And often the platform team becomes the bottleneck for implementing new business or integration logic.
- Performance concerns: Applications built with no-code/low-code platforms may not be as optimized as those developed with traditional coding, potentially leading to lower performance or inefficiencies, especially for complex applications.
- Scalability issues: As businesses grow, applications might need to scale up to support increased loads. No-code/low-code platforms might not always support this level of scalability, or might require significant workarounds, affecting performance and user experience.
- Over-reliance on non-technical users: While empowering citizen developers is a key advantage of these platforms, it can also lead to governance challenges. Without proper oversight, non-technical users might create inefficient workflows or data structures, leading to technical debt and maintenance issues.
- Cost over time: Initially, no-code/low-code platforms can reduce development costs. However, as applications grow and evolve, the ongoing subscription costs or fees for additional features and scalability can become significant.

Flexibility Is King: Stream Processing for Everyone!

Microservices, domain-driven design, data mesh... all these modern design approaches taught us to provide flexible enterprise architectures. Each business unit and persona should be able to choose its own technology, API, or SaaS, no matter whether it does real-time, near-real-time, batch, or request-response communication. Apache Kafka provides this true decoupling out of the box. Therefore, low-code or no-code tools are an option, but a data scientist, data engineer, software developer, or citizen integrator can just as well choose their own technology for stream processing. The past, present, and future of stream processing show different frameworks, visual coding tools, and even applied generative AI. One solution does NOT replace the alternatives but complements them.

The Future of Stream Processing: Serverless SaaS, GenAI, and Streaming Databases

Stream processing is set to grow exponentially in the future, thanks to advancements in cloud computing, SaaS, and AI.
Let's explore the future of stream processing and look at the expected short-, mid-, and long-term developments.

SHORT TERM: Fully Managed Serverless SaaS for Stream Processing

The cloud's scalability and flexibility offer an ideal environment for stream processing applications, reducing the overhead and resources required compared to on-premise solutions. As SaaS models continue to evolve, stream processing capabilities will become accessible to a broader range of businesses, democratizing real-time data analytics. For instance, look at the serverless Flink Actions in Confluent Cloud. You can configure and deploy stream processing for use cases like deduplication or masking without any code:

Source: Confluent

MIDTERM: Automated Tooling and the Help of GenAI

Integrating AI and machine learning with stream processing will enable more sophisticated predictive analytics. This opens new frontiers for automated decision-making and intelligent applications that continuously process incoming event streams. The full potential of embedding AI into stream processing still has to be learned and implemented in the upcoming years. For instance, automated data profiling is one task that GenAI can support significantly: software tools analyze and understand the quality, structure, and content of a dataset without manual intervention as the events flow through the data pipeline in real time. This process typically involves examining the data to identify patterns, anomalies, missing values, and inconsistencies - a perfect fit for stream processing! Automated data profiling in the stream processor can provide insights into data types, frequency distributions, relationships between columns, and other metadata information crucial for data quality assessment, governance, and preparation for further analysis or processing.

MIDTERM: Streaming Storage and Analytics With Apache Iceberg

Apache Iceberg is an open-source table format for huge analytic datasets that provides powerful capabilities for managing large-scale data in data lakes. Its integration with streaming data sources like Apache Kafka and analytics platforms such as Snowflake, Starburst, Dremio, AWS Athena, or Databricks can significantly enhance data management and analytics workflows.

Integration between streaming data from Kafka and analytics on Databricks or Snowflake using Apache Iceberg

Supporting the Apache Iceberg table format might be a crucial strategic move by streaming and analytics frameworks, vendors, and cloud services. Here are some key benefits from the enterprise architecture perspective:

- Unified batch and stream processing: Iceberg tables can serve as a bridge between streaming data ingestion from Kafka and downstream analytic processing. By treating streaming data as an extension of a batch-based table, Iceberg enables a seamless transition from real-time to batch analytics, allowing organizations to analyze data with minimal latency.
- Schema evolution: Iceberg supports schema evolution without breaking downstream systems. This is useful when dealing with streaming data from Kafka, where the schema might evolve. Consumers can continue reading data using the schema they understand, ensuring compatibility and reducing the need for data pipeline modifications.
- Time travel and snapshot isolation: Iceberg's time travel feature allows analytics on data as it looked at any point in time, providing snapshot isolation for consistent reads. This is crucial for reproducible reporting and debugging, especially when dealing with continuously updating streaming data from Kafka.
- Cross-platform compatibility: Iceberg provides a unified data layer accessible by different compute engines, including those used by Databricks and Snowflake. This enables organizations to maintain a single copy of their data that is queryable across different platforms, facilitating a multi-tool analytics ecosystem without data silos.

LONG TERM: Transactional + Analytics = Streaming Database?

Streaming databases, like RisingWave or Materialize, are designed to handle real-time data processing and analytics. They offer a way to manage and query data that is continuously generated from sources like IoT devices, online transactions, and application logs. Traditional databases are optimized for static data stored on disk; streaming databases, by contrast, are built to process and analyze data in motion, providing insights almost instantaneously as the data flows through the system. Streaming databases offer the ability to perform complex queries and analytics on streaming data, further empowering organizations to harness real-time insights. The ongoing innovation in streaming databases will probably lead to more advanced, efficient, and user-friendly solutions, facilitating broader adoption and more creative applications of stream processing technologies. Having said this, we are still in the very early stages. It is not yet clear when you really need a streaming database instead of a mature and scalable stream processor like Apache Flink. The future will show us - competition is great for innovation.

The Future of Stream Processing Is Open Source and Cloud

The journey from batch to real-time processing has transformed how businesses interact with their data. The continued evolution couples technologies like Apache Kafka, Kafka Streams, and Apache Flink with the growth of cloud computing and SaaS. Stream processing will redefine the future of data analytics and decision-making. As we look ahead, the possibilities for stream processing are boundless, promising more agile, intelligent, and real-time insights into the ever-increasing streams of data.

If you want to learn more, listen to the on-demand webinar about the past, present, and future of stream processing with me, joined by two streaming industry veterans: Richard Tibbets (founder of StreamBase) and Michael Benjamin (TimePlus). I had the pleasure of working with them for a few years at TIBCO, where we deployed StreamBase at many financial services companies for stock trading and similar use cases.

What does your stream processing journey look like? In which decade did you join? Or are you just getting started with the latest open-source frameworks or cloud services? Let's connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.
Advanced SQL is an indispensable tool for retrieving, analyzing, and manipulating substantial datasets in a structured and efficient manner. It is extensively utilized in data analysis and business intelligence, as well as in various domains such as software development, finance, and marketing. Mastering advanced SQL can empower you to:

- Efficiently retrieve and analyze large datasets from databases.
- Create intricate reports and visualizations to derive meaningful insights from your data.
- Write optimized queries to enhance the performance of your database.
- Utilize advanced features such as window functions, common table expressions, and recursive queries.
- Understand and fine-tune the performance of your database.
- Explore, analyze, and derive insights from data more effectively.
- Provide data-driven insights and make decisions based on solid evidence.

In today's data-driven landscape, the ability to handle and interpret big data is increasingly vital. Proficiency in advanced SQL can make you a valuable asset to any organization that manages substantial amounts of data. Below are some examples of advanced SQL queries that illustrate the utilization of complex and powerful SQL features.

Using Subqueries in the SELECT Clause

```sql
SELECT customers.name,
       (SELECT SUM(amount)
        FROM orders
        WHERE orders.customer_id = customers.id) AS total_spent
FROM customers
ORDER BY total_spent DESC;
```

This query employs a subquery in the SELECT clause to compute the total amount spent by each customer, returning a list of customers along with their total spending, ordered in descending order.

Using the WITH Clause for Common Table Expressions (CTEs)

```sql
WITH top_customers AS (
    SELECT customer_id, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer_id
    ORDER BY total_spent DESC
    LIMIT 10
),
customer_info AS (
    SELECT id, name, email
    FROM customers
)
SELECT customer_info.name, customer_info.email, top_customers.total_spent
FROM top_customers
JOIN customer_info ON top_customers.customer_id = customer_info.id;
```

This query uses the WITH clause to define two CTEs, "top_customers" and "customer_info", which simplifies and modularizes the query. The first CTE identifies the top 10 customers based on their total spending, and the second CTE retrieves customer information. The final result is obtained by joining these two CTEs.

Using Window Functions to Calculate Running Totals

```sql
SELECT name,
       amount,
       SUM(amount) OVER (PARTITION BY name ORDER BY date) AS running_total
FROM transactions
ORDER BY name, date;
```

This query utilizes a window function, `SUM(amount) OVER (PARTITION BY name ORDER BY date)`, to calculate the running total of transactions for each name. It returns all transactions along with the running total for each name, ordered by name and date.

Using a Self-Join

```sql
SELECT e1.name AS employee,
       e2.name AS manager
FROM employees e1
JOIN employees e2 ON e1.manager_id = e2.id;
```

This query employs a self-join to link a table to itself, illustrating the relationship between employees and their managers. It returns a list of all employees and their corresponding managers.

Using JOIN, GROUP BY, and HAVING

```sql
SELECT order_items.product_id,
       SUM(order_items.quantity) AS product_sold,
       products.name
FROM orders
JOIN order_items ON orders.id = order_items.order_id
JOIN products ON products.id = order_items.product_id
GROUP BY order_items.product_id, products.name
HAVING SUM(order_items.quantity) > 100;
```

This query uses JOIN to combine the orders and order_items tables on the order ID column, and joins with the products table on the product ID column.
It then uses the GROUP BY clause to group results by product ID and the HAVING clause to filter products with more than 100 units sold. The SELECT clause lists the product ID, total quantity sold, and product name.

Using COUNT() and GROUP BY

```sql
SELECT department,
       COUNT(employee_id) AS total_employees
FROM employees
GROUP BY department
ORDER BY total_employees DESC;
```

This query uses the COUNT() function to tally the number of employees in each department and the GROUP BY clause to group results by department. The SELECT clause lists the department name and the total number of employees, ordered by total employees in descending order.

Using UNION and ORDER BY

```sql
(SELECT id, name, 'customer' AS type FROM customers)
UNION
(SELECT id, name, 'employee' AS type FROM employees)
ORDER BY name;
```

This query uses the UNION operator to combine the results of two separate SELECT statements - one for customers and one for employees - and orders the final result set by name. The UNION operator removes duplicates if present.

Recursive Queries

A recursive query employs a self-referencing mechanism to perform tasks such as traversing a hierarchical data structure like a tree or graph. Example:

```sql
WITH RECURSIVE ancestors (id, parent_id, name) AS (
    -- Anchor query to select the starting node
    SELECT id, parent_id, name
    FROM nodes
    WHERE id = 5

    UNION

    -- Recursive query to select the parent of each node
    SELECT nodes.id, nodes.parent_id, nodes.name
    FROM nodes
    JOIN ancestors ON nodes.id = ancestors.parent_id
)
SELECT * FROM ancestors;
```

This query uses a CTE called "ancestors" to define the recursive query with the columns id, parent_id, and name. The anchor query selects the starting node (id = 5), and the recursive part selects each node's parent by joining nodes with the "ancestors" CTE on the parent_id column. This process continues until the root of the tree is reached or the maximum recursion level is attained. The final query retrieves all identified ancestors.

While recursive queries are powerful, they can be resource-intensive; therefore, they should be used judiciously to avoid performance issues. Ensure proper recursion termination and consider the maximum recursion depth permitted by your DBMS. Not all SQL implementations support recursion, but major RDBMS systems such as PostgreSQL, Oracle, SQL Server, and SQLite do support recursive CTEs (PostgreSQL and SQLite use the WITH RECURSIVE keyword; Oracle and SQL Server accept a plain recursive WITH).

These examples showcase just a few of SQL's powerful capabilities and the diverse types of queries you can construct. The specific details of the queries will depend on your database structure and the information you seek to retrieve, but these examples should provide a foundational understanding of what is achievable with advanced SQL.
When I think about technical debt, I still remember the first application I created that made me realize the consequences of an unsuitable architecture. It happened back in the late 1990s when I was first getting started as a consultant. The client had requested the use of the Lotus Notes platform to build a procurement system for their customers. Using the Lotus Notes client and a custom application, end-users could make requests that would be tracked by the application and fulfilled by the product owner's team.

In theory, it was a really cool idea - especially since web-developed applications were not prevalent and everyone used Lotus Notes on a daily basis. The core problem is that the data was very relational in design - and Lotus Notes was not a relational database. The solution's design required schema management within every Lotus Notes document and leaned on a series of multi-value fields to simulate the relationships between data attributes. It was a mess.

A great deal of logic in the Lotus Notes application would not have been required if a better platform had been recommended. The source code was complicated to support. Enhancements to the data structure resulted in major refactoring of the underlying code - not to mention running server-based jobs to convert the existing data. Don't get me started on the effort behind report creation. Since I was early in my career, I was focused on providing the solution the client wanted over trying to offer a better solution. This was certainly a lesson I learned early in my career, but in the years since that project, I've come to realize that the consequence of architectural technical debt is an unfortunate reality we all face. Let's explore the concept of architectural tech debt a little more at a macro level.

Architectural Tech Debt (ATD)

The Architectural Technical Debt (ATD) Library at Carnegie Mellon University provides the following definition of ATD:

"Architectural technical debt is a design or construction approach that's expedient in the short term, but that creates a technical context in which the same work requires architectural rework and costs more to do later than it would cost to do now (including increased cost over time)."

In "Quick Answer: How to Manage Architecture Technical Debt" (published 09/22/2023), Gartner Group defines ATD as follows:

"Architecture technical debt is that type of technical debt that is caused by architectural drift, suboptimal architectural decisions, violations of defined target product architecture and established industry architectural best practices, and architecture trade-offs made for faster software delivery."

In both cases, benefits that often yield short-term celebrations can be met with long-term challenges. This is similar to my Lotus Notes example mentioned in the introduction. To further complicate matters, tooling to help identify and manage tech debt for software architecture has been missing compared with other aspects of software development: for code quality, observability, and SCA, proven tooling exists with products like SonarQube, Datadog, New Relic, GitHub, and Snyk. The software architecture segment, however, has lagged behind without any proven solutions. This is unfortunate, given that ATD is consistently the largest - and most damaging - type of technical debt, as found in the "Measure It? Manage It? Ignore It? Software Practitioners and Technical Debt" 2015 study published by Carnegie Mellon.
The following illustration summarizes Figure 4 from that report, concluding that bad architecture choices were the clear leader in sources of technical debt. If not managed, ATD can continue to grow over time at an increasing rate, as demonstrated in this simple illustration:

Without mitigation, architecture debt will eventually reach a breaking point for the underlying solution being measured.

Managing ATD

Before we can manage ATD, we must first understand the problem. Desmond Tutu once wisely said that "there is only one way to eat an elephant: a bite at a time." The shift-left approach embraces the concept of moving a given aspect closer to the beginning of a lifecycle rather than the end. This concept gained popularity with shift-left testing, where the test phase was moved into the development process instead of being a separate event completed after development was finished.

Shift-left can be implemented in two different ways when managing ATD:

- Shift-left for resiliency: Identify sources that have an impact on resiliency, and then fix them before they manifest in performance.
- Shift-left for security: Detect and mitigate security issues during the development lifecycle.

Just like shift-left for testing, a prioritized focus on resilience and security during the development phase reduces the potential for unexpected incidents.

Architectural Observability

Architectural observability gives engineering teams the ability to incrementally address architectural drift within their services at a macro level. In fact, the Wall Street Journal reported the cost to fix technical debt at $1.52 trillion earlier this year in the article "The Invisible $1.52 Trillion Problem: Clunky Old Software." To be successful, engineering leadership must be in full alignment with the following organizational objectives:

- Resiliency: To recover swiftly from unexpected incidents.
- Scalability: To scale appropriately with customer demand.
- Velocity: To deliver features and enhancements in line with product expectations.
- Cloud suitability: To transform legacy solutions into efficient cloud-native service offerings.

I recently discovered vFunction's AI-driven architectural observability platform, which is focused on the following deliverables:

- Discover the real architecture of solutions via static and dynamic analysis.
- Prevent architectural drift via real-time views of how services are evolving.
- Increase the resiliency of applications via the elimination of unnecessary dependencies and improvements between application domains and their associated resources.
- Manage and remediate tech debt via AI-driven observability.

Additionally, the vFunction platform provides the side benefit of offering a migration path to transform monoliths into cloud-native solutions. Once teams have modernized their platforms, they can continuously observe them for ongoing drift. If companies already have microservices, they can use vFunction to detect complexity in distributed applications and address dependencies that impact resiliency and scalability. In either case, once implemented, engineering teams can mitigate ATD well before reaching the breaking point. In the illustration above, engineering teams are able to mitigate technical debt as part of each release, thanks to the implementation of the vFunction platform and an underlying shift-left approach.
Conclusion My readers may recall that I have been focused on the following mission statement, which I feel can apply to any IT professional: “Focus your time on delivering features/functionality that extends the value of your intellectual property. Leverage frameworks, products, and services for everything else.” — J. Vester The vFunction platform adheres to my mission statement by helping engineering teams employ a shift-left approach to the resiliency and security of their services at a macro level. This is an important distinction because, without such tooling, teams are likely to mitigate at the micro level, resolving tech debt that doesn’t really matter from an organizational perspective. When I think back to the application that first made me realize the challenges of tech debt, I can’t help but think about how that solution yielded more issues than benefits with each feature that was introduced. Certainly, shift-left for resiliency alone would have surfaced issues with the underlying architecture at a point when considering alternatives was still feasible. If you are interested in learning more about the vFunction solution, you can read more about them here. Have a really great day!
I bet you might have come across a scenario while automating API/web or mobile applications where, while registering a user, you may be setting the address for checking out a product in the end-to-end user journey in test automation. So, how do you do that? Normally, we create a POJO class in Java with the fields required to register a user or to set the address for checking out the product and then set the values in the test using the constructor of the POJO class. Let’s take a look at an example of registering a user where the following are mandatory fields required to fill in the registration form: First Name Last Name Address City State Country Mobile Number As we need to handle these fields in automation testing we will have to pass on respective values in the fields at the time of executing the tests. Before Using the Builder Pattern A POJO class, with the above-mentioned mandatory fields, will be created with the Getter and Setter methods inside that POJO class, and using a constructor values are set in the respective fields. Check out the code example of RegisterUser class given below for the representation of what we are discussing. Java public class RegisterUser { private String firstName; private String lastName; private String address; private String city; private String state; private String country; private String mobileNumber; public RegisterUser (final String firstName, final String lastName, final String address, final String city, final String state, final String country, final String mobileNumber) { this.firstName = firstName; this.lastName = lastName; this.address = address; this.city = city; this.state = state; this.country = country; this.mobileNumber = mobileNumber; } public String getFirstName () { return firstName; } public void setFirstName (final String firstName) { this.firstName = firstName; } public String getLastName () { return lastName; } public void setLastName (final String lastName) { this.lastName = lastName; } public String getAddress () { return address; } public void setAddress (final String address) { this.address = address; } public String getCity () { return city; } public void setCity (final String city) { this.city = city; } public String getState () { return state; } public void setState (final String state) { this.state = state; } public String getCountry () { return country; } public void setCountry (final String country) { this.country = country; } public String getMobileNumber () { return mobileNumber; } public void setMobileNumber (final String mobileNumber) { this.mobileNumber = mobileNumber; } } Now, if we want to use this POJO, we would have to create an instance of RegisterUser class and pass the values in the constructor parameters as given in the code example below to set the data in the respective fields. Check out the below example of the Register User test of how we do it. Java public class RegistrationTest { @Test public void testRegisterUser () { RegisterUser registerUser = new RegisterUser ("John", "Doe", "302, Adam Street, 1st Lane", "New Orleans", "New Jersey", "US", "52145364"); assertEquals (registerUser.getFirstName (), "John"); assertEquals (registerUser.getCountry (), "US"); } } There were just seven fields in the example we took for registering the user. However, this would not be the case with every application. 
As applications grow, more fields are required, and every time a new field is added we need to update the POJO class with the respective Getter and Setter methods, add a parameter to the constructor, and then supply a value for that field wherever the object is created. Long story short, even a single new field forces a code update, and passing a long list of constructor parameters in the tests doesn’t look clean either. Luckily, the Builder Design Pattern in Java comes to the rescue here. What Is Builder Design Pattern in Java? Builder design pattern is a creational design pattern that lets you construct complex objects step by step. The pattern allows you to produce different types and representations of an object using the same construction code. The Builder pattern solves the problem of setting many parameters by building the object step by step and exposing a method that returns the final object, which can then be used in the actual tests. What Is Lombok? Project Lombok is a Java library that automatically plugs into your editor and build tools, spicing up your Java. It is an annotation-based library that reduces boilerplate, letting us write short and crisp code. By placing the @Getter annotation on the class, Lombok automatically generates the Getter methods. Similarly, you don’t have to write the code for Setter methods either: the @Setter annotation added to the class automatically generates them. Lombok also supports the Builder design pattern, so we just need to put the @Builder annotation above the class and the rest is taken care of by the library. To use Lombok annotations in the project we need to add the following Maven dependency: XML <!-- https://mvnrepository.com/artifact/org.projectlombok/lombok --> <dependency> <groupId>org.projectlombok</groupId> <artifactId>lombok</artifactId> <version>1.18.32</version> <scope>provided</scope> </dependency> Using the Builder Design Pattern With Lombok Before we start refactoring the code we have written, let me tell you about the DataFaker library and how it helps in generating fake data for testing. Ideally, in our example, every newly registered user’s data should be unique; otherwise, we may get a duplicate data error and the test will fail. The DataFaker library provides unique data on each test execution, so a new user can be registered with fresh data every time the registration test is run. To use the DataFaker library, we need to add the following Maven dependency to our project. XML <!-- https://mvnrepository.com/artifact/net.datafaker/datafaker --> <dependency> <groupId>net.datafaker</groupId> <artifactId>datafaker</artifactId> <version>2.2.2</version> </dependency> Now, let's start refactoring the code. First, we will make the changes to the RegisterUser class. We will remove all the Getter and Setter methods as well as the constructor, and add the @Getter and @Builder annotations on top of the RegisterUser class.
Here is how the RegisterUser class looks now after the refactoring Java @Getter @Builder public class RegisterUserWithBuilder { private String firstName; private String lastName; private String address; private String city; private String state; private String country; private String mobileNumber; } How clean and crisp it looks with that refactoring being done. Multiple lines of code are removed still it will still work in the same fashion as it used to earlier, thanks to Lombok. We would have to add a new Java class for generating the fake data on runtime using the Builder design pattern. We would be calling this new class the DataBuilder class. Java public class DataBuilder { private static final Faker FAKER = new Faker(); public static RegisterUserWithBuilder getUserData () { return RegisterUserWithBuilder.builder () .firstName (FAKER.name () .firstName ()) .lastName (FAKER.name () .lastName ()) .address (FAKER.address () .streetAddress ()) .state (FAKER.address () .state ()) .city (FAKER.address () .city ()) .country (FAKER.address () .country ()) .mobileNumber (String.valueOf (FAKER.number () .numberBetween (9990000000L, 9999999999L))) .build (); } } The getUserData() method will return the test data required for registering the user using the DataFaker library. Notice the builder() method used after the class name RegisterUserWithBuilder. It appears because of the @Builder annotation we have used on the top of the RegisterUserWithBuilder class. After the builder() method we have to pass on the variables we have declared in the RegisterUserWithBuilder class and accordingly, pass the fake data that we need to generate for the respective variables. Java RegisterUserWithBuilder.builder () .firstName (FAKER.name () .firstName ()); The above piece of code will generate a fake first name and set it in the first name variable. Likewise, we have set fake data in all other variables. Now, let’s move towards how we use these data in the tests. It's very simple, the below code snippet explains it all. Java @Test public void testRegisterUserWithBuilder () { RegisterUserWithBuilder registerUserWithBuilder = getUserData (); System.out.println (registerUserWithBuilder.getFirstName ()); System.out.println (registerUserWithBuilder.getLastName ()); System.out.println (registerUserWithBuilder.getAddress ()); System.out.println (registerUserWithBuilder.getCity ()); System.out.println (registerUserWithBuilder.getState ()); System.out.println (registerUserWithBuilder.getCountry ()); System.out.println (registerUserWithBuilder.getMobileNumber ()); } We just need to call the getUserData() method while instantiating the RegisterUserWithBuilder class. Next, we would be calling the Getter methods for the respective variables we declared inside the RegisterUserWithBuilder class. Remember we had passed the @Getter annotation on the top of the RegisterUserWithBuilder class, this actually helps in calling the Getter methods here. Also, we are not required to pass on multiple data as the constructor parameters for the RegisterUserWithBuilder class, instead we just need to instantiate the class and call the getUserData() method! How easy it is to generate the unique data and pass it on in the tests without having to write multiple lines of boilerplate codes. Thanks to the Builder design pattern and Lombok! Running the Test Let’s run the test and check if the user details get printed in the console. We can see that the fake data is generated successfully in the following screenshot of the test execution. 
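Before wrapping up, it is worth seeing roughly what Lombok generates for us. The hand-written builder below is a simplified, hypothetical equivalent of what the @Builder annotation produces, shown for just two of the fields; the real generated code covers every field, but the shape is the same:
Java
public class RegisterUserWithBuilder {

    private final String firstName;
    private final String lastName;
    // remaining fields omitted for brevity in this sketch

    private RegisterUserWithBuilder(Builder builder) {
        this.firstName = builder.firstName;
        this.lastName = builder.lastName;
    }

    public static Builder builder() {
        return new Builder();
    }

    public String getFirstName() { return firstName; }
    public String getLastName() { return lastName; }

    // Hand-written equivalent of what @Builder generates (simplified)
    public static class Builder {
        private String firstName;
        private String lastName;

        public Builder firstName(String firstName) {
            this.firstName = firstName;
            return this; // returning 'this' enables method chaining
        }

        public Builder lastName(String lastName) {
            this.lastName = lastName;
            return this;
        }

        public RegisterUserWithBuilder build() {
            return new RegisterUserWithBuilder(this);
        }
    }
}
Every line of this is code we no longer have to write and maintain ourselves once the @Builder annotation is in place.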
Conclusion In this blog, we discussed making use of the builder design pattern in Java with Lombok and DataFaker library to generate fake test data on run time and use it in our automated tests. It would ease the test data generation process by eliminating the need to update the test data before running the tests. I hope it will help you a lot in reducing the code lines and writing your code in a much cleaner way. Happy Testing!
In the first part of our system design series, we introduced MarsExpress, a fictional startup tackling the challenge of scaling from a local entity to a global presence. We explored the initial steps of transitioning from a monolithic architecture to a scalable solution, setting the stage for future growth. As we continue this journey, Part 2 focuses on the critical role of the caching layer. We’ll delve into the technological strategies and architectural decisions essential for implementing effective caching, which are crucial for handling millions of users efficiently. Cache Layer For read-heavy applications, relying solely on a primary-replica database architecture often falls short of meeting the performance and scalability demands. While this architecture can improve read throughput by distributing read queries among replicas, it still encounters bottlenecks, especially under massive read load scenarios. This is where the implementation of a distributed cache layer becomes not just beneficial, but essential. A distributed cache, positioned in front of the database layer, can serve frequently accessed data with significantly lower latency than a database, dramatically reducing the load on the primary database and its replicas. By caching the results of read operations, applications can achieve near-instant data access for the majority of requests, leading to a more responsive user experience and higher throughput. Moreover, a distributed cache scales horizontally, offering a more flexible and cost-effective solution for managing read-heavy workloads compared to scaling databases vertically or adding more replicas. This approach not only alleviates pressure on the database but also ensures high availability and fault tolerance, as cached data is distributed across multiple nodes, minimizing the impact of individual node failures. In essence, for read-heavy applications aiming for scalability and high performance, incorporating a distributed cache layer is a critical strategy that complements and extends the capabilities of primary-replica database architectures. A key characteristic of distributed caches is their ability to handle massive amounts of data by partitioning it across multiple nodes. This approach, often implemented using consistent hashing (a minimal code sketch appears below), balances the load evenly and allows for easy scaling by adding or removing nodes. Additionally, replication ensures data redundancy, enhancing fault tolerance. For instance, Redis Cluster and Hazelcast are popular implementations that provide automatic data partitioning and failover. One of the primary benefits of distributed caches is their ability to significantly improve read performance. By caching frequently accessed data across multiple nodes, applications can serve data with minimal latency, bypassing the database for most read requests. This reduction in database load not only improves response times but also enhances overall system throughput. Furthermore, eviction policies like Least Recently Used (LRU) or Least Frequently Used (LFU) help manage memory efficiently by discarding less important data, ensuring the cache remains performant. However, implementing a distributed cache requires careful consideration of several factors. Network latency can be a critical issue, especially in geo-distributed setups, and must be minimized through strategic placement of cache nodes.
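To make the partitioning idea mentioned above more concrete, here is a minimal, illustrative consistent-hash ring in Java. It is a simplified sketch (no virtual nodes or replication, and nodes and keys are plain strings), not a production implementation like the ones inside Redis Cluster or Hazelcast:
Java
import java.util.TreeMap;

public class ConsistentHashRing {

    // Positions on the ring, mapped to the cache node that owns them
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    public void removeNode(String node) {
        ring.remove(hash(node));
    }

    // A key is served by the first node clockwise from its hash position
    public String nodeFor(String key) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("No cache nodes registered");
        }
        Integer nodeHash = ring.ceilingKey(hash(key));
        if (nodeHash == null) {
            nodeHash = ring.firstKey(); // wrap around the ring
        }
        return ring.get(nodeHash);
    }

    private int hash(String value) {
        // A real implementation would use a stronger, well-distributed hash function
        return value.hashCode() & 0x7fffffff;
    }
}
Because only the keys between a removed node and its predecessor change owners, adding or removing a node invalidates a small fraction of the cached data instead of reshuffling everything.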
Beyond data placement, consistency models, ranging from eventual consistency to strong consistency, need to be chosen based on application requirements. Security is also paramount, necessitating encryption of data in transit and at rest, along with robust authentication and authorization mechanisms. Monitoring and metrics play a crucial role in maintaining a distributed cache. Tracking metrics such as hit/miss ratios, latency, and throughput helps identify performance bottlenecks and optimize the cache configuration. Regular monitoring ensures the cache operates efficiently, adapting to changing workloads and maintaining high availability. Distributed caches excel in environments with high read traffic, such as social media platforms, e-commerce sites, and real-time analytics systems. They are also effective in session storage, providing quick access to user session data across large-scale web applications. By leveraging distributed caches, engineers can significantly enhance the performance, scalability, and reliability of their systems, ensuring they can meet the demands of millions of users. Sharding and Horizontal Scaling Sharding and horizontal scaling are fundamental strategies in distributed systems to improve performance, scalability, and fault tolerance. Each approach addresses different aspects of data distribution and system growth: Sharding Sharding involves dividing data into smaller subsets (shards) and distributing them across multiple nodes or databases. Each shard operates independently, handling a portion of the overall workload. Sharding enhances scalability by allowing distributed systems to manage larger datasets and higher transaction volumes effectively. Horizontal Scaling Horizontal scaling refers to adding more identical resources (e.g., servers, cache nodes) to a system to distribute workload and increase capacity. It aims to improve performance and accommodate growing demands by leveraging additional hardware resources in parallel. In distributed caching systems, sharding is crucial for efficiently managing data across multiple cache nodes. By partitioning data into shards and distributing them among cache servers, sharding enhances data locality and reduces contention for resources. Each cache node manages a subset of data, enabling parallel processing and improving overall throughput. For example, in a sharded Redis cluster, data keys are distributed across multiple Redis instances (shards), ensuring scalable read and write operations across the cache. On the other hand, horizontal scaling complements sharding by allowing distributed caching systems to expand capacity seamlessly. Adding more cache nodes enhances system performance and accommodates increased data storage and access requirements. For instance, a horizontally scaled Memcached cluster can handle growing volumes of cached data and client requests by adding additional cache servers and distributing the workload evenly across nodes to maintain low-latency access. As you may recall from Part 1, our fictitious MarsExpress (a local delivery startup based in Albuquerque) uses a distributed caching system to optimize delivery tracking and logistics operations. Here’s how sharding and horizontal scaling play crucial roles in their system. MarsExpress employs sharding in its distributed caching solution to manage real-time tracking data for delivery orders. The system partitions tracking data into geographical regions (shards), with each shard corresponding to deliveries in specific areas (e.g., downtown, suburbs).
By distributing data across shards, MarsExpress ensures efficient access and updates to delivery statuses, minimizing latency and optimizing resource utilization. Each cache node is responsible for a specific region, so data is stored closer to where it is needed most. Previously, with a single caching server, latency averaged 20 milliseconds per request; after sharding, it can be reduced to 10 milliseconds or less. This approach not only enhances delivery tracking efficiency but also supports scalability as MarsExpress expands its service areas. As MarsExpress expands its delivery services to cover more neighborhoods and handle increasing delivery volumes, horizontal scaling becomes essential. The team scales the distributed caching infrastructure horizontally by adding more cache nodes. Each new node enhances system capacity and performance, allowing MarsExpress to handle concurrent requests and store larger datasets without compromising delivery tracking accuracy or responsiveness. Horizontal scaling also plays a crucial role in accommodating increased transaction volumes and customer demands. Initially, MarsExpress might handle 5,000 delivery tracking updates per minute with a single caching server. By horizontally scaling and adding more nodes, this capacity can be doubled or even tripled, enabling the system to handle peak delivery periods without compromising performance. This scalability ensures that MarsExpress can maintain real-time visibility into delivery operations, providing customers with accurate tracking information and enhancing overall service reliability. In terms of fault tolerance and availability, the adoption of distributed caching strategies gives the system improved resilience against failures. Implementing sharding and maintaining redundant copies of data across multiple nodes can minimize the risk of service disruptions. For instance, with sharding and redundant caching nodes in place, the system can achieve uptime rates of 99.9% or higher. This high availability ensures that customers can track their deliveries seamlessly, even during unforeseen technical issues or maintenance activities. Moreover, the cost efficiency of MarsExpress’s operations is positively impacted by these caching strategies. Initially, operational costs associated with managing and scaling the caching infrastructure may be high due to limited capacity. However, through effective sharding and horizontal scaling, MarsExpress can optimize resource utilization and reduce overhead costs per transaction. In fact, optimizing resource usage through sharding can lead to a 30% reduction in operational costs, while horizontal scaling can further enhance cost efficiency by leveraging economies of scale and improving overall performance metrics. Popular Distributed Caches Let’s explore Redis, Memcached, and Apache Ignite in practical scenarios to understand their strengths and use cases. Redis Redis is renowned for its versatility and speed in handling various types of data structures. It’s commonly used as a distributed cache due to its in-memory storage and support for data persistence.
In practice, Redis excels in scenarios requiring fast read and write operations, such as session caching, real-time analytics, and leaderboard systems. Its ability to store not just simple key-value pairs but also lists, sets, and sorted sets makes it adaptable to a wide range of caching needs. Redis’s replication and clustering features enhance its resilience and scalability. In real-world applications, setting up Redis as a distributed cache involves configuring master-slave replication or using Redis Cluster for automatic sharding and high availability. These features ensure that even under heavy loads, Redis can maintain performance and reliability. Memcached Memcached is another popular choice for distributed caching, valued for its simplicity and speed. Unlike Redis, Memcached focuses solely on key-value caching without persistence. It’s highly optimized for fast data retrieval and is typically used to alleviate database load by caching frequently accessed data. In practical applications, Memcached shines in scenarios where data volatility isn’t a concern and where rapid access to cached items is critical, such as in web applications handling session data, API responses, and content caching. Its distributed nature allows scaling out by adding more nodes to the cluster, increasing caching capacity, and improving overall performance. Apache Ignite Apache Ignite combines in-memory data grid capabilities with distributed caching and processing. It’s often chosen for applications requiring both caching and computing capabilities in a single platform. In practice, Apache Ignite is used for distributed SQL queries, machine learning model training with cached data, and real-time analytics. What sets Apache Ignite apart is its ability to integrate with existing data sources like RDBMS, NoSQL databases, and Hadoop, making it suitable for hybrid data processing and caching scenarios. Its distributed nature ensures high availability and fault tolerance, critical for handling large-scale datasets and processing complex queries across a cluster. When selecting a distributed cache, practical considerations such as ease of integration, operational overhead, and community support often play a crucial role. From an operational standpoint, configuring and monitoring distributed caches requires expertise in managing clusters, handling failover scenarios, and optimizing cache eviction policies to ensure efficient memory usage. In fact, understanding the trade-offs between consistency, availability, and partition tolerance (CAP theorem) is essential. Distributed caches like Redis and Memcached prioritize availability and partition tolerance, making them suitable for use cases where immediate access to cached data is paramount. Apache Ignite, with its focus on consistency and integration with other data processing frameworks, appeals to applications needing unified data management and computation. Ultimately, the choice of a distributed cache depends on specific application requirements, performance goals, and the operational expertise available. Each of these caches brings unique strengths and trade-offs, making them valuable tools in modern distributed computing environments. Caching Policies Effective caching policies are crucial for optimizing distributed cache performance and reliability. Studies indicate that implementing appropriate caching strategies can reduce database load by up to 70% and improve response times by 80%, significantly enhancing user experience and system efficiency. 
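The policies described next differ mainly in who populates the cache and when writes reach the underlying database. As a reference point, here is a minimal, illustrative sketch of the most common of them, cache-aside, in Java; the CacheClient and ProductRepository types are hypothetical placeholders rather than a specific library API:
Java
import java.time.Duration;
import java.util.Optional;

public class ProductService {

    private final CacheClient cache;          // hypothetical cache abstraction (e.g., backed by Redis)
    private final ProductRepository database; // hypothetical data-access layer

    public ProductService(CacheClient cache, ProductRepository database) {
        this.cache = cache;
        this.database = database;
    }

    // Cache-aside: the application checks the cache first and only
    // falls back to the database (and populates the cache) on a miss.
    public Product findProduct(String productId) {
        Optional<Product> cached = cache.get(productId, Product.class);
        if (cached.isPresent()) {
            return cached.get();
        }
        Product fromDb = database.findById(productId);
        cache.put(productId, fromDb, Duration.ofMinutes(10)); // TTL keeps stale entries from lingering
        return fromDb;
    }

    interface CacheClient {
        <T> Optional<T> get(String key, Class<T> type);
        void put(String key, Object value, Duration ttl);
    }

    interface ProductRepository {
        Product findById(String id);
    }

    record Product(String id, String name) { }
}
The same skeleton extends naturally to the write-oriented policies: write-through would update the cache and the database together inside a save method, while write-behind would enqueue the database write for asynchronous processing.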
To illustrate these strategies, let’s revisit MarsExpress, our fictitious startup aiming for global scalability. As MarsExpress expanded, it faced increased load and latency issues, particularly with read-heavy operations. The team implemented several caching policies to address these challenges. Cache-Aside (Lazy Loading) We used this policy to minimize initial load times. When a user requested data not in the cache, the system fetched it from the database, cached it, and returned the result. For example, when users frequently accessed the latest mission updates, the first request after a cache miss would be slightly slower, but subsequent requests were instant. This reduced direct database queries and ensured that frequently accessed data was readily available. For example, Facebook employs cache-aside to manage its massive scale. Frequently accessed user data, like profile information, is fetched from the database upon cache misses, and then cached for subsequent requests. This reduces database load and speeds up response times for users. This approach is preferred when application data access patterns are unpredictable. It ensures that only necessary data is cached, optimizing memory usage and reducing unnecessary cache population. Read-Through To streamline data access, we configured our cache to query the database on a cache miss directly. This approach simplified application logic and ensured that data in the cache was always up-to-date, reducing the complexity of manually managing cache refreshes. For instance, when users looked up historical mission data, the cache would fetch the latest data if not already available, ensuring consistency. Netflix uses read-through caching for its recommendation engine. When a cache miss occurs, the system fetches the latest recommendations from the database and updates the cache, ensuring users always see the most current data. Read-through is better for applications where data consistency is critical, and frequent database updates are needed. It simplifies the development process by abstracting the caching layer from the application code. Write-Through Ensuring data consistency was critical for MarsExpress, especially for transactions. By writing data to the cache and database simultaneously, they maintained synchronization, ensuring users always had access to the most current information without added complexity. This was crucial for real-time telemetry data, where accuracy was paramount. Financial institutions often use write-through caching to ensure transactional data consistency. Every write operation updates both the cache and the database, guaranteeing that cached data is always synchronized with the underlying data store. This policy is ideal for applications requiring strong consistency and immediate data propagation, ensuring that the cache and database remain in sync. Write-Behind (Write-Back) To optimize write performance, MarsExpress adopted a write-behind policy for non-critical data, such as user activity logs. This allowed the cache to handle writes quickly and batch database updates asynchronously, reducing write latency and database load. For example, user feedback and interaction logs were cached and later written to the database in batches, ensuring the system remained responsive. E-commerce platforms like Amazon use write-behind caching for logging user activities and interactions. This ensures fast write performance and reduces the immediate load on the database. 
This policy is preferred for applications where high write throughput is needed, and eventual consistency is acceptable. It improves performance by deferring database updates. Refresh-Ahead Anticipating user behavior, MarsExpress used refresh-ahead to update cache entries before they expired. By predicting which data would be requested next, they ensured that users experienced minimal latency, particularly during peak times. This was particularly useful for scheduled data releases, where the cache preloaded updates right before they went live. News websites use refresh-ahead to keep their front-page articles updated. By preloading anticipated popular articles, they ensure minimal latency when users access the latest news. This strategy is useful for applications with predictable access patterns. It ensures that frequently accessed data is always fresh, reducing latency during peak access times. Eviction Policies: Ensuring Optimal Cache Performance Managing cache memory efficiently is critical for maintaining high performance and responsiveness. Eviction policies determine which data to remove when the cache reaches its capacity, ensuring that the most relevant data remains accessible. Least Recently Used (LRU) MarsExpress implemented the LRU eviction policy to manage its high volume of data. This policy evicts the least recently accessed items, ensuring that frequently accessed data remains in the cache. For instance, older telemetry data was evicted in favor of newer, more relevant data. Twitter uses LRU eviction to manage tweet caches. Older, less accessed tweets are evicted to make room for new ones, ensuring the cache contains the most relevant data. LRU is effective in scenarios where recently accessed data is likely to be accessed again. It optimizes cache usage by retaining the most relevant data, making it ideal for applications with access patterns that favor recency. Least Frequently Used (LFU) In contrast to LRU, the LFU policy evicts items that are accessed least often. MarsExpress considered LFU for its user profile cache, ensuring that popular profiles remained cached while infrequently accessed profiles were evicted. Content delivery networks (CDNs) often use LFU to manage cached content, ensuring that popular content remains available to users while less popular content is evicted. LFU is beneficial for applications where certain data is accessed repeatedly over a long period. It ensures that the most popular data remains in the cache, optimizing for long-term access patterns. Time-To-Live (TTL) MarsExpress utilized TTL settings to automatically expire stale data. Each cache entry had a defined lifespan, after which it was removed from the cache, ensuring that outdated information did not linger. Online retail platforms like Shopify use TTL to keep product availability and pricing information current. Changes in inventory or price immediately invalidate outdated cache entries. TTL is crucial for applications where data freshness is vital. It ensures that the cache reflects the most current data, reducing the risk of serving stale information. TTL is particularly useful in dynamic environments where data changes frequently. Custom Eviction Policies MarsExpress experimented with custom eviction policies tailored to specific application needs. For example, they combined LRU with TTL for their mission data cache, ensuring both recency and freshness were maintained. 
Google uses custom eviction policies for its search index, balancing freshness and relevance to provide the most accurate search results. Custom policies offer flexibility to address unique application requirements. They can combine elements of different eviction strategies to optimize cache performance based on specific data access patterns and business needs. By carefully selecting and implementing these eviction policies, MarsExpress ensured that its cache remained performant and responsive, even as data volumes grew. These strategies not only improved system performance but also enhanced the overall user experience, showcasing the importance of well-implemented eviction policies in large-scale system design. Conclusion As MarsExpress continues to evolve and meet the demands of millions, the integration of a distributed caching layer has proven to be pivotal. By strategically employing sharding, horizontal scaling, and carefully chosen caching policies, MarsExpress has optimized performance, enhanced scalability, and ensured data consistency and availability. These strategies have not only improved user experience but have also demonstrated the critical role of distributed caching in modern system design. In Part 3 of our series, we will explore the transition to microservices, delving into how breaking down applications into smaller, independent services can further enhance scalability, resilience, and flexibility. Stay tuned as we continue to guide MarsExpress on its journey to mastering system design.
Imagine coding with a safety net that catches errors before they happen. That's the power of TDD. In this article, we'll dive into how it can revolutionize your development workflow. In Test Driven Development (TDD), a developer writes test cases first before actually writing code to implement the functionality. There are several practical benefits to developing code with the TDD approach such as: Higher quality code: Thinking about tests upfront forces you to consider requirements and design more carefully. Rapid feedback: You get instant validation, reducing the time spent debugging. Comprehensive test coverage: TDD ensures that your entire codebase is thoroughly tested. Refactoring confidence: With a strong test suite, you can confidently improve your code without fear of breaking things. Living documentation: Your tests serve as examples of how the code is meant to be used. TDD has three main phases: Red, Green, and Refactor. The red phase means writing a test case and watching it fail. The green phase means writing minimum code to pass the test case. The refactor phase means improving the code with refactoring for better structure, readability, and maintainability without changing the functionality while ensuring test cases still pass. We will build a Login Page in React, and cover all these phases in detail. The full code for the project is available here, but I highly encourage you to follow along as TDD is as much about the process as it's about the end product. Prerequisites Here are some prerequisites to follow along in this article. Understanding of JavaScript and React NodeJS and NPM installed Code Editor of your choice Initiate a New React App Ensure NodeJS and npm are installed with node -v and npm -v Create a new react app with npx create-react-app tddreact Go to the app folder and start the app with cd tddreact and then npm start Once the app compiles fully, navigate to the localhost. You should see the app loaded. Adding Test Cases As mentioned earlier, in Test-Driven Development (TDD) you start by writing your initial test cases first. Create __tests__ folder under src folder and a filename Login.test.js Time to add your first test case, it is basic in nature ensuring the Login component is present. JavaScript // src/__tests__/Login.test.js import React from 'react'; import { render, fireEvent } from '@testing-library/react'; import '@testing-library/jest-dom/extend-expect'; import Login from '../components/Login'; test('renders Login component', () => { render(<Login />); }); Running the test case with npm test, you should encounter failure like the one below. This is the Red Phase we talked about earlier. Now it's time to add the Login component and initiate the Green Phase. Create a new file under src/components directory and name it Login.js, and add the below code to it. JavaScript // src/components/Login.js import React from 'react'; const Login = () => { return ( <> <p>Hello World!</p> </> ) } export default Login; The test case should pass now, and you have successfully implemented one cycle of the Red to Green phase. Adding Our Inputs On our login page, users should have the ability to enter a username and password and hit a button to log in. Add test cases in which username and password fields should be present on our page. 
JavaScript test('renders username input field', () => { const { getByLabelText } = render(<Login />); expect(getByLabelText(/username/i)).toBeInTheDocument(); }); test('renders password input field', () => { const { getByLabelText } = render(<Login />); expect(getByLabelText(/password/i)).toBeInTheDocument(); }); test('renders login button', () => { const { getByRole } = render(<Login />); expect(getByRole('button', { name: /login/i })).toBeInTheDocument(); }); You should start to see some test cases failing again. Update the return method of the Login component code as per below, which should make the failing test cases pass. JavaScript // src/components/Login.js return ( <> <div> <form> <div> <label htmlFor="username">Username</label> <input type="text" id="username" /> </div> <div> <label htmlFor="password">Password</label> <input type="password" id="password" /> </div> <button type="submit">Login</button> </form> </div> </> ) Adding Login Logic Now you can add actual login logic. For simplicity, when the user has not entered the username and password fields and hits the login button, an error message should be displayed. When the user has entered both the username and password fields and hits the login button, no error message should be displayed; instead, a welcome message, such as "Welcome John Doe." should appear. These requirements can be captured by adding the following tests to the test file: JavaScript test('shows validation message when inputs are empty and login button is clicked', async () => { const { getByRole, getByText } = render(<Login />) fireEvent.click(getByRole('button', { name: /login/i })); expect(getByText(/please fill in all fields/i)).toBeInTheDocument(); }); test('does not show validation message when inputs are filled and login button is clicked', () => { const handleLogin = jest.fn(); const { getByLabelText, getByRole, queryByText } = render(<Login onLogin={handleLogin} />); fireEvent.change(getByLabelText(/username/i), { target: { value: 'user' } }); fireEvent.change(getByLabelText(/password/i), { target: { value: 'password' } }); fireEvent.click(getByRole('button', { name: /login/i })); expect(queryByText(/welcome john doe/i)).toBeInTheDocument(); }) This should have caused test case failures, verify them using npm test if tests are not running already. Let's implement this feature in the component and pass the test case. Update the Login component code to add missing features as shown below. 
JavaScript // src/components/Login.js import React, { useState } from 'react'; const Login = () => { const [username, setUsername] = useState(''); const [password, setPassword] = useState(''); const [error, setError] = useState(''); const [isLoggedIn, setIsLoggedIn] = useState(false); const handleSubmit = (e) => { e.preventDefault(); if (!username || !password) { setError('Please fill in all fields'); setIsLoggedIn(false); } else { setError(''); setIsLoggedIn(true); } }; return ( <div> {!isLoggedIn && ( <div> <h1>Login</h1> <form onSubmit={handleSubmit}> <div> <label htmlFor="username">Username</label> <input type="text" id="username" value={username} onChange={(e) => setUsername(e.target.value)} /> </div> <div> <label htmlFor="password">Password</label> <input type="password" id="password" value={password} onChange={(e) => setPassword(e.target.value)} /> </div> <button type="submit">Login</button> </form> {error && <p>{error}</p>} </div> )} {isLoggedIn && <h1>Welcome John Doe</h1>} </div> ); }; export default Login; For most practical scenarios, the Login component should notify the parent component that the user has logged in. Let’s add a test case to cover the feature. After adding this test case, verify your terminal for the failing test case. JavaScript test('notifies parent component after successful login', () => { const handleLogin = jest.fn(); const { getByLabelText, getByText } = render(<Login onLogin={handleLogin} />); fireEvent.change(getByLabelText(/username/i), { target: { value: 'testuser' } }); fireEvent.change(getByLabelText(/password/i), { target: { value: 'password' } }); fireEvent.click(getByText(/login/i)); expect(handleLogin).toHaveBeenCalledWith('testuser'); expect(getByText(/welcome john doe/i)).toBeInTheDocument(); }); Let's implement this feature in the Login component. Update the Login component to receive onLogin function and update handleSubmit as per below. JavaScript const Login = ({ onLogin }) => { /* rest of the Login component code */ const handleSubmit = (e) => { e.preventDefault(); if (!username || !password) { setError('Please fill in all fields'); setIsLoggedIn(false); } else { setError(''); setIsLoggedIn(true); onLogin(username); } }; /* rest of the Login component code */ } Congratulations, the Login component is implemented and all the tests should pass as well. Integrating Login Components to the App create-react-app adds boilerplate code to the App.js file. Let's delete everything from App.js file before you start integrating our Login component. If you see App.test.js file, delete that as well. As again, let's add our test cases for the App component first. 
Add a new file under the __tests__ directory named App.test.js JavaScript // App.test.js import React from 'react'; import { render, screen, fireEvent } from '@testing-library/react'; import App from '../App'; // Mock the Login component jest.mock('../components/Login', () => (props) => ( <div> <button onClick={props.onLogin}>Mock Login</button> </div> )); describe('App component', () => { test('renders the App component', () => { render(<App />); expect(screen.getByText('Mock Login')).toBeInTheDocument(); }); test('sets isLoggedIn to true when Login button is clicked', () => { render(<App />); const loginButton = screen.getByText('Mock Login'); fireEvent.click(loginButton); expect(screen.getByText('You are logged in.')).toBeInTheDocument(); }); }); Key insights you can derive from these test cases: The App component holds the Login component, and on successful login a variable like isLoggedIn is needed to indicate the state of the login feature. Once the user is successfully logged in, you need to use this variable and conditionally display the text You are logged in. You are mocking the Login component - this is important as you don’t want the App component’s unit test cases to be testing the Login component as well. You already covered the Login component’s test cases earlier. Implement the App component with the features described. Add the below code to the App.js file. JavaScript import React, { useState } from 'react'; import './App.css'; import Login from './components/Login'; function App() { const [isLoggedIn, setIsLoggedIn] = useState(false); const onLogin = () => { setIsLoggedIn(true); } return ( <div className="App"> <Login onLogin={onLogin} /> {isLoggedIn && <p>You are logged in.</p>} </div> ); } export default App; All the test cases should pass again now; start the application with npm start. You should see the below page at the localhost. Enhancing Our App Now you have reached a crucial juncture in the TDD process — the Refactor Phase. The Login page’s look and feel is very bare-bones. Let’s enhance it by adding styles and updating the render method of the Login component. Create a new file named Login.css alongside the Login.js file and add the below style to it. CSS /* src/components/Login.css */ .login-container { display: flex; justify-content: center; align-items: center; height: 100vh; background-color: #f0f4f8; } .login-form { background: #ffffff; padding: 20px; border-radius: 10px; box-shadow: 0 0 10px rgba(0, 0, 0, 0.1); width: 300px; text-align: center; } .login-form h1 { margin-bottom: 20px; } .login-form label { display: block; text-align: left; margin-bottom: 8px; font-weight: bold; } .login-form input { width: 100%; padding: 10px; margin-bottom: 20px; border: 1px solid #ccc; border-radius: 5px; box-sizing: border-box; } .login-form input:focus { border-color: #007bff; outline: none; box-shadow: 0 0 5px rgba(0, 123, 255, 0.5); } .login-form button { width: 100%; padding: 10px; background-color: #007bff; border: none; color: #fff; font-size: 16px; cursor: pointer; border-radius: 5px; } .login-form button:hover { background-color: #0056b3; } .login-form .error { color: red; margin-bottom: 20px; } Update the render method of the Login component to use styles. Also, import the style file at the top of it. Below is the updated Login component.
JavaScript // src/components/Login.js import React, { useState } from 'react'; import './Login.css'; const Login = ({ onLogin }) => { const [username, setUsername] = useState(''); const [password, setPassword] = useState(''); const [error, setError] = useState(''); const [isLoggedIn, setIsLoggedIn] = useState(false); const handleSubmit = (e) => { e.preventDefault(); if (!username || !password) { setError('Please fill in all fields'); setIsLoggedIn(false); } else { setError(''); setIsLoggedIn(true); onLogin(username); } }; return ( <div className="login-container"> {!isLoggedIn && ( <div className="login-form"> <h1>Login</h1> <form onSubmit={handleSubmit}> <div> <label htmlFor="username">Username</label> <input type="text" id="username" value={username} onChange={(e) => setUsername(e.target.value)} /> </div> <div> <label htmlFor="password">Password</label> <input type="password" id="password" value={password} onChange={(e) => setPassword(e.target.value)} /> </div> <button type="submit">Login</button> </form> {error && <p className="error">{error}</p>} </div> )} {isLoggedIn && <h1>Welcome John Doe</h1>} </div> ); }; export default Login; Ensure all test cases still pass with the output of the npm test. Start the app again with npm start — now our app should look like the below: Future Enhancements We have reached the objective for this article but your journey doesn’t need to stop here. I suggest doing further enhancements to the project and continue practicing TDD. Below are a few sample enhancements you can pursue: Advanced validation: Implement more robust validation rules for username and password fields, such as password strength checks or email format validation. Code coverage analysis: Integrate a code coverage tool (like Istanbul) into the testing workflow. This will provide insights into the percentage of code covered by unit tests, and help identify untested code lines and features. Continuous Integration (CI): Set up a CI pipeline (using tools like Jenkins or GitHub Actions) to automatically run tests and generate code coverage reports whenever changes are pushed to the repository. Conclusion In this guide, we've walked through building a React Login page using Test-Driven Development (TDD) step by step. By starting with tests and following the red-green-refactor cycle, we created a solid, well-tested component. TDD might take some getting used to, but the benefits in terms of quality and maintainability are substantial. Embracing TDD will equip you to tackle complex projects with greater confidence.
The development of the .NET platform and C# language moves forward with the launch of .NET 9 and C# 13, introducing a range of enhancements and advancements to boost developer efficiency, speed, and safety. This article delves into upgrades and new features in these releases giving developers a detailed look. Figure courtesy of Microsoft .NET 9 .NET 9 introduces a range of improvements, to the .NET ecosystem with a strong focus on AI and building cloud-native distributed applications by releasing .NET Aspire, boosting performance and enhancements to .NET libraries and frameworks. Here are some notable highlights: .NET Aspire It's an opinionated stack that helps in developing .NET cloud-native applications and services. I recently wrote and published an article related to this on DZone. Performance Improvements .NET 9 is focused on optimizing cloud-native apps, and performance is a key aspect of this optimization. Several performance-related improvements have been made in .NET 9, including: 1. Faster Exceptions Exceptions are now 2-4x faster in .NET 9, thanks to a more modern implementation. This improvement means that your app will spend less time handling exceptions, allowing it to focus on its core functionality. 2. Faster Loops Loop performance has been improved in .NET 9 through loop hoisting and induction variable widening. These optimizations allow loops to run faster and more efficiently, making your app more responsive. 3. Dynamic PGO Improvements Dynamic PGO (Profile-Guided Optimization) has been improved in .NET 9, reducing the cost of type checks. This means that your app will run faster and more efficiently, with less overhead from type checks. 4. RyuJIT Improvements RyuJIT, the .NET Just-In-Time compiler, has been improved in .NET 9 to inline more generic methods. This means that your app will run faster, with less overhead from method calls. 5. Arm64 Code Optimizations Arm64 code can now be written to be much faster using SVE/SVE2 SIMD instructions on Arm64. This means that your app can take advantage of the latest Arm64 hardware, running faster and more efficiently. 6. Server GC Mode The new server GC mode in .NET 9 has been shown to reduce memory usage by up to 2/3 in some benchmarks. This means that your app will use less memory, reducing costs and improving performance. These performance-related improvements in .NET 9 mean that your app will run faster, leaner, and more efficiently. Whether you're building a cloud-native app or a desktop app, .NET 9 has the performance optimizations you need to succeed. AI-Related Improvements These AI-related improvements in .NET enable developers to build powerful applications with AI capabilities, integrate with the AI ecosystem, and monitor and observe AI app performance. Multiple partnerships include Qdrant, Milvus, Weaviate, and more to expand the .NET AI ecosystem. It is easy to integrate with Semantic Kernel, Azure SQL, and Azure AI search. 
Feature | Improvement | Benefit
Tensor<T> | New type for tensors | Effective data handling and information flow for learning and prediction
Smart Components | Prebuilt controls with end-to-end AI features | Infuse apps with AI capabilities in minutes
OpenAI SDK | Official .NET library for OpenAI | Delightful experience and parity with other programming languages
Monitoring and Observing | Features for monitoring and tracing AI apps | Reliable, performant, and high-quality outcomes
Note: There is some integration work within the .NET Aspire team, Semantic Kernel, and Azure to utilize the .NET Aspire dashboard to collect and track metrics. Web-Related Improvements
Improved performance, security, and reliability
Upgrades to existing ASP.NET Core features for modern cloud-native apps
Built-in support for OpenAPI document generation
Ability to generate OpenAPI documents at build-time or runtime
Customizable OpenAPI documents using document and operation transformers
These improvements aim to enhance the web development experience with .NET and ASP.NET Core, making it easier to build modern web apps with improved quality and fundamentals. Caching Improvements With HybridCache As one of my favorites, I will explain this more in-depth, along with code samples, in a separate article about HybridCache. In short, the HybridCache API in ASP.NET Core is upgraded to provide a more efficient and scalable caching solution. It introduces a multi-tier storage approach, combining in-process (L1) and out-of-process (L2) caches, with features like "stampede" protection and configurable serialization. This results in significantly faster performance, with up to 1000x improvement in high cache hit rate scenarios. C# 13: Introducing New Language Features C# 13 brings forth a range of language elements aimed at enhancing code clarity, maintainability, and developer efficiency. Here are some key additions: params collections: The params keyword is no longer restricted to just array types. It can now be used with any recognized collection type, including System.Span<T>, System.ReadOnlySpan<T>, and types that implement System.Collections.Generic.IEnumerable<T> and have an Add method. This provides greater flexibility when working with methods that need to accept a variable number of arguments. In the code snippet below, the PrintNumbers method accepts a params parameter of type List<int>[], which means you can pass any number of List<int> arguments to the method. C# public void PrintNumbers(params List<int>[] numbersLists) { foreach (var numbers in numbersLists) { foreach (var number in numbers) { Console.WriteLine(number); } } } PrintNumbers(new List<int> {1, 2, 3}, new List<int> {4, 5, 6}, new List<int> {7, 8, 9}); New lock object: System.Threading.Lock has been introduced to provide better thread synchronization through its API. New escape sequence: You can use \e as a character literal escape sequence for the ESCAPE character, Unicode U+001B. Method group natural type improvements: This feature makes small optimizations to overload resolution involving method groups. Implicit indexer access in object initializers: The ^ operator allows us to use an indexer directly within an object initializer. Conclusion C# 13 and .NET 9 mark a crucial step towards the advancement of C# programming and the .NET environment. The latest release brings a host of new features and improvements that enhance developer productivity, application performance, and security.
By staying up-to-date with these changes, developers can leverage these advancements to build more robust, efficient, and secure applications. Happy coding!
Organizations heavily rely on data analysis and automation to drive operational efficiency. In this piece, we will look into the basics of data analysis and automation with examples done in Python which is a high-level programming language used for general-purpose programming. What Is Data Analysis? Data analysis refers to the process of inspecting, cleaning, transforming, and modeling data so as to identify useful information, draw conclusions, and support decision-making. It is an essential activity that helps in transforming raw data into actionable insights. The following are key steps involved in data analysis: Collecting: Gathering data from different sources. Cleaning: Removing or correcting inaccuracies and inconsistencies contained in the collected dataset. Transformation: Converting the collected dataset into a format that is suitable for further analysis. Modeling: Applying statistical or machine learning models on the transformed dataset. Visualization: Representing the findings visually by creating charts, and graphs among others using suitable tools such as MS Excel or Python's matplotlib library. The Significance of Data Automation Data automation involves the use of technology to execute repetitive tasks associated with handling large datasets with minimal human intervention required. Automating these processes can greatly improve their efficiency thereby saving time for analysts who can then focus more on complex duties. Some common areas where it’s employed include: Data ingestion: Automatically collecting and storing data from various sources. Data cleaning and transformation: Using scripts or tools (e.g., Python Pandas library) for preprocessing the collected dataset before performing other operations on it like modeling or visualization. Report generation: Creating automated reports or dashboards that update themselves whenever new records arrive at our system etcetera. Data integration: Combining information obtained from multiple sources so as to get a holistic view when analyzing it further down during the decision-making process. Introduction to Python for Data Analysis Python is a widely used programming language for data analysis due to its simplicity, readability, and vast libraries available for statistical computing. Here are some simple examples that demonstrate how one can read large datasets as well as perform basic analysis using Python: Reading Large Datasets Reading datasets into your environment is one of the initial stages in any data analysis project. For this case, we will need the Pandas library which provides powerful data manipulation and analysis tools. Python import pandas as pd # Define the file path to the large dataset file_path = 'path/to/large_dataset.csv' # Specify the chunk size (number of rows per chunk) chunk_size = 100000 # Initialize an empty list to store the results results = [] # Iterate over the dataset in chunks for chunk in pd.read_csv(file_path, chunksize=chunk_size): # Perform basic analysis on each chunk # Example: Calculate the mean of a specific column chunk_mean = chunk['column_name'].mean() results.append(chunk_mean) # Calculate the overall mean from the results of each chunk overall_mean = sum(results) / len(results) print(f'Overall mean of column_name: {overall_mean}') Basic Data Analysis Once you have loaded the data, it is important to conduct some preliminary examination on it so as to familiarize yourself with its contents. 
Performing Aggregated Analysis

At times you may wish to perform a more advanced aggregated analysis over the entire dataset. For instance, suppose we want to find the sum of a particular column across the whole dataset while processing it in chunks.

Python

# Initialize a variable to store the cumulative sum
cumulative_sum = 0

# Iterate over the dataset in chunks
for chunk in pd.read_csv(file_path, chunksize=chunk_size):
    # Calculate the sum of the specific column for the current chunk
    chunk_sum = chunk['column_name'].sum()
    cumulative_sum += chunk_sum

print(f'Cumulative sum of column_name: {cumulative_sum}')

Missing Values Treatment in Chunks

Missing values are common during data preprocessing. Here is an example in which missing values are filled with the mean of each chunk.

Python

# Initialize an empty list to store processed chunks
processed_chunks = []

# Iterate over the dataset in chunks
for chunk in pd.read_csv(file_path, chunksize=chunk_size):
    # Fill missing values with the mean of the chunk
    # (numeric_only=True avoids errors when the chunk also contains non-numeric columns)
    chunk = chunk.fillna(chunk.mean(numeric_only=True))
    processed_chunks.append(chunk)

# Concatenate all processed chunks into a single DataFrame
processed_data = pd.concat(processed_chunks, axis=0)
print(processed_data.head())

Final Statistics From Chunks

Sometimes you need overall statistics across all chunks. This example shows how to compute the mean and standard deviation of an entire column by aggregating the results from each chunk.

Python

import numpy as np

# Initialize variables to store the cumulative sum, count, and sum of squares
cumulative_sum = 0
cumulative_count = 0
squared_sum = 0

# Iterate over the dataset in chunks
for chunk in pd.read_csv(file_path, chunksize=chunk_size):
    # Calculate the sum, count, and sum of squares for the current chunk
    chunk_sum = chunk['column_name'].sum()
    chunk_count = chunk['column_name'].count()
    chunk_squared_sum = (chunk['column_name'] ** 2).sum()

    cumulative_sum += chunk_sum
    cumulative_count += chunk_count
    squared_sum += chunk_squared_sum

# Calculate the overall (population) mean and standard deviation
overall_mean = cumulative_sum / cumulative_count
overall_std = np.sqrt((squared_sum / cumulative_count) - (overall_mean ** 2))

print(f'Overall mean of column_name: {overall_mean}')
print(f'Overall standard deviation of column_name: {overall_std}')

Conclusion

Reading large datasets in chunks with Python enables efficient processing and analysis without overwhelming system memory. By taking advantage of Pandas' chunking functionality, a wide range of analytics tasks can be performed on large datasets in a scalable, efficient way. The examples above show how to read large datasets in portions, handle missing values, and perform aggregated analysis, providing a strong foundation for working with large volumes of data in Python.
"Is this feature needed for MVP? Why do we need more budget for our MVP? Why didn't users mention this requirement during MVP definition? Why can't we deliver the MVP faster?" If any of these questions sound familiar, keep reading. If you've ever been part of an Agile team or involved in technology development, you've likely encountered the term "MVP," or Minimum Viable Product. Despite its seemingly straightforward definition, the concept of MVP often leads to confusion and misapplication. Misunderstanding MVP can cause product failures, as teams may incorrectly prioritize "minimum" over "viable." This article aims to demystify MVP and provide clarity on its true meaning and application in product development. Many product development teams, including product managers, owners, developers, and UI/UX researchers, often overemphasize the "minimum" aspect of MVP. It's frequently mistaken for a first release, a functional prototype, or merely a tool for user feedback. Often, MVP is defined based on the available budget or teams’ available capacity. Defining MVP: Breaking Down the Terms Minimum According to Merriam-Webster, "minimum" means the least quantity assignable, admissible, or possible. In product development, this translates to the essential features required to deliver a viable product. "Minimum" here is a qualifier for the more crucial elements: Viable and Product. Viable "Viable" is defined as capable of existence and development as an independent unit. For a product to be viable, it must exist independently and sustain itself in the market. It should not require subsequent developments to meet viability. Economically, a viable product must be sustained indefinitely if market conditions remain stable. Product A "product" literally means something useful or valued. Economists describe it as something with utility. Essentially, a product is a solution, tool, application, or service that meets user needs. It must provide tangible or perceived value to the customer. Breaking Down the Key Attributes of MVP Customer Utility Test An MVP must exist as a product that meets customer utility. A feature without customer value cannot be an MVP. Minimum and viable are qualifiers, but the core lies in MVP being a product first. The priority should be Product > Viable > Minimum. The output should be a product that is viable and meets minimum user requirements. Viability Is Crucial An MVP should be viable, meaning the product shouldn't need additional features for its basic existence. Economic Perspective MVP is an economic concept. It’s defined by customer needs and market trends, not by the budget. The development budget influences iteration cycles but not the MVP itself. MVP and Budget MVP is defined by the list of viable and minimum features to create a product. It requires a fixed budget. The budget determines how far beyond you can take the product ahead of MVP, but there is a minimum budget needed for an MVP, governed by required features. MVP Is Not a Shortcut MVP isn't about cutting corners; it's about delivering a functional and valuable product. How To Achieve MVP In Agile development, reaching an MVP involves iterative product development cycles. It requires a deep understanding of user needs, built through various stages: product design, functional prototyping with actual testers, and iterative development cycles leading to the MVP. Putting MVP in Perspective Let us understand MVP with an example of a car as a product. 
A car is a product that meets customers' utility for transportation, convenience, and luxury. A car you build must work and have baseline features such as tires, a chassis, seats, power, and a steering mechanism. For the car to be viable, the customer should be able to use it for its entire lifetime, say 15 years, without needing hardware additions. At the MVP stage, air conditioning would be considered a required minimum feature, as customers generally expect it. A sunroof, on the other hand, wouldn't meet the MVP bar because it isn't generally required. However, if a sunroof is a key product distinction for your car, it could become part of the minimum requirement.

In the context of technology development, an MVP must be a fully usable application that can be put in the hands of general users. It should have all the features that are generally expected of that type of application. For example, if you are building a website, a mobile-friendly version is a must-have feature for the MVP, but a desktop application isn't.

Conclusion
The MVP definition isn't tricky if the right economic framework is applied. Answer a simple question: can the product be sustained in the hands of users without any further intervention? If the answer is yes, you have your MVP. The MVP isn't an end state but a stable starting point; it will need updates as customer needs and market conditions evolve. Understanding and defining the MVP correctly helps set the right expectations for development teams, project sponsors, and customers. Challenge your product managers to define the MVP accurately, and ensure the team knows what stage the product is in.