When developing a product, issues inevitably arise that can impact both its performance and stability. Slow system response times, error rate increases, bugs, and failed updates can all damage the reputation and efficiency of your project. However, before addressing these problems, it is essential to gather and analyze statistics on their occurrence. This data will help you make informed decisions regarding refactoring, optimization, and error-fixing strategies.

Step 1: Performance Analysis

Performance is a crucial metric that directly affects user experience. To improve it, the first step is to regularly track its key indicators:

Monitor server response time. Measure and track response time variations based on time of day, server load, or system changes.
Track memory and resource consumption. Regular monitoring helps identify issues early. By analyzing this data, you can assess the quality of releases and patches, detect memory leaks, and plan for hardware upgrades or refactoring.
Analyze SQL queries. Gather statistics on the slowest queries and their frequency.

There are numerous ways to collect these data points. Comprehensive Application Performance Monitoring (APM) systems, such as New Relic, Datadog, Dynatrace, AppSignal, and Elastic APM, provide deep insights into your applications. These tools help identify performance bottlenecks, troubleshoot issues, and optimize services by profiling applications and tracking performance at specific code segments. However, these solutions are often paid services; they can be complex to configure or simply overkill, especially for smaller teams.

Datadog web interface

A lighter-weight starting point for analyzing SQL queries is the database's own slow query log. In MySQL, for example:

Plain Text
slow_query_log = 1                                   # Enables logging
long_query_time = 20                                 # Defines slow query threshold (seconds)
slow_query_log_file = /var/log/mysql/slow-query.log  # Log file location
log-queries-not-using-indexes = 1                    # Log queries without indexes

You can then view the log using:

Shell
tail -f /var/log/mysql/slow-query.log

Or:

Shell
mysqldumpslow /var/log/mysql/slow-query.log

Then, you can analyze the query plan in the conventional way using tools like EXPLAIN (EXPLAIN ANALYZE), etc. For a better understanding of EXPLAIN results, you can use services that visualize problem areas, such as https://explain.dalibo.com or https://explain.depesz.com.

explain.dalibo query plan visualization

For real-time server monitoring, Zabbix collects data on memory, CPU, disk usage, network activity, and other critical resources. It can be easily deployed on any operating system and supports push data collection models, auto-registration of agents, custom alerts, and data visualization.

Zabbix web interface

Another powerful alternative is the Grafana + Prometheus combination. Prometheus collects metrics from servers, applications, databases, and other sources using exporters, stores these metrics in its database, and provides access via the powerful PromQL query language. Grafana connects to Prometheus (and other data sources), allowing the creation of graphs, dashboards, alerts, and reports with an intuitive interface for visualization and filtering. Notably, there are already hundreds of pre-built Prometheus exporters for metric collection, such as node_exporter, mysql_exporter, and nginx_exporter.

Grafana dashboard

Step 2: Debugging Project Issues

Bugs are inevitable, so it is essential not just to fix them, but also to properly track and analyze their causes and resolution times.
Every new bug or defect should be logged in a task management system, with a corresponding ticket for resolution. This allows for:

Correlating bug frequency with product version releases.
Measuring bug resolution time.
Evaluating debugging efficiency for future planning.

If you use Jira, the Jira Dashboards feature provides filtered statistics using JQL (Jira Query Language). The Created vs Resolved Chart offers a clear visualization of bug trends. But before you can analyze fixing times, you should first set up tools for error aggregation and prioritization. The ELK Stack (Elasticsearch, Logstash, Kibana) is a popular standard, allowing log collection from multiple sources and storing them in Elasticsearch for deep analysis with Kibana.

Kibana web interface

However, the ELK Stack is not the only solution. You can also use Grafana Loki. Loki is easier to configure, integrates seamlessly with Grafana for visualization, and is ideal for projects that require a lightweight and user-friendly log management solution.

A great approach is to set up error notifications. For example, if a previously unseen error occurs or the number of such errors exceeds a set threshold, the system can notify developers. In some teams, a ticket is automatically created in a task tracker for further investigation and resolution. This helps reduce response time for critical bugs and ensures project stability, especially during frequent releases and updates.

Another popular error-tracking tool worth mentioning is Sentry. It easily integrates with any application or web server, allowing log collection from various sources for in-depth analysis. Key features include:

Tracking error occurrences and their frequency.
Configurable alerts based on specific rules (e.g., sending notifications to a messenger or email).
Flexible integrations with task management systems (e.g., automatic bug task creation in Jira).

Sentry web interface

APM systems such as Datadog or New Relic (mentioned earlier) also provide tools for error analysis. If you're already using an APM solution, it might be a suitable choice for your needs.

Finally, user feedback should not be overlooked. Users may report issues that automated systems fail to detect but significantly impact their experience. Since most systems are developed for users, collecting and analyzing their feedback is an invaluable data source that should never be ignored.

Step 3: Collecting Product Metrics

During both the development and usage stages, issues don't always manifest directly as bugs or errors. Sometimes, they appear through changes in product metrics. For example, a minor bug or hidden error might lead to a drop in sales, reduced user session duration, or an increase in bounce rates. Such changes can go unnoticed if product metrics are not actively monitored. This is why collecting and tracking product metrics is a crucial part of any project. Metrics help detect problems before they result in significant financial losses and serve as an early warning system for necessary analysis, changes, or optimizations. The specific product metrics to track will vary depending on the type of project, but some are common across industries.
These are key examples:

User Engagement Metrics
Average time spent on the website or in the app
Number of active users (DAU – Daily Active Users, MAU – Monthly Active Users)
Retention rate – how often users return

Financial Metrics
Number of sales or subscriptions
Average revenue per user (ARPU)
Conversion rate – the percentage of users who complete a target action

User Acquisition Metrics
Advertising campaign effectiveness
Bounce rate – percentage of users who leave without interaction
Conversion rates from different traffic sources (SEO, social media, email marketing)

Each metric should be aligned with business goals. For example, an e-commerce store prioritizes purchase conversion rates, while a media platform focuses on average content watch time.

Context Matters

When analyzing any metrics (whether technical or product-related), always take external factors into account. Weekends, holidays, marketing campaigns, and seasonal activity spikes all influence system performance and the statistics you collect. Compare data across different time frames: year-over-year, week-over-week, or day-to-day. If your project operates internationally, consider regional differences – local holidays, cultural variations, and user habits can significantly impact results. The effectiveness of changes can vary greatly depending on the audience. For example, performance improvements in one region may have little impact on metrics in another.

Conclusion

Almost no serious issue can be identified without collecting large amounts of data. Regular monitoring, careful analysis, and consideration of context will help your product grow and evolve under any circumstances. However, keep in mind that collecting excessive data can hinder your analysis rather than help. Focus on gathering only the most relevant metrics and indicators for your specific project.
In the digital age, the ability to find relevant information quickly and accurately has become increasingly critical. From simple web searches to complex enterprise knowledge management systems, search technology has evolved dramatically to meet growing demands. This article explores the journey from index-based basic search engines to retrieval-based generation, examining how modern techniques are revolutionizing information access.

The Foundation: Traditional Search Systems

Traditional search systems were built on relatively simple principles: matching keywords and ranking results based on relevance, user signals, frequency, positioning, and more. While effective for basic queries, these systems faced significant limitations. They struggled with understanding context, handling complex multi-part queries, resolving indirect references, performing nuanced reasoning, and providing user-specific personalization. These limitations became particularly apparent in enterprise settings, where information retrieval needs to be both precise and comprehensive.

Python
from collections import defaultdict
import math

class BasicSearchEngine:
    def __init__(self):
        self.index = defaultdict(list)
        self.document_freq = defaultdict(int)
        self.total_docs = 0

    def add_document(self, doc_id, content):
        # Simple tokenization
        terms = content.lower().split()

        # Build inverted index
        for position, term in enumerate(terms):
            self.index[term].append((doc_id, position))

        # Update document frequencies
        unique_terms = set(terms)
        for term in unique_terms:
            self.document_freq[term] += 1

        self.total_docs += 1

    def search(self, query):
        terms = query.lower().split()
        scores = defaultdict(float)

        for term in terms:
            if term in self.index:
                idf = math.log(self.total_docs / self.document_freq[term])
                for doc_id, position in self.index[term]:
                    tf = 1  # Simple TF scoring
                    scores[doc_id] += tf * idf

        return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Usage example
search_engine = BasicSearchEngine()
search_engine.add_document("doc1", "Traditional search systems use keywords")
search_engine.add_document("doc2", "Modern systems employ advanced techniques")
results = search_engine.search("search systems")

Enterprise Search: Bridging the Gap

Enterprise search introduced new complexities and requirements that consumer search engines weren't designed to handle. Organizations needed systems that could search across diverse data sources, respect complex access controls, understand domain-specific terminology, and maintain context across different document types. These challenges drove the development of more sophisticated retrieval techniques, setting the stage for the next evolution in search technology.

The Paradigm Shift: From Document Retrieval to Answer Generation

The landscape of information access underwent a dramatic transformation in early 2023 with the widespread adoption of large language models (LLMs) and the emergence of retrieval-augmented generation (RAG). Traditional search systems, which primarily focused on returning relevant documents, were no longer sufficient. Instead, organizations needed systems that could not only find relevant information but also provide it in a format that LLMs could effectively use to generate accurate, contextual responses.
This shift was driven by several key developments:

The emergence of powerful embedding models that could capture semantic meaning more effectively than keyword-based approaches
The development of efficient vector databases that could store and query these embeddings at scale
The recognition that LLMs, while powerful, needed accurate and relevant context to provide reliable responses

The traditional retrieval problem thus evolved into an intelligent, contextual answer generation problem, where the goal wasn't just to find relevant documents, but to identify and extract the most pertinent pieces of information that could be used to augment LLM prompts. This new paradigm required rethinking how we chunk, store, and retrieve information, leading to the development of more sophisticated ingestion and retrieval techniques.

Python
import numpy as np
from transformers import AutoTokenizer, AutoModel
import torch

class ModernRetrievalSystem:
    def __init__(self, model_name="sentence-transformers/all-MiniLM-L6-v2"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name)
        self.document_store = {}

    def _get_embedding(self, text: str) -> np.ndarray:
        """Generate embedding for a text snippet"""
        inputs = self.tokenizer(text, return_tensors="pt", max_length=512,
                                truncation=True, padding=True)
        with torch.no_grad():
            outputs = self.model(**inputs)
        embedding = outputs.last_hidden_state[:, 0, :].numpy()
        return embedding[0]

    def chunk_document(self, text: str, chunk_size: int = 512) -> list:
        """Implement late chunking strategy"""
        # Get document-level embedding first
        doc_embedding = self._get_embedding(text)

        # Chunk the document
        words = text.split()
        chunks = []
        current_chunk = []
        current_length = 0

        for word in words:
            word_length = len(self.tokenizer.encode(word))
            if current_length + word_length > chunk_size:
                chunks.append(" ".join(current_chunk))
                current_chunk = [word]
                current_length = word_length
            else:
                current_chunk.append(word)
                current_length += word_length

        if current_chunk:
            chunks.append(" ".join(current_chunk))

        return chunks

    def add_document(self, doc_id: str, content: str):
        """Process and store document with context-aware chunking"""
        chunks = self.chunk_document(content)

        for i, chunk in enumerate(chunks):
            context = f"Document: {doc_id}, Chunk: {i+1}/{len(chunks)}"
            enriched_chunk = f"{context}\n\n{chunk}"
            embedding = self._get_embedding(enriched_chunk)

            self.document_store[f"{doc_id}_chunk_{i}"] = {
                "content": chunk,
                "context": context,
                "embedding": embedding
            }

The Rise of Modern Retrieval Systems

An Overview of Modern Retrieval Using Embedding Models

Modern retrieval systems employ a two-phase approach to efficiently access relevant information. During the ingestion phase, documents are intelligently split into meaningful chunks, which preserve context and document structure. These chunks are then transformed into high-dimensional vector representations (embeddings) using neural models and stored in specialized vector databases. During retrieval, the system converts the user's query into an embedding using the same neural model and then searches the vector database for chunks whose embeddings have the highest cosine similarity to the query embedding. This similarity-based approach allows the system to find semantically relevant content even when exact keyword matches aren't present, making retrieval more robust and context-aware than traditional search methods.
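The listing above covers only the ingestion phase. As a minimal sketch of the retrieval phase just described, and assuming the ModernRetrievalSystem class shown earlier, a search method could embed the query with the same model and rank stored chunks by cosine similarity. The method name and the top_k parameter are illustrative, not part of any library API:

Python
import numpy as np

def search(self, query: str, top_k: int = 5) -> list:
    """Rank stored chunks by cosine similarity to the query embedding."""
    query_embedding = self._get_embedding(query)

    scored = []
    for chunk_id, record in self.document_store.items():
        chunk_embedding = record["embedding"]
        # Cosine similarity between the query vector and the stored chunk vector
        similarity = float(
            np.dot(query_embedding, chunk_embedding)
            / (np.linalg.norm(query_embedding) * np.linalg.norm(chunk_embedding) + 1e-10)
        )
        scored.append((chunk_id, similarity))

    # Highest-similarity chunks first
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Attach the sketch to the class from the previous listing
ModernRetrievalSystem.search = search

This also gives the RecursiveRetriever shown later in this article a base_retriever.search method to call.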
At the heart of these modern systems lies the critical process of document chunking and retrieval from embeddings, which has evolved significantly over time.

Evolution of Document Ingestion

The foundation of modern retrieval systems starts with document chunking — breaking down large documents into manageable pieces. This critical process has evolved from basic approaches to more sophisticated techniques:

Traditional Chunking

Document chunking began with two fundamental approaches:

Fixed-size chunking. Documents are split into chunks of exactly specified token length (e.g., 256 or 512 tokens), with configurable overlap between consecutive chunks to maintain context. This straightforward approach ensures consistent chunk sizes but may break natural textual units.
Semantic chunking. A more sophisticated approach that respects natural language boundaries while maintaining approximate chunk sizes. This method analyzes the semantic coherence between sentences and paragraphs to create more meaningful chunks.

Drawbacks of Traditional Chunking

Consider an academic research paper split into 512-token chunks. The abstract might be split midway into two chunks, disconnecting the context of its introduction and conclusions. A retrieval model would struggle to identify the abstract as a cohesive unit, potentially missing the paper's central theme. In contrast, semantic chunking may keep the abstract intact but might struggle with other sections, such as cross-referencing between the discussion and conclusion. These sections might end up in separate chunks, and the links between them could still be missed.

Late Chunking: A Revolutionary Approach

Legal documents, such as contracts, frequently contain references to clauses defined in other sections. Consider a 50-page employment contract where Section 2 states, "The Employee shall be subject to the non-compete obligations detailed in Schedule A," while Schedule A, appearing 40 pages later, contains the actual restrictions, such as "may not work for competing firms within 100 miles." If someone searches for "what are the non-compete restrictions?", traditional chunking that processes sections separately would likely miss this connection — the chunk with Section 2 lacks the actual restrictions, while the Schedule A chunk lacks the context that these are employee obligations. Traditional chunking methods would likely split these references across chunks, making it difficult for retrieval models to maintain context. Late chunking, by embedding the entire document first, captures these cross-references seamlessly, enabling precise extraction of relevant clauses during a legal search. Late chunking represents a significant advancement in how we process documents for retrieval.
Unlike traditional methods that chunk documents before processing, late chunking:

First processes the entire document through a long-context embedding model
Creates embeddings that capture the full document context
Only then applies chunking boundaries to create final chunk representations

This approach offers several advantages:

Preserves long-range dependencies between different parts of the document
Maintains context across chunk boundaries
Improves handling of references and contextual elements

Late chunking is particularly effective when combined with reranking strategies, where it has been shown to reduce retrieval failure rates by up to 49%.

Contextual Enablement: Adding Intelligence to Chunks

Consider a 30-page annual financial report where critical information is distributed across different sections. The Executive Summary might mention "ACMECorp achieved significant growth in the APAC region," while the Regional Performance section states, "Revenue grew by 45% year-over-year," the Risk Factors section notes, "Currency fluctuations impacted reported earnings," and the Footnotes clarify, "All APAC growth figures are reported in constant currency, excluding the acquisition of TechFirst Ltd."

Now, imagine a query like "What was ACME's organic revenue growth in APAC?" A basic chunking system might return just the "45% year-over-year" chunk because it matches "revenue" and "growth." However, this would be misleading, as it fails to capture critical context spread across the document: that this growth number includes an acquisition, that currency adjustments were made, and that the number is specifically for APAC. A single chunk in isolation could lead to incorrect conclusions or decisions — someone might cite the 45% as organic growth in investor presentations when, in reality, a significant portion came from M&A activity.

One of the major limitations of basic chunking is this loss of context. Contextual enablement aims to solve the context problem by adding relevant context to each chunk before processing. The process works by:

Analyzing the original document to understand the broader context
Generating concise, chunk-specific context (typically 50-100 tokens)
Prepending this context to each chunk before creating embeddings
Using both semantic embeddings and lexical matching (BM25) for retrieval

This technique has shown impressive results, reducing retrieval failure rates by up to 49% in some implementations.

Evolution of Retrieval

Retrieval methods have seen dramatic advancement from simple keyword matching to today's sophisticated neural approaches. Early systems like BM25 relied on statistical term-frequency methods, matching query terms to documents based on word overlap and importance weights. The rise of deep learning brought dense retrieval methods like DPR (Dense Passage Retriever), which could capture semantic relationships by encoding both queries and documents into vector spaces. This enabled matching based on meaning rather than just lexical overlap. More recent innovations have pushed retrieval capabilities further. Hybrid approaches combining sparse (BM25) and dense retrievers help capture both exact matches and semantic similarity. The introduction of cross-encoders allowed for more nuanced relevance scoring by analyzing query-document pairs together rather than independently. With the emergence of large language models, retrieval systems gained the ability to understand and reason about content in increasingly sophisticated ways.
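To make the hybrid idea above concrete, here is a small, self-contained sketch that blends a keyword (sparse) score with an embedding (dense) cosine score. The alpha weight and the toy scoring functions are illustrative assumptions rather than any particular library's API; a production system would use a real BM25 implementation and learned embeddings instead:

Python
import numpy as np

def keyword_score(query: str, document: str) -> float:
    """Toy sparse signal: fraction of query terms that appear in the document."""
    query_terms = set(query.lower().split())
    doc_terms = set(document.lower().split())
    return len(query_terms & doc_terms) / max(len(query_terms), 1)

def dense_score(query_vec: np.ndarray, doc_vec: np.ndarray) -> float:
    """Dense signal: cosine similarity between precomputed embeddings."""
    return float(np.dot(query_vec, doc_vec) /
                 (np.linalg.norm(query_vec) * np.linalg.norm(doc_vec) + 1e-10))

def hybrid_score(query: str, document: str,
                 query_vec: np.ndarray, doc_vec: np.ndarray,
                 alpha: float = 0.5) -> float:
    """Blend sparse and dense evidence; alpha controls the balance between them."""
    return alpha * keyword_score(query, document) + (1 - alpha) * dense_score(query_vec, doc_vec)

# Usage sketch with random stand-in embeddings
query_vec, doc_vec = np.random.rand(384), np.random.rand(384)
print(hybrid_score("late chunking retrieval",
                   "late chunking improves retrieval quality",
                   query_vec, doc_vec))

A cross-encoder or other reranker would then typically be applied on top of the candidates that such a hybrid scorer surfaces.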
Recursive Retrieval: Understanding Relationships

Recursive retrieval advances the concept further by exploring relationships between different pieces of content. Instead of treating each chunk as an independent unit, it recognizes that chunks often have meaningful relationships with other chunks or structured data sources. Consider a real-world example of a developer searching for help with a memory leak in a Node.js application:

1. Initial Query
"Memory leak in Express.js server handling file uploads."
The system first retrieves high-level bug report summaries with similar symptoms.
A matching bug summary describes: "Memory usage grows continuously when processing multiple file uploads."

2. First Level Recursion
From this summary, the system follows relationships to:
Detailed error logs showing memory patterns
Similar bug reports with memory profiling data
Discussion threads about file upload memory management

3. Second Level Recursion
Following the technical discussions, the system retrieves:
Code snippets showing proper stream handling in file uploads
Memory leak fixes in similar scenarios
Relevant middleware configurations

4. Final Level Recursion
For implementation, it retrieves:
Actual code commit diffs that fixed similar issues
Unit tests validating the fixes
Performance benchmarks before and after the fixes

At each level, the retrieval becomes more specific and technical, following the natural progression from problem description to solution implementation. This layered approach helps developers not only find solutions but also understand the underlying causes and verification methods. This example demonstrates how recursive retrieval can create a comprehensive view of a problem and its solution by traversing relationships between different types of content.

Other applications, such as a technical documentation system, might include:

A high-level overview chunk linking to detailed implementation chunks
A summary chunk referencing an underlying database table
A concept explanation connecting to related code examples

During retrieval, the system not only finds the most relevant chunks but also explores these relationships to gather comprehensive context.

A Special Case of Recursive Retrieval

Hierarchical chunking represents a specialized implementation of recursive retrieval, where chunks are organized in a parent-child relationship.
The system maintains multiple levels of chunks:

Parent chunks – larger pieces providing a broader context
Child chunks – smaller, more focused pieces of content

The beauty of this approach lies in its flexibility during retrieval:

Initial searches can target precise child chunks
The system can then "zoom out" to include parent chunks for additional context
Overlap between chunks can be carefully managed at each level

Python
import networkx as nx
from typing import Set, Dict, List

class RecursiveRetriever:
    def __init__(self, base_retriever):
        self.base_retriever = base_retriever
        self.relationship_graph = nx.DiGraph()

    def add_relationship(self, source_id: str, target_id: str, relationship_type: str):
        """Add a relationship between chunks"""
        self.relationship_graph.add_edge(source_id, target_id,
                                         relationship_type=relationship_type)

    def _get_related_documents(self, doc_id: str, visited: Set[str]) -> List[str]:
        # Helper assumed by recursive_search below: follow outgoing
        # relationship edges to chunks that have not been visited yet
        if doc_id not in self.relationship_graph:
            return []
        return [node for node in self.relationship_graph.successors(doc_id)
                if node not in visited]

    def recursive_search(self, query: str, max_depth: int = 2) -> Dict[str, List[str]]:
        """Perform recursive retrieval"""
        results = {}
        visited = set()

        # Get initial results
        initial_results = self.base_retriever.search(query)
        first_level_ids = [doc_id for doc_id, _ in initial_results]
        results["level_0"] = first_level_ids
        visited.update(first_level_ids)

        # Recursively explore relationships
        for depth in range(max_depth):
            current_level_results = []
            # Stop gracefully if the previous level produced no results
            for doc_id in results.get(f"level_{depth}", []):
                related_docs = self._get_related_documents(doc_id, visited)
                current_level_results.extend(related_docs)
                visited.update(related_docs)

            if current_level_results:
                results[f"level_{depth + 1}"] = current_level_results

        return results

# Usage example
retriever = ModernRetrievalSystem()
recursive = RecursiveRetriever(retriever)

# Add relationships
recursive.add_relationship("doc1_chunk_0", "doc2_chunk_0", "related_concept")
results = recursive.recursive_search("modern retrieval techniques")

Putting It All Together: Modern Retrieval Architecture

Modern retrieval systems often combine multiple techniques to achieve optimal results. A typical architecture might:

Use hierarchical chunking to maintain document structure
Apply contextual embeddings to preserve semantic meaning
Implement recursive retrieval to explore relationships
Employ reranking to fine-tune results

This combination can reduce retrieval failure rates by up to 67% compared to basic approaches.

Multi-Modal Retrieval: Beyond Text

As organizations increasingly deal with diverse content types, retrieval systems have evolved to handle multi-modal data effectively. The challenge extends beyond simple text processing to understanding and connecting information across images, audio, and video formats.

The Multi-Modal Challenge

Multi-modal retrieval faces two fundamental challenges:

1. Modality-Specific Complexity
Each type of content presents unique challenges. Images, for instance, can range from simple photographs to complex technical diagrams, each requiring different processing approaches. A chart or graph might contain dense information that requires specialized understanding.

2. Cross-Modal Understanding
Perhaps the most significant challenge is understanding relationships between different modalities. How does an image relate to its surrounding text? How can we connect a technical diagram with its explanation? These relationships are crucial for accurate retrieval.

Solutions and Approaches

Modern systems address these challenges through three main approaches:
1. Unified Embedding Space
Uses models like CLIP to encode all content types in a single vector space
Enables direct comparison between different modalities
Simplifies retrieval but may sacrifice some nuanced understanding

2. Text-Centric Transformation
Converts all content into text representations
Leverages advanced language models for understanding
Works well for text-heavy applications but may lose modality-specific details

3. Hybrid Processing
Maintains specialized processing for each modality
Uses sophisticated reranking to combine results
Achieves better accuracy at the cost of increased complexity

The choice of approach depends heavily on specific use cases and requirements, with many systems employing a combination of techniques to achieve optimal results.

Looking Forward: The Future of Retrieval

As AI and machine learning continue to advance, retrieval systems are becoming increasingly sophisticated. Future developments might include:

More nuanced understanding of document structure and relationships
Better handling of multi-modal content (text, images, video)
Improved context preservation across different types of content
More efficient processing of larger knowledge bases

Conclusion

The evolution from basic retrieval to answer generation systems reflects our growing need for more intelligent information access. Organizations can build more effective knowledge management systems by understanding and implementing techniques like contextual retrieval, recursive retrieval, and hierarchical chunking. As these technologies continue to evolve, we can expect even more sophisticated approaches to emerge, further improving our ability to find and utilize information effectively.
Are you a software developer looking to accelerate your career, enhance your skills, and expand your professional network? If so, contributing to an open-source project in 2025 might be your best decision. Open source is more than just a technical exercise; it's a gateway to learning from industry experts, mastering new technologies, and creating a lasting impact on the developer community. Over the years, one of the most common career-related questions I have encountered is: Why should I participate in an open-source project? With 2025 upon us, this question remains as relevant as ever. In this article, I will explore the reasons for engaging in open source, explain how to get started, and highlight some projects to consider contributing to this year.

Why Participate in an Open-Source Project?

Using Simon Sinek's Golden Circle, let's start with the fundamental question: Why? Participating in an open-source project is one of the best ways to enhance both hard and soft skills as a software engineer. Here's how:

Hard Skills

Learn how to write better code by collaborating with some of the best developers in the industry.
Gain experience with cutting-edge technologies, such as the latest Java versions, Hibernate best practices, and JVM internals.
Expand your knowledge of software design patterns, architecture styles, and problem-solving approaches professionals use worldwide.

Soft Skills

Improve your communication skills in written discussions (PR reviews, documentation) and real-time interactions. Additionally, enhance your verbal communication skills by participating in meetings, discussions, and presentations, which will help you become more confident in explaining technical concepts to a broader audience.
Develop negotiation and persuasion skills when proposing changes or advocating for new features.
Expand your professional network, allowing more people to recognize your contributions and capabilities.

When you contribute to open source, you distinguish yourself from the vast number of software engineers who use the software. Only a tiny percentage build and maintain these projects. A track record of contributions adds credibility to your resume and LinkedIn profile, making you stand out in the job market.

How to Get Started in Open Source

A common misconception is that contributing to open source is complicated or reserved for experts. This is not true. Anyone can start contributing by following a structured approach. Here are five steps to begin your open-source journey:

1. Choose a Project

Select a project that aligns with your interests or career goals. To become a database expert, contribute to an open-source database. If you want to improve your API development skills, work on frameworks related to API design. Since open-source contributions often start as a hobby in your free time, ensure that the project provides valuable learning opportunities and supports your career aspirations.

2. Join the Team Communication Channels

Once you have selected a project, join the community. Open-source projects use various communication channels such as Slack, Discord, mailing lists, or forums. Introduce yourself and observe discussions, pull requests, and issue tracking to understand how the community operates.

3. Read the Documentation

Documentation is the bridge between you, as a contributor, and the project maintainers. Many developers rely on tutorials, blog posts, and YouTube videos, but reading the official documentation gives you a deeper understanding of how the project works.
This also helps you identify documentation gaps that you can improve later.

4. Start with Tests, Documentation, and Refactoring

Before jumping into feature development, focus on tasks that are valuable but often overlooked, such as:

Improving documentation clarity.
Writing tests to increase code coverage.
Refactoring legacy code to align with modern Java features (e.g., replacing Java 5 code with Java 17 constructs like Streams and Lambdas).

These contributions are always welcome, and since they are difficult to reject, they serve as a great entry point into any project.

5. Propose Enhancements and New Features

Once you have built credibility within the project by handling documentation, testing, and refactoring tasks, you can propose enhancements and new features. Many developers start by suggesting new features immediately, but without familiarity with the project's goals and context, such proposals may be disregarded. Establishing yourself first as a reliable contributor makes it easier for your ideas to be accepted and integrated into the project.

Open-Source Projects to Contribute to in 2025

If you are looking for projects to contribute to this year, consider well-established ones under foundations like Eclipse and Apache, as well as other impactful open-source projects:

Jakarta Data – for those interested in Java persistence and data access
Jakarta NoSQL – ideal for developers exploring NoSQL databases with Jakarta EE
Eclipse JNoSQL – a great entry point for those working with NoSQL in Java
Weld – a core implementation of CDI (Contexts and Dependency Injection)
Spring Framework – one of the most widely used frameworks in Java development
Quarkus – a Kubernetes-native Java stack tailored for GraalVM and cloud-native applications
Oracle NoSQL – a high-performance distributed NoSQL database for enterprise applications
MongoDB – a widely used NoSQL document database for modern applications

Conclusion

In this article, I explained why participating in open source is beneficial, how to start contributing, and which projects to consider in 2025. Contrary to popular belief, contributing is not difficult — it simply requires time, discipline, and consistency. I have been contributing to open source for over a decade, and chances are, you are already using some of the projects I have worked on. I hope this guide helps you get started, and I look forward to seeing you on a mailing list or a pull request soon!
DZone events bring together industry leaders, innovators, and peers to explore the latest trends, share insights, and tackle industry challenges. From Virtual Roundtables to Fireside Chats, our events cover a wide range of topics, each tailored to provide you, our DZone audience, with practical knowledge, meaningful discussions, and support for your professional growth.

DZone Events Happening Soon

Below, you'll find upcoming events that you won't want to miss.

Modernizing Enterprise Java Applications: Jakarta EE, Spring Boot, and AI Integration
Date: February 25, 2025
Time: 1:00 PM ET
Register for Free!

Unlock the potential of AI integration in your enterprise Java applications with our upcoming webinar! Join Payara and DZone to explore how to enhance your Spring Boot and Jakarta EE systems using generative AI tools like Spring AI and REST client patterns.

What to Consider When Building an IDP
Date: March 4, 2025
Time: 1:00 PM ET
Register for Free!

Is your development team bogged down by manual tasks and “TicketOps”? Internal Developer Portals (IDPs) streamline onboarding, automate workflows, and enhance productivity — but should you build or buy? Join Harness and DZone for a webinar to explore key IDP capabilities, compare Backstage vs. managed solutions, and learn how to drive adoption while balancing cost and flexibility.

DevOps for Oracle Applications with FlexDeploy: Automation and Compliance Made Easy
Date: March 11, 2025
Time: 1:00 PM ET
Register for Free!

Join Flexagon and DZone as Flexagon's CEO unveils how FlexDeploy is helping organizations future-proof their DevOps strategy for Oracle Applications and Infrastructure. Explore innovations for automation through compliance, along with real-world success stories from companies that have adopted FlexDeploy.

Make AI Your App Development Advantage: Learn Why and How
Date: March 12, 2025
Time: 10:00 AM ET
Register for Free!

The future of app development is here, and AI is leading the charge. Join OutSystems and DZone, on March 12th at 10 AM ET, for an exclusive webinar with Luis Blando, CPTO of OutSystems, and John Rymer, industry analyst at Analysis.Tech, as they discuss how AI and low-code are revolutionizing development. You will also hear from David Gilkey, Leader of Solution Architecture, Americas East at OutSystems, and Roy van de Kerkhof, Director at NovioQ. This session will give you the tools and knowledge you need to accelerate your development and stay ahead of the curve in the ever-evolving tech landscape.

Developer Experience: The Coalescence of Developer Productivity, Process Satisfaction, and Platform Engineering
Date: March 12, 2025
Time: 1:00 PM ET
Register for Free!

Explore the future of developer experience at DZone's Virtual Roundtable, where a panel will dive into key insights from the 2025 Developer Experience Trend Report. Discover how AI, automation, and developer-centric strategies are shaping workflows, productivity, and satisfaction. Don't miss this opportunity to connect with industry experts and peers shaping the next chapter of software development.

Unpacking the 2025 Developer Experience Trends Report: Insights, Gaps, and Putting It into Action
Date: March 19, 2025
Time: 1:00 PM ET
Register for Free!

We've just seen the 2025 Developer Experience Trends Report from DZone, and while it shines a light on important themes like platform engineering, developer advocacy, and productivity metrics, there are some key gaps that deserve attention.
Join Cortex Co-founders Anish Dhar and Ganesh Datta for a special webinar, hosted in partnership with DZone, where they’ll dive into what the report gets right—and challenge the assumptions shaping the DevEx conversation. Their take? Developer experience is grounded in clear ownership. Without ownership clarity, teams face accountability challenges, cognitive overload, and inconsistent standards, ultimately hampering productivity. Don’t miss this deep dive into the trends shaping your team’s future. What's Next? DZone has more in store! Stay tuned for announcements about upcoming Webinars, Virtual Roundtables, Fireside Chats, and other developer-focused events. Whether you’re looking to sharpen your skills, explore new tools, or connect with industry leaders, there’s always something exciting on the horizon. Don’t miss out — save this article and check back often for updates!
In recent years, cloud-native applications have become the go-to standard for many businesses to build scalable applications. Among the many advancements in cloud technologies, serverless architectures stand out as a transformative approach. Ease of use and efficiency are the two most desirable properties for modern application development, and serverless architectures offer both. This has made serverless a game changer for both cloud providers and consumers. For companies looking to build applications with this approach, major cloud providers offer several serverless solutions. In this article, we will explore the features, benefits, and challenges of this architecture, along with use cases. I use AWS as an example to explore the concepts, but the same concepts are applicable across all major cloud providers.

Serverless

Serverless does not mean there are no servers. It simply means that the underlying infrastructure for those services is managed by the cloud providers. This allows architects and developers to design and build applications without worrying about managing the infrastructure. It is similar to using the ride-sharing app Uber: when you need a ride, you don't worry about owning or maintaining a car. Uber handles all that, and you just focus on getting where you need to go by paying for the ride. Serverless architectures offer many benefits that make them suitable and attractive for many use cases. Here are some of the key advantages:

Auto Scaling

One of the biggest advantages of serverless architecture is that it inherently supports scaling. Cloud providers handle the heavy lifting to offer near-infinite, out-of-the-box scalability. For instance, if an app built using serverless technologies suddenly gains popularity, the underlying services automatically scale to meet the app's needs. We don't have to wake up in the middle of the night to add servers or other resources.

Focus on Innovation

Since you are no longer burdened with managing servers, you can instead focus on building the application and adding features toward the app's growth. For any organization, whether small, medium, or large, this approach helps in concentrating on what truly matters — business growth.

Cost Efficiency

With traditional server models, you often end up paying for unused resources, as they are bought upfront and managed even when they are not in use. Serverless changes this by switching to a pay-as-you-use model. In most scenarios, you only pay for the resources that you actually use. If the app you build doesn't get traction right away, your costs will be minimal, like paying for a single session instead of an entire year. As the app's traffic grows, the cost will grow accordingly.

Faster Time-to-Market

With serverless frameworks, you can build and deploy applications much faster compared to traditional server models. When the app is ready, it can be deployed with minimal effort using serverless resources. Instead of spending time on server management, you can focus on development and adding new features, shipping them at a faster pace.

Reduced Operational Maintenance

Since cloud providers manage the infrastructure, consumers need not worry about provisioning, maintaining, scaling, or handling security patches and vulnerabilities.

Serverless frameworks offer flexibility and can be applied to a variety of use cases.
Whether it is building web applications or processing real-time data, they provide the scalability and efficiency needed for these use cases.

Building Web Service APIs With AWS Serverless

Now that we have discussed the benefits of serverless architectures, let us dive into some practical examples. In this section, we will create a simple backend web application using AWS serverless resources. The backend application design contains three layers that provide APIs for a web application. Once deployed on AWS, the gateway endpoint is available for API consumption. When the APIs are called by users, the requests are routed through the API Gateway to the appropriate Lambda functions. For each API request, a Lambda function is triggered, and it accesses DynamoDB to store and retrieve data. This design is a streamlined, cost-effective solution that scales automatically as demand grows, making it an ideal choice for building APIs with minimal overhead. The components in this design integrate well with each other, providing flexibility. There are two major components in this architecture — computing and storage.

Serverless Computing

Serverless computing changed the way cloud-native applications and services are built and deployed. It promises a true pay-as-you-go model with millisecond-level granularity without wasting any resources. Due to its simplicity and economic advantages, this approach gained popularity, and many cloud providers support these capabilities. The simplest way to use serverless computing is by providing code to be executed by the platform on demand. This approach led to the rise of Function-as-a-Service (FaaS) platforms focused on allowing small pieces of code, represented as functions, to run for a limited amount of time. The functions are triggered by events like HTTP requests, storage changes, messages, or notifications. As these functions are invoked and stopped when the code execution is complete, they don't keep any persistent state. To maintain state or persist data, they use services like DynamoDB, which provide durable storage capabilities. AWS Lambda is capable of scaling as per demand. For example, AWS Lambda processed more than 1.3 trillion invocations on Prime Day 2024. Such capabilities are crucial in handling sudden spurts of traffic.

Serverless Storage

In the serverless computing ecosystem, serverless storage refers to cloud-based storage solutions that scale automatically without requiring consumers to manage the infrastructure. These services offer many capabilities, including on-demand scalability, high availability, and pay-as-you-go pricing. For instance, DynamoDB is a fully managed, serverless NoSQL database designed to handle key-value and document data models. It is purpose-built for applications requiring consistent performance at any scale, offering single-digit millisecond latency. It also provides seamless integration capabilities with many other services. Major cloud providers offer numerous serverless storage options for specific needs, such as S3, ElastiCache, Aurora, and many more.

Other Use Cases

In the previous section, we discussed how to leverage serverless architecture to build backend APIs for a web application. There are several other use cases that can benefit from serverless architecture. A few of those use cases include:

Data Processing

Let's explore another example of how serverless architecture can be used to notify services based on data changes in a datastore.
For instance, in an e-commerce platform, several services need to be informed when an order is created. Within the AWS ecosystem, the order can be stored in DynamoDB upon creation. To notify other services, multiple events can be triggered based on this storage event. Using DynamoDB Streams, a Lambda function can be invoked when this event occurs. This Lambda function can then push the change event to SNS (Simple Notification Service). SNS acts as the notification service that notifies the other services interested in these events.

Real-Time File Processing

In many applications, users upload images that need to be stored, processed for resizing, converted to different formats, and analyzed. We can achieve this functionality using AWS serverless architecture in the following way. When an image is uploaded, it is pushed to an S3 bucket configured to trigger an event that invokes a Lambda function. The Lambda function can process the image, store metadata in DynamoDB, and store resized images in another S3 bucket. This scalable architecture can be used to process millions of images without requiring any infrastructure management or manual intervention.

Challenges

Serverless architectures offer many benefits, but they also bring certain challenges that need to be addressed.

Cold Start

When a serverless function is invoked, the platform needs to create, initialize, and run a new container to execute the code. This process, known as a cold start, can introduce additional latency in the workflow. Techniques like keeping functions warm or using provisioned concurrency can help reduce this delay.

Monitoring and Debugging

As there can be a large number of invocations, monitoring and debugging can become complex. It can be challenging to identify and debug issues in applications that are heavily used. Configuring tools like AWS CloudWatch for metrics, logs, and alerts is highly recommended to address these issues.

In addition, although serverless architectures scale automatically, resource configurations must be optimized to prevent bottlenecks. Proper resource allocation and implementation of cost optimization strategies are essential.

Conclusion

Serverless architecture is a major step forward in the development of cloud-native applications backed by serverless computing and storage. It is heavily used in many types of applications, including event-driven workflows, data processing, file processing, and big data analytics. Due to its scalability, agility, and high availability, serverless architecture has become a reliable choice for businesses of all sizes.
There are many situations where you may need to export data from XML to MongoDB. Despite the fact that XML and JSON(B) formats used in MongoDB have much in common, they also have a number of differences that make them non-interchangeable. Therefore, before you face the task of exporting data from XML to MongoDB, you will need to: Write your own XML parsing scripts;Use ETL tools. Although modern language models can write parsing scripts quite well in languages like Python, these scripts will have a serious problem — they won't be unified. For each file type, modern language models will generate a separate script. If you have more than one type of XML, this already creates significant problems in maintaining more than one parsing script. The above problem is usually solved using specialized ETL tools. In this article, we will look at an ETL tool called SmartXML. Although SmartXML also supports converting XML to a relational representation we will only look at the process of uploading XML into MongoDB. The actual XML can be extremely large and complex. This article is an introductory article, so we will dissect a situation in which: All XML has the same structure;The logical model of the XML is the same as the storage model in MongoDB;Extracted fields don't need complex processing; We'll cover those cases later, but first, let's examine a simple example: XML <marketingData> <customer> <name>John Smith</name> <email>john.smith@example.com</email> <purchases> <purchase> <product>Smartphone</product> <category>Electronics</category> <price>700</price> <store>TechWorld</store> <location>New York</location> <purchaseDate>2025-01-10</purchaseDate> </purchase> <purchase> <product>Wireless Earbuds</product> <category>Audio</category> <price>150</price> <store>GadgetStore</store> <location>New York</location> <purchaseDate>2025-01-11</purchaseDate> </purchase> </purchases> <importantInfo> <loyaltyStatus>Gold</loyaltyStatus> <age>34</age> <gender>Male</gender> <membershipID>123456</membershipID> </importantInfo> <lessImportantInfo> <browser>Chrome</browser> <deviceType>Mobile</deviceType> <newsletterSubscribed>true</newsletterSubscribed> </lessImportantInfo> </customer> <customer> <name>Jane Doe</name> <email>jane.doe@example.com</email> <purchases> <purchase> <product>Laptop</product> <category>Electronics</category> <price>1200</price> <store>GadgetStore</store> <location>San Francisco</location> <purchaseDate>2025-01-12</purchaseDate> </purchase> <purchase> <product>USB-C Adapter</product> <category>Accessories</category> <price>30</price> <store>TechWorld</store> <location>San Francisco</location> <purchaseDate>2025-01-13</purchaseDate> </purchase> <purchase> <product>Keyboard</product> <category>Accessories</category> <price>80</price> <store>OfficeMart</store> <location>San Francisco</location> <purchaseDate>2025-01-14</purchaseDate> </purchase> </purchases> <importantInfo> <loyaltyStatus>Silver</loyaltyStatus> <age>28</age> <gender>Female</gender> <membershipID>654321</membershipID> </importantInfo> <lessImportantInfo> <browser>Safari</browser> <deviceType>Desktop</deviceType> <newsletterSubscribed>false</newsletterSubscribed> </lessImportantInfo> </customer> <customer> <name>Michael Johnson</name> <email>michael.johnson@example.com</email> <purchases> <purchase> <product>Headphones</product> <category>Audio</category> <price>150</price> <store>AudioZone</store> <location>Chicago</location> <purchaseDate>2025-01-05</purchaseDate> </purchase> </purchases> <importantInfo> 
<loyaltyStatus>Bronze</loyaltyStatus> <age>40</age> <gender>Male</gender> <membershipID>789012</membershipID> </importantInfo> <lessImportantInfo> <browser>Firefox</browser> <deviceType>Tablet</deviceType> <newsletterSubscribed>true</newsletterSubscribed> </lessImportantInfo> </customer> <customer> <name>Emily Davis</name> <email>emily.davis@example.com</email> <purchases> <purchase> <product>Running Shoes</product> <category>Sportswear</category> <price>120</price> <store>FitShop</store> <location>Los Angeles</location> <purchaseDate>2025-01-08</purchaseDate> </purchase> <purchase> <product>Yoga Mat</product> <category>Sportswear</category> <price>40</price> <store>FitShop</store> <location>Los Angeles</location> <purchaseDate>2025-01-09</purchaseDate> </purchase> </purchases> <importantInfo> <loyaltyStatus>Gold</loyaltyStatus> <age>25</age> <gender>Female</gender> <membershipID>234567</membershipID> </importantInfo> <lessImportantInfo> <browser>Edge</browser> <deviceType>Mobile</deviceType> <newsletterSubscribed>false</newsletterSubscribed> </lessImportantInfo> </customer> <customer> <name>Robert Brown</name> <email>robert.brown@example.com</email> <purchases> <purchase> <product>Smartwatch</product> <category>Wearable</category> <price>250</price> <store>GadgetPlanet</store> <location>Boston</location> <purchaseDate>2025-01-07</purchaseDate> </purchase> <purchase> <product>Fitness Band</product> <category>Wearable</category> <price>100</price> <store>HealthMart</store> <location>Boston</location> <purchaseDate>2025-01-08</purchaseDate> </purchase> </purchases> <importantInfo> <loyaltyStatus>Silver</loyaltyStatus> <age>37</age> <gender>Male</gender> <membershipID>345678</membershipID> </importantInfo> <lessImportantInfo> <browser>Chrome</browser> <deviceType>Mobile</deviceType> <newsletterSubscribed>true</newsletterSubscribed> </lessImportantInfo> </customer> </marketingData> In this example, we will upload in the MongoDB only the fields that serve a practical purpose, rather than the entire XML. Create a New Project It is recommended to create a new project from the GUI. This will automatically create the necessary folder structure and parsing rules. A full description of the project structure can be found in the official documentation. All parameters described in this article can be configured in graphical mode, but for clarity, we will focus on the textual representation. In addition to the config.txt file with project settings, job.txt for batch work, the project itself consists of: Template of intermediate internal SmartDOM view, located in the project folder templates/data-templates.red.Rules for processing and transformation of SmartDOM itself, located in the rules folder. Let's consider the structure of data-templates.red: Plain Text #[ sample: #[ marketing_data: #[ customers: [ customer: [ name: none email: none purchases: [ purchase: [ product: none category: none price: none store: none location: none purchase_date: none ] ] ] ] ] ] ] Note The name sample is the name of the category, and it doesn't matter.The marketing_data is the name of the subcategory. We need at least one code subcategory (subtype).The intermediate view names don't require exact matches with XML tag names. In this example, we intentionally used the snake_case style. Extract Rules The rules are located in the rules directory in the project folder. 
When working with MongoDB we will only be interested in two rules: tags-matching-rules.red — sets the matches between the XML tag tree and SmartDOMgrow-rules.red — describes the relationship between SmartDOM nodes and real XML nodes Plain Text sample: [ purchase: ["purchase"] customer: ["customer"] ] The key will be the name of the node in SmartDOM; the value will be an array containing the node spelling variants from the real XML file. In our example, these names are the same. Ignored Tags To avoid loading minor data into MongoDB in the example above, we create files in the ignores folder — one per section, named after each section. These files contain lists of tags to skip during extraction. For our example, we'll have a sample.txt file containing: Plain Text ["marketingData" "customer" "lessImportantInfo" "browser"] ["marketingData" "customer" "lessImportantInfo" "deviceType"] ["marketingData" "customer" "lessImportantInfo" "newsletterSubscribed"] As a result, when analyzing morphology, the intermediate representation will take the next form: Plain Text customers: [ customer: [ name: "John Smith" email: "john.smith@example.com" loyalty_status: "Gold" age: "34" gender: "Male" membership_id: "123456" purchases: [ purchase: [ product: "Smartphone" category: "Electronics" price: "700" store: "TechWorld" location: "New York" purchase_date: "2025-01-10" ] ] ] ] Note that after morphological analysis, only a minimal representation is shown containing data from the first found nodes. Here's the JSON file that will be generated: JSON { "customers": [ { "name": "John Smith", "email": "john.smith@example.com", "loyalty_status": "Gold", "age": "34", "gender": "Male", "membership_id": "123456", "purchases": [ { "product": "Smartphone", "category": "Electronics", "price": "700", "store": "TechWorld", "location": "New York", "purchase_date": "2025-01-10" }, { "product": "Wireless Earbuds", "category": "Audio", "price": "150", "store": "GadgetStore", "location": "New York", "purchase_date": "2025-01-11" } ] }, { "name": "Jane Doe", "email": "jane.doe@example.com", "loyalty_status": "Silver", "age": "28", "gender": "Female", "membership_id": "654321", "purchases": [ { "product": "Laptop", "category": "Electronics", "price": "1200", "store": "GadgetStore", "location": "San Francisco", "purchase_date": "2025-01-12" }, { "product": "USB-C Adapter", "category": "Accessories", "price": "30", "store": "TechWorld", "location": "San Francisco", "purchase_date": "2025-01-13" }, { "product": "Keyboard", "category": "Accessories", "price": "80", "store": "OfficeMart", "location": "San Francisco", "purchase_date": "2025-01-14" } ] }, { "name": "Michael Johnson", "email": "michael.johnson@example.com", "loyalty_status": "Bronze", "age": "40", "gender": "Male", "membership_id": "789012", "purchases": [ { "product": "Headphones", "category": "Audio", "price": "150", "store": "AudioZone", "location": "Chicago", "purchase_date": "2025-01-05" } ] }, { "name": "Emily Davis", "email": "emily.davis@example.com", "loyalty_status": "Gold", "age": "25", "gender": "Female", "membership_id": "234567", "purchases": [ { "product": "Running Shoes", "category": "Sportswear", "price": "120", "store": "FitShop", "location": "Los Angeles", "purchase_date": "2025-01-08" }, { "product": "Yoga Mat", "category": "Sportswear", "price": "40", "store": "FitShop", "location": "Los Angeles", "purchase_date": "2025-01-09" } ] }, { "name": "Robert Brown", "email": "robert.brown@example.com", "loyalty_status": "Silver", "age": "37", "gender": 
"Male", "membership_id": "345678", "purchases": [ { "product": "Smartwatch", "category": "Wearable", "price": "250", "store": "GadgetPlanet", "location": "Boston", "purchase_date": "2025-01-07" }, { "product": "Fitness Band", "category": "Wearable", "price": "100", "store": "HealthMart", "location": "Boston", "purchase_date": "2025-01-08" } ] } ] } Configuring Connection to MongoDB Since MongoDB doesn't support direct HTTP data insertion, an intermediary service will be required. Let's install the dependencies: pip install flask pymongo. The service itself: Python from flask import Flask, request, jsonify from pymongo import MongoClient import json app = Flask(__name__) # Connection to MongoDB client = MongoClient('mongodb://localhost:27017') db = client['testDB'] collection = db['testCollection'] @app.route('/insert', methods=['POST']) def insert_document(): try: # Flask will automatically parse JSON if Content-Type: application/json data = request.get_json() if not data: return jsonify({"error": "Empty JSON payload"}), 400 result = collection.insert_one(data) return jsonify({"insertedId": str(result.inserted_id)}), 200 except Exception as e: import traceback print(traceback.format_exc()) return jsonify({"error": str(e)}), 500 if __name__ == '__main__': app.run(port=3000) We'll set up the MongoDB connection settings in the config.txt file (see nosql-url): Plain Text job-number: 1 root-xml-folder: "D:/data/data-samples" xml-filling-stat: false ; table: filling_percent_stat should exists ignore-namespaces: false ignore-tag-attributes: false use-same-morphology-for-same-file-name-pattern: false skip-schema-version-tag: true use-same-morphology-for-all-files-in-folder: false delete-data-before-insert: none connect-to-db-at-project-opening: true source-database: "SQLite" ; available values: PostgreSQL/SQLite target-database: "SQLite" ; available values: PostgreSQL/SQLite/NoSQL bot-chatID: "" bot-token: "" telegram-notifications: true db-driver: "" db-server: "127.0.0.1" db-port: "" db-name: "" db-user: "" db-pass: "" sqlite-driver-name: "SQLite3 ODBC Driver" sqlite-db-path: "" nosql-url: "http://127.0.0.1:3000/insert" append-subsection-name-to-nosql-url: false no-sql-login: "" ; login and pass are empty no-sql-pass: "" Remember that MongoDB will automatically create a database and a collection of the same name if they do not exist. However, this behavior may cause errors, and it is recommended to disable it by default. Let's run the service itself: Python python .\app.py Next, click Parse, then Send JSON to NoSQL. 
Now connect to the MongoDB console in any convenient way and execute the following commands: Plain Text show databases admin 40.00 KiB config 72.00 KiB local 72.00 KiB testDB 72.00 KiB use testDB switched to db testDB db.testCollection.find().pretty() The result should look like the following: JSON { _id: ObjectId('278e1b2c7c1823d4fde120ef'), customers: [ { name: 'John Smith', email: 'john.smith@example.com', loyalty_status: 'Gold', age: '34', gender: 'Male', membership_id: '123456', purchases: [ { product: 'Smartphone', category: 'Electronics', price: '700', store: 'TechWorld', location: 'New York', purchase_date: '2025-01-10' }, { product: 'Wireless Earbuds', category: 'Audio', price: '150', store: 'GadgetStore', location: 'New York', purchase_date: '2025-01-11' } ] }, { name: 'Jane Doe', email: 'jane.doe@example.com', loyalty_status: 'Silver', age: '28', gender: 'Female', membership_id: '654321', purchases: [ { product: 'Laptop', category: 'Electronics', price: '1200', store: 'GadgetStore', location: 'San Francisco', purchase_date: '2025-01-12' }, { product: 'USB-C Adapter', category: 'Accessories', price: '30', store: 'TechWorld', location: 'San Francisco', purchase_date: '2025-01-13' }, { product: 'Keyboard', category: 'Accessories', price: '80', store: 'OfficeMart', location: 'San Francisco', purchase_date: '2025-01-14' } ] }, { name: 'Michael Johnson', email: 'michael.johnson@example.com', loyalty_status: 'Bronze', age: '40', gender: 'Male', membership_id: '789012', purchases: [ { product: 'Headphones', category: 'Audio', price: '150', store: 'AudioZone', location: 'Chicago', purchase_date: '2025-01-05' } ] }, { name: 'Emily Davis', email: 'emily.davis@example.com', loyalty_status: 'Gold', age: '25', gender: 'Female', membership_id: '234567', purchases: [ { product: 'Running Shoes', category: 'Sportswear', price: '120', store: 'FitShop', location: 'Los Angeles', purchase_date: '2025-01-08' }, { product: 'Yoga Mat', category: 'Sportswear', price: '40', store: 'FitShop', location: 'Los Angeles', purchase_date: '2025-01-09' } ] }, { name: 'Robert Brown', email: 'robert.brown@example.com', loyalty_status: 'Silver', age: '37', gender: 'Male', membership_id: '345678', purchases: [ { product: 'Smartwatch', category: 'Wearable', price: '250', store: 'GadgetPlanet', location: 'Boston', purchase_date: '2025-01-07' }, { product: 'Fitness Band', category: 'Wearable', price: '100', store: 'HealthMart', location: 'Boston', purchase_date: '2025-01-08' } ] } ] } Conclusion In this example, we have seen how we can automate the uploading of XML files to MongoDB without having to write any code. Although the example considers only one file, it is possible within the framework of one project to a huge number of types and subtypes of files with different structures, as well as to perform quite complex manipulations, such as type conversion and the use of external services to process field values in real time. This allows not only the unloading of data from XML but also the processing of some of the values via external API, including the use of large language models.
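Because SmartXML extracted numeric fields such as price and age as strings, any numeric analysis on the MongoDB side needs an explicit conversion. Below is a minimal sketch using pymongo and the $toDouble aggregation operator (available in MongoDB 4.0+), assuming the testDB/testCollection names used above; the "total spent per customer" metric is only an illustration.

Python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["testDB"]["testCollection"]

# Unwind the nested arrays and sum the string prices after converting them to doubles.
pipeline = [
    {"$unwind": "$customers"},
    {"$unwind": "$customers.purchases"},
    {"$group": {
        "_id": "$customers.name",
        "total_spent": {"$sum": {"$toDouble": "$customers.purchases.price"}},
    }},
]
for row in collection.aggregate(pipeline):
    print(row["_id"], row["total_spent"])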
ReactJS has become a go-to library for building dynamic and responsive user interfaces. However, as applications grow, managing asynchronous data streams becomes more challenging. Enter RxJS, a powerful library for reactive programming using observables. RxJS operators simplify handling complex asynchronous data flows, making your React components more manageable and efficient. In this article, we'll explore RxJS operators within the context of ReactJS. We'll walk through step-by-step examples, demonstrating how to integrate RxJS into your React applications. By the end of this guide, you'll have a solid understanding of RxJS operators and how they can enhance your ReactJS projects. What Is RxJS? RxJS, or Reactive Extensions for JavaScript, is a library that allows you to work with asynchronous data streams using observables. An observable is a collection that arrives over time, enabling you to react to changes in data efficiently. But why use RxJS in ReactJS? ReactJS is inherently stateful and deals with UI rendering. Incorporating RxJS allows you to handle complex asynchronous operations like API calls, event handling, and state management with greater ease and predictability. Why Should You Use RxJS in ReactJS? Improved Asynchronous Handling In ReactJS, handling asynchronous operations like API calls or user events can become cumbersome. RxJS operators like map, filter, and debounceTime allow you to manage these operations elegantly, transforming data streams as they flow through your application. Cleaner and More Readable Code RxJS promotes a functional programming approach, making your code more declarative. Instead of managing state changes and side effects manually, you can leverage RxJS operators to handle these tasks concisely. Enhanced Error Handling RxJS provides powerful error-handling mechanisms, allowing you to gracefully manage errors in your asynchronous operations. Operators like catchError and retry can automatically recover from errors without cluttering your code with try-catch blocks. Setting Up RxJS in a ReactJS Project Before diving into the code, let's set up a basic ReactJS project with RxJS installed. JavaScript npx create-react-app rxjs-react-example cd rxjs-react-example npm install rxjs Once you have RxJS installed, you're ready to start integrating it into your React components. Step-by-Step Example Let's walk through a detailed example of using RxJS in a ReactJS application. We'll create a simple app that fetches data from an API and displays it in a list. We'll use RxJS operators to handle the asynchronous data stream efficiently. Step 1: Creating a Simple React Component First, create a new component called DataFetcher.js: JavaScript import React, { useEffect, useState } from 'react'; const DataFetcher = () => { const [data, setData] = useState([]); const [error, setError] = useState(null); return ( <div> <h1>Data Fetcher</h1> {error && <p>Error: {error}</p>} <ul> {data.map(item => ( <li key={item.id}>{item.name}</li> ))} </ul> </div> ); }; export default DataFetcher; This component initializes state variables for data and error. It renders a list of data fetched from an API and handles errors gracefully. Step 2: Importing RxJS and Creating an Observable Next, we'll import RxJS and create an observable for fetching data. 
In the same DataFetcher.js file, modify the component to include the following: JavaScript import { of } from 'rxjs'; import { catchError, map } from 'rxjs/operators'; import { ajax } from 'rxjs/ajax'; const fetchData = () => { return ajax.getJSON('https://jsonplaceholder.typicode.com/users').pipe( map(response => response), catchError(error => of({ error: true, message: error.message })) ); }; Here, we use the ajax.getJSON method from RxJS to fetch data from an API. The map operator is where you would transform the response (here it simply passes it through unchanged), and catchError handles any errors, returning an observable that we can subscribe to. Step 3: Subscribing to the Observable in useEffect Now, we'll use the useEffect hook to subscribe to the observable and update the component state accordingly: JavaScript useEffect(() => { const subscription = fetchData().subscribe({ next: (result) => { if (result.error) { setError(result.message); } else { setData(result); } }, error: (err) => setError(err.message), }); return () => subscription.unsubscribe(); }, []); This code subscribes to the fetchData observable. If the observable emits an error, it updates the error state; otherwise, it updates the data state. The subscription is cleaned up when the component unmounts to prevent memory leaks. Step 4: Enhancing the Data Fetching Process Now that we have a basic implementation, let's enhance it using more RxJS operators. For example, we can add a loading state (a const [loading, setLoading] = useState(false) hook in the component) to give the user feedback while the request is in flight. JavaScript import { debounceTime, tap } from 'rxjs/operators'; const fetchData = () => { return ajax.getJSON('https://jsonplaceholder.typicode.com/users').pipe( debounceTime(500), tap(() => setLoading(true)), map(response => response), catchError(error => of({ error: true, message: error.message })), tap(() => setLoading(false)) ); }; In this enhanced version, the tap operators toggle the loading flag as values flow through the pipeline. Keep in mind that debounceTime here operates on the response stream, so it delays the emitted result rather than preventing extra requests; to truly debounce API calls, apply debounceTime to the stream that triggers the fetch, as the debounceTime and switchMap examples below show. Common RxJS Operators and Their Usage in ReactJS RxJS offers a wide range of operators that can be incredibly useful in ReactJS applications. Here are a few common operators and how they can be used: map The map operator transforms each value emitted by an observable. In ReactJS, it can be used to format data before rendering it in the UI. JavaScript const transformedData$ = fetchData().pipe( map(data => data.map(item => ({ ...item, fullName: `${item.name} (${item.username})` }))) ); filter The filter operator allows you to filter out values that don't meet certain criteria. This is useful for displaying only relevant data to the user. JavaScript const filteredData$ = fetchData().pipe( filter(item => item.isActive) ); debounceTime debounceTime delays the emission of values from an observable, making it ideal for handling user input events like search queries. JavaScript const searchInput$ = fromEvent(searchInput, 'input').pipe( debounceTime(300), map(event => event.target.value) ); switchMap switchMap is perfect for handling scenarios where only the latest result of an observable matters, such as autocomplete suggestions. JavaScript const autocomplete$ = searchInput$.pipe( switchMap(query => ajax.getJSON(`/api/search?q=${query}`)) ); Advanced RxJS and ReactJS Integration: Leveraging More Operators and Patterns Combining Observables With merge Sometimes, you need to handle multiple asynchronous streams simultaneously.
The merge operator allows you to combine multiple observables into a single observable, emitting values from each as they arrive. JavaScript import { merge, of, interval } from 'rxjs'; import { map } from 'rxjs/operators'; const observable1 = interval(1000).pipe(map(val => `Stream 1: ${val}`)); const observable2 = interval(1500).pipe(map(val => `Stream 2: ${val}`)); const combined$ = merge(observable1, observable2); useEffect(() => { const subscription = combined$.subscribe(value => { console.log(value); // Logs values from both streams as they arrive }); return () => subscription.unsubscribe(); }, []); In a React app, you can use merge to simultaneously listen to multiple events or API calls and handle them in a unified manner. Real-Time Data Streams With interval and scan For applications requiring real-time updates, such as stock tickers or live dashboards, RxJS can create and process streams effectively. JavaScript import { interval } from 'rxjs'; import { scan } from 'rxjs/operators'; const ticker$ = interval(1000).pipe( scan(count => count + 1, 0) ); useEffect(() => { const subscription = ticker$.subscribe(count => { console.log(`Tick: ${count}`); // Logs ticks every second }); return () => subscription.unsubscribe(); }, []); In this example, scan acts like a reducer, maintaining a cumulative state across emissions. Advanced User Input Handling With combineLatest For complex forms or scenarios where multiple input fields interact, the combineLatest operator is invaluable. JavaScript import { fromEvent, combineLatest } from 'rxjs'; import { map } from 'rxjs/operators'; const emailInput = document.getElementById('email'); const passwordInput = document.getElementById('password'); const email$ = fromEvent(emailInput, 'input').pipe( map(event => event.target.value) ); const password$ = fromEvent(passwordInput, 'input').pipe( map(event => event.target.value) ); const form$ = combineLatest([email$, password$]).pipe( map(([email, password]) => ({ email, password })) ); useEffect(() => { const subscription = form$.subscribe(formData => { console.log('Form Data:', formData); }); return () => subscription.unsubscribe(); }, []); This example listens to multiple input fields and emits the latest values together, simplifying form state management. Retry Logic With retryWhen and delay In scenarios where network reliability is an issue, RxJS can help implement retry mechanisms with exponential backoff. JavaScript import { ajax } from 'rxjs/ajax'; import { retryWhen, delay, scan } from 'rxjs/operators'; const fetchData = () => { return ajax.getJSON('https://api.example.com/data').pipe( retryWhen(errors => errors.pipe( scan((retryCount, err) => { if (retryCount >= 3) throw err; return retryCount + 1; }, 0), delay(2000) ) ) ); }; useEffect(() => { const subscription = fetchData().subscribe({ next: data => setData(data), error: err => setError(err.message) }); return () => subscription.unsubscribe(); }, []); This approach retries the API call up to three times, with a delay between attempts, improving user experience during transient failures. Loading Indicators With startWith To provide a seamless user experience, you can show a loading indicator until data is available by using the startWith operator. 
JavaScript import { ajax } from 'rxjs/ajax'; import { startWith } from 'rxjs/operators'; const fetchData = () => { return ajax.getJSON('https://jsonplaceholder.typicode.com/users').pipe( startWith([]) // Emit an empty array initially ); }; useEffect(() => { const subscription = fetchData().subscribe(data => { setData(data); }); return () => subscription.unsubscribe(); }, []); This ensures the UI displays a placeholder or spinner until data is loaded. Cancelable Requests With takeUntil Handling cleanup of asynchronous operations is critical, especially for search or dynamic queries. The takeUntil operator helps cancel observables. JavaScript import { Subject } from 'rxjs'; import { ajax } from 'rxjs/ajax'; import { debounceTime, switchMap, takeUntil } from 'rxjs/operators'; const search$ = new Subject(); const cancel$ = new Subject(); const searchObservable = search$.pipe( debounceTime(300), switchMap(query => ajax.getJSON(`https://api.example.com/search?q=${query}`).pipe( takeUntil(cancel$) ) ) ); useEffect(() => { const subscription = searchObservable.subscribe(data => { setData(data); }); return () => cancel$.next(); // Cancel ongoing requests on unmount }, []); const handleSearch = (query) => search$.next(query); Here, takeUntil ensures that any ongoing API calls are canceled when a new query is entered, or the component unmounts. FAQs What Is the Difference Between RxJS and Redux? RxJS focuses on managing asynchronous data streams using observables, while Redux is a state management library. RxJS can be used with Redux to handle complex async logic, but they serve different purposes. Can I Use RxJS With Functional Components? Yes, RxJS works seamlessly with React's functional components. You can use hooks like useEffect to subscribe to observables and manage side effects. Is RxJS Overkill for Small React Projects? For small projects, RxJS might seem like overkill. However, as your project grows and you need to handle complex asynchronous data flows, RxJS can simplify your code and make it more maintainable. How Do I Debug RxJS in ReactJS? Debugging RxJS code can be done using tools like the Redux DevTools or RxJS-specific logging operators like tap to inspect emitted values at various stages. How Do I Optimize for High-Frequency Events? Operators like throttleTime and auditTime are ideal for handling high-frequency events like scrolling or resizing. Can RxJS Replace React State Management Libraries? RxJS is not a state management solution but can complement libraries like Redux for handling complex async logic. For smaller projects, RxJS with BehaviorSubject can sometimes replace state management libraries. What Are Best Practices for RxJS in ReactJS? Use takeUntil for cleanup in useEffect to avoid memory leaks.Avoid overusing RxJS for simple synchronous state updates; prefer React's built-in tools for that.Test observables independently to ensure reliability. Conclusion RxJS is a powerful tool for managing asynchronous data in ReactJS applications. Using RxJS operators, you can write cleaner, more efficient, and maintainable code. Understanding and applying RxJS in your ReactJS projects will significantly enhance your ability to handle complex asynchronous data flows, making your applications more scalable.
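As a closing sketch, the subscribe-and-clean-up pattern repeated throughout this article can be wrapped in a small reusable hook. The hook name and signature below are illustrative (they are not part of RxJS or React), and error handling is kept minimal:

JavaScript
import { useEffect, useState } from 'react';

// Subscribes to any observable and unsubscribes automatically on unmount.
const useObservable = (observable$, initialValue = null) => {
  const [value, setValue] = useState(initialValue);
  const [error, setError] = useState(null);

  useEffect(() => {
    const subscription = observable$.subscribe({
      next: setValue,
      error: (err) => setError(err.message),
    });
    return () => subscription.unsubscribe();
  }, [observable$]);

  return { value, error };
};

// Usage in a component: const { value: users, error } = useObservable(fetchData(), []);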
Hello, mate! Today, let’s talk about what database migrations are and why they’re so important. In today’s world, it’s no surprise that any changes to a database should be done carefully and according to a specific process. Ideally, these steps would be integrated into our CI/CD pipeline so that everything runs automatically. Here’s our agenda: What’s the problem?How do we fix it?A simple exampleA more complex exampleRecommendationsResultsConclusion What’s the Problem? If your team has never dealt with database migrations and you’re not entirely sure why they’re needed, let’s sort that out. If you already know the basics, feel free to skip ahead. Main Challenge When we make “planned” and “smooth” changes to the database, we need to maintain service availability and meet SLA requirements (so that users don’t suffer from downtime or lag). Imagine you want to change a column type in a table with 5 million users. If you do this “head-on” (e.g., simply run ALTER TABLE without prep), the table could get locked for a significant amount of time — and your users would be left without service. To avoid such headaches, follow two rules: Apply migrations in a way that doesn’t lock the table (or at least minimizes locks).If you need to change a column type, it’s often easier to create a new column with the correct type first and then drop the old one afterward. Another Problem: Version Control and Rollbacks Sometimes you need to roll back a migration. Doing this manually — going into the production database and fiddling with data — is not only risky but also likely impossible if you don’t have direct access. That’s where dedicated migration tools come in handy. They let you apply changes cleanly and revert them if necessary. How Do We Fix It? Use the Right Tools Each language and ecosystem has its own migration tools: For Java, Liquibase or Flyway are common.For Go, a popular choice is goose (the one we’ll look at here).And so on. Goose: What It Is and Why It’s Useful Goose is a lightweight Go utility that helps you manage migrations automatically. It offers: Simplicity. Minimal dependencies and a transparent file structure for migrations.Versatility. Supports various DB drivers (PostgreSQL, MySQL, SQLite, etc.).Flexibility. Write migrations in SQL or Go code. Installing Goose Shell go install github.com/pressly/goose/v3/cmd/goose@latest How It Works: Migration Structure By default, Goose looks for migration files in db/migrations. Each migration follows this format: Shell NNN_migration_name.(sql|go) NNN is the migration number (e.g., 001, 002, etc.).After that, you can have any descriptive name, for example init_schema.The extension can be .sql or .go. Example of an SQL Migration File: 001_init_schema.sql: SQL -- +goose Up CREATE TABLE users ( id SERIAL PRIMARY KEY, username VARCHAR(255) NOT NULL, created_at TIMESTAMP NOT NULL DEFAULT now() ); -- +goose Down DROP TABLE users; Our First Example Changing a Column Type (String → Int) Suppose we have a users table with a column age of type VARCHAR(255). Now we want to change it to INTEGER. Here’s what the migration might look like (file 005_change_column_type.sql): SQL -- +goose Up ALTER TABLE users ALTER COLUMN age TYPE INTEGER USING (age::INTEGER); -- +goose Down ALTER TABLE users ALTER COLUMN age TYPE VARCHAR(255) USING (age::TEXT); What’s happening here: Up migration We change the age column to INTEGER. 
The USING (age::INTEGER) clause tells PostgreSQL how to convert existing data to the new type.Note that this migration will fail if there’s any data in age that isn’t numeric. In that case, you’ll need a more complex strategy (see below). Down migration If we roll back, we return age to VARCHAR(255).We again use USING (age::TEXT) to convert from INTEGER back to text. The Second and Complex Cases: Multi-Step Migrations If the age column might contain messy data (not just numbers), it’s safer to do this in several steps: Add a new column (age_int) of type INTEGER.Copy valid data into the new column, dealing with or removing invalid entries.Drop the old column. SQL -- +goose Up -- Step 1: Add a new column ALTER TABLE users ADD COLUMN age_int INTEGER; -- Step 2: Try to move data over UPDATE users SET age_int = CASE WHEN age ~ '^[0-9]+$' THEN age::INTEGER ELSE NULL END; -- (optional) remove rows where data couldn’t be converted -- DELETE FROM users WHERE age_int IS NULL; -- Step 3: Drop the old column ALTER TABLE users DROP COLUMN age; -- +goose Down -- Step 1: Recreate the old column ALTER TABLE users ADD COLUMN age VARCHAR(255); -- Step 2: Copy data back UPDATE users SET age = age_int::TEXT; -- Step 3: Drop the new column ALTER TABLE users DROP COLUMN age_int; To allow a proper rollback, the Down section just mirrors the actions in reverse. Automation is Key To save time, it’s really convenient to add migration commands to a Makefile (or any other build system). Below is an example Makefile with the main Goose commands for PostgreSQL. Let’s assume: The DSN for the database is postgres://user:password@localhost:5432/dbname?sslmode=disable.Migration files are in db/migrations. Shell # File: Makefile DB_DSN = "postgres://user:password@localhost:5432/dbname?sslmode=disable" MIGRATIONS_DIR = db/migrations # Install Goose (run once) install-goose: go install github.com/pressly/goose/v3/cmd/goose@latest # Create a new SQL migration file new-migration: ifndef NAME $(error Usage: make new-migration NAME=your_migration_name) endif goose -dir $(MIGRATIONS_DIR) create $(NAME) sql # Apply all pending migrations migrate-up: goose -dir $(MIGRATIONS_DIR) postgres $(DB_DSN) up # Roll back the last migration migrate-down: goose -dir $(MIGRATIONS_DIR) postgres $(DB_DSN) down # Roll back all migrations (be careful in production!) migrate-reset: goose -dir $(MIGRATIONS_DIR) postgres $(DB_DSN) reset # Check migration status migrate-status: goose -dir $(MIGRATIONS_DIR) postgres $(DB_DSN) status How to Use It? 1. Create a new migration (SQL file). This generates a file db/migrations/002_add_orders_table.sql. Shell make new-migration NAME=add_orders_table 2. Apply all migrations. Goose will create a schema_migrations table in your database (if it doesn’t already exist) and apply any new migrations in ascending order. Shell make migrate-up 3. Roll back the last migration. Just down the last one. Shell make migrate-down 4. Roll back all migrations (use caution in production). Full reset. Shell make migrate-reset 5. Check migration status. 
Shell make migrate-status Output example: Shell $ goose status $ Applied At Migration $ ======================================= $ Sun Jan 6 11:25:03 2013 -- 001_basics.sql $ Sun Jan 6 11:25:03 2013 -- 002_next.sql $ Pending -- 003_and_again.go Summary By using migration tools and a Makefile, we can: Restrict direct access to the production database, making changes only through migrations. Easily track database versions and roll them back if something goes wrong. Maintain a single, consistent history of database changes. Perform "smooth" migrations that won't break a running production environment in a microservices world. Gain extra validation: every change goes through a PR and code review process (assuming you have those checks in place). Another advantage is that it's easy to integrate all these commands into your CI/CD pipeline. And remember: security above all else. For instance: YAML jobs: migrate: runs-on: ubuntu-latest steps: - name: Install Goose run: | make install-goose - name: Run database migrations env: DB_DSN: ${{ secrets.DATABASE_URL }} run: | make migrate-up Note that the Makefile above hard-codes DB_DSN; for CI you would typically define it as DB_DSN ?= ... so that the value injected from secrets takes precedence. Conclusion and Tips The main ideas are simple: Keep your migrations small and frequent. They're easier to review, test, and revert if needed. Use the same tool across all environments so dev, stage, and prod are in sync. Integrate migrations into CI/CD so you're not dependent on any one person manually running them. In this way, you'll have a reliable and controlled process for changing your database structure, one that doesn't break production and lets you respond quickly if something goes wrong. For changes that need application logic, goose also supports migrations written in Go; a short sketch follows below. Good luck with your migrations! Thanks for reading!
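The examples above are pure SQL, but goose can also run migrations written in Go, which is handy for backfills or data cleanup. A minimal sketch, assuming a file such as db/migrations/006_add_username_index.go and an illustrative index; note that Go migrations are compiled into your own migration binary rather than executed by the standalone goose CLI:

Go
package migrations

import (
	"database/sql"

	"github.com/pressly/goose/v3"
)

func init() {
	// Register the up/down pair; the version number is taken from the file name.
	goose.AddMigration(upAddUsernameIndex, downAddUsernameIndex)
}

func upAddUsernameIndex(tx *sql.Tx) error {
	// Hypothetical change: speed up lookups by username.
	_, err := tx.Exec(`CREATE INDEX idx_users_username ON users (username)`)
	return err
}

func downAddUsernameIndex(tx *sql.Tx) error {
	_, err := tx.Exec(`DROP INDEX idx_users_username`)
	return err
}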
Apache Doris, a high-performance, real-time analytical database, boasts an impressive underlying architecture and code design. For developers, mastering source code compilation and debugging is key to understanding Doris’s core. However, the build process involves multiple toolchains and dependency configurations, and during debugging, you may encounter various complex issues that can leave beginners feeling overwhelmed. This article walks you through the process from source code to runtime, providing a detailed analysis of Apache Doris’s compilation and debugging procedures. From environment setup and code checkout to troubleshooting common issues, we combine practical examples to help you quickly get started with Doris development and debugging. Overview Have you ever wondered how a SQL query is parsed and executed from start to finish? In Apache Doris, this process involves multiple core components and complex internal mechanisms. This article will guide you through the journey from source code to runtime, offering a comprehensive analysis of Doris’s build and debugging process, and helping you gain a deep understanding of SQL execution principles. 1. Environment Basic Environment Computer configuration. MacBook Pro (Chip: Apple M1, macOS: 15.1)JDK. Version 17Doris branch. Use the Doris Master branch (specifically, the branch-2.1) Installing Environment Dependencies When using Homebrew, the installed JDK version is 17 because on macOS the arm64 version of Homebrew does not include JDK 8 by default. Currently, Doris supports only JDK8 and JDK17. PowerShell brew install automake autoconf libtool pkg-config texinfo coreutils gnu-getopt \ python@3 cmake ninja ccache bison byacc gettext wget pcre maven llvm@16 openjdk@17 npm Dependency Explanation 1. Java, Maven, etc. These can be downloaded separately for easier management. On macOS, Zulu JDK17 is recommended.Maven can be downloaded from the official Maven website.Manually downloaded Java and Maven must be configured in your environment variables. 2. Other dependencies’ environment variables (example for Apple Silicon Macs): PowerShell export PATH=/opt/homebrew/opt/llvm/bin:$PATH export PATH=/opt/homebrew/opt/bison/bin:$PATH export PATH=/opt/homebrew/opt/texinfo/bin:$PATH ln -s -f /opt/homebrew/bin/python3 /opt/homebrew/bin/python Add the above configurations to your ~/.bashrc or ~/.zshrc file and run source ~/.bashrc or source ~/.zshrc to apply the changes. Installing Thrift Note: Thrift needs to be installed only when you are debugging just the FE (Frontend). When debugging both BE (Backend) and FE, the BE third-party libraries already include Thrift. Plain Text MacOS: 1. Download: `brew install thrift@0.16.0` 2. Create a symbolic link: `mkdir -p ./thirdparty/installed/bin` # Apple Silicon 芯片 macOS `ln -s /opt/homebrew/Cellar/thrift@0.16.0/0.16.0/bin/thrift ./thirdparty/installed/bin/thrift` # Intel 芯片 macOS `ln -s /usr/local/Cellar/thrift@0.16.0/0.16.0/bin/thrift ./thirdparty/installed/bin/thrift` Note: Running `brew install thrift@0.16.0` on macOS may report that the version cannot be found. To resolve this, execute the following commands in the terminal: 1. `brew tap homebrew/core --force` 2. `brew tap-new $USER/local-tap` 3. `brew extract --version='0.16.0' thrift $USER/local-tap` 4. 
`brew install thrift@0.16.0` Reference: `https://gist.github.com/tonydeng/02e571f273d6cce4230dc8d5f394493c` Fetching Your Code Clone your code by executing the following commands: PowerShell cd ~ mkdir DorisDev cd DorisDev git clone https://github.com/GitHubID/doris.git Setting Environment Variables PowerShell export DORIS_HOME=~/DorisDev/doris export PATH=$DORIS_HOME/bin:$PATH Downloading Doris Build Dependencies 1. Visit the Apache Doris Third Party Prebuilt page (link) to find the source code for all third-party libraries. You can directly download doris-thirdparty-source.tgz. 2. Alternatively, you can download the precompiled third-party libraries from the same page, which saves you from compiling these libraries yourself. Refer to the commands below. PowerShell cd thirdparty rm -rf installed # For Intel Macs: curl -L https://github.com/apache/doris-thirdparty/releases/download/automation/doris-thirdparty-prebuilt-darwin-x86_64.tar.xz \ -o - | tar -Jxf - # For Apple Silicon Macs: curl -L https://github.com/apache/doris-thirdparty/releases/download/automation/doris-thirdparty-prebuilt-darwin-arm64.tar.xz \ -o - | tar -Jxf - # Verify that protoc and thrift run correctly: cd installed/bin ./protoc --version ./thrift --version When running protoc and thrift, you might encounter issues opening them due to developer verification problems. In that case, navigate to Security & Privacy and click the Open Anyway button in the General tab to confirm that you want to open the binary. For more details, refer to Apple Support. Increase the System Maximum File Descriptor Limit After modifying, run source on the corresponding file to apply the changes. PowerShell # For bash: echo 'ulimit -n 65536' >>~/.bashrc # For zsh: echo 'ulimit -n 65536' >>~/.zshrc 2. Compiling Doris Navigate to your Doris home directory and run the build script: PowerShell cd $DORIS_HOME # Compile the entire Doris project: sh build.sh # Or compile only FE and BE: sh build.sh --fe --be If you want to speed up the build process and do not need the FE frontend page, you can comment out the FE UI build section in the build.sh script: Shell # FE UI must be built before building FE #if [[ "${BUILD_FE}" -eq 1 ]]; then # if [[ "${BUILD_UI}" -eq 1 ]]; then # build_ui # fi #fi After a successful compilation, the build artifacts are placed in the output/ directory. 3. Debugging Configuring the Debug Environment This guide covers debugging the Doris FE only. Plain Text # Copy the compiled package to a separate directory: cp -r output/ ../doris-run # Configure FE/BE settings: 1. Set the IP and directories. 2. For BE, add the extra configuration: min_file_descriptor_number = 10000. Start debugging using IntelliJ IDEA. Important: Do not open the root directory of the Doris project; instead, open the FE directory to avoid conflicts with CLion. Generating FE Code Open the IDEA terminal, navigate to the root directory of the code, and execute: PowerShell sh generated-source.sh Wait until you see the message "Done". Configuring Debug for FE 1. Edit configurations. 2. Add a DorisFE configuration. Click the + icon in the upper left to add an Application configuration and fill in the settings below. 3. Working directory. Set it to the fe directory within the source code. 4. Environment variables. Configure the environment variables similarly to those exported in fe/bin/start_fe.sh in the Doris root directory. The DORIS_HOME variable should point to the directory you copied earlier during setup.
Plain Text JAVA_OPTS=-Xmx8092m; LOG_DIR=/Users/abc/DorisDev/doris-run/fe/log; PID_DIR=/Users/abc/DorisDev/doris-run/fe/log; DORIS_HOME=/Users/abc/DorisDev/doris-run/fe Starting FE Click Run or Debug. This will trigger the build process for FE; once completed, the FE will start. In this guide, we choose Debug. Starting BE Since you have already copied the compiled package to the doris-run directory, start the BE from within that directory: PowerShell sh bin/start_be.sh --daemon Debugging FE 1. Connect to the FE. Use a MySQL client or DBeaver to connect to the FE launched by IDEA. MySQL mysql -uroot -h127.0.0.1 -P9030 2. Add the BE node to the cluster. MySQL alter system add backend "127.0.0.1:9050"; 3. Set breakpoints in code. Locate the ConnectProcessor class in the project and set a breakpoint at the handleQuery method. When you execute a query, the debugger will hit the breakpoint, and you can start an enjoyable debugging journey; a minimal smoke-test session is sketched after the FAQs below. For instance, if you are working on the Doris syntax migration task mentioned in previous sessions, you can use debugging to iteratively refine your code. FAQs Question 1 During compilation, you might encounter a lock conflict error: Plain Text Could not acquire lock(s) Answer Delete any .lock files in your local Maven repository by running: Plain Text find ~/.m2/repository -name "*.lock" -delete Question 2 During compilation, an error caused by a newer version of Node.js may occur: Plain Text opensslErrorStack: ['error:03000086:digital envelope routines::initialization error'] library: 'digital envelope routines' reason: 'unsupported' code: 'ERR_OSSL_EVP_UNSUPPORTED' Answer Set Node.js to use the legacy OpenSSL provider by executing: Plain Text # Instruct Node.js to use the legacy OpenSSL provider export NODE_OPTIONS=--openssl-legacy-provider Reference: StackOverflow Discussion Question 3 IntelliJ IDEA fails to start FE with the error: Plain Text java: OutOfMemoryError: insufficient memory Answer The Maven build process may not have enough memory. Increase the heap size available to the build process in IDEA's compiler settings. Question 4 IntelliJ IDEA fails to start FE with the error: Plain Text java: cannot find symbol Symbol: class GeneratedMemoPatterns Location: package org.apache.doris.nereids.pattern Answer Resolve this issue by executing the following commands in the Doris root directory: Plain Text mv fe/fe-core/target/generated-sources/annotations/org/apache/doris/nereids/pattern/ fe/fe-core/target/generated-sources/org/apache/doris/ mv fe/fe-core/target/generated-sources/cup/org/apache/doris/analysis/ fe/fe-core/target/generated-sources/org/apache/doris/ Question 5 In some versions, compilation may fail with the error: Plain Text error: reference to 'detail' is ambiguous Answer Modify the code according to this PR or execute the following commands: Plain Text wget https://github.com/apache/doris/pull/43868.patch git apply 43868.patch Question 6 In some versions during debugging, the FE on port 9030 fails to start, and fe.log reports: Plain Text Can not find help zip file: help-resource.zip Answer Navigate to the doris/docs directory, execute the following commands, and then restart FE: Plain Text cd doris/docs sh build_help_zip.sh cp -r build/help-resource.zip ../fe/fe-core/target/classes By following this guide, you should be able to set up your environment, compile, and debug Apache Doris with greater ease. Happy debugging!
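Once the FE and BE are up and the breakpoint is set, any statement sent over the MySQL protocol will pass through ConnectProcessor.handleQuery. A minimal smoke-test session you can run from the connected client follows; the database, table, and column names are arbitrary examples, and replication_num is set to 1 because only a single BE is registered:

SQL
CREATE DATABASE IF NOT EXISTS demo;
USE demo;
CREATE TABLE t1 (id INT, name VARCHAR(32))
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_num" = "1");
INSERT INTO t1 VALUES (1, 'doris');
SELECT * FROM t1;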
On one side, U.S. laws expand data access in the name of national security. On the other hand, French SecNumCloud ensures digital independence for European businesses. Let’s break down the implications of these two models on cybersecurity, compliance, and the protection of critical infrastructure. Part I - Context and Challenges of Data Sovereignty Introduction The USA PATRIOT Act and the French SecNumCloud framework reflect two opposing visions of digital data management. The United States prioritizes national security, with laws allowing extraterritorial access to data stored by American companies. In contrast, France and Europe promote a sovereign and secure approach. Together, they aim to protect sensitive data from foreign interference. The USA PATRIOT Act: Broad Government Access The USA PATRIOT Act was passed in 2001 after the September 11 attacks to expand government agencies' powers in surveillance and counterterrorism. In practice, it grants U.S. authorities broad surveillance capabilities, allowing access to data from companies under American jurisdiction, regardless of where it is stored. The adoption of the CLOUD Act in 2018 further strengthened this authority. It requires American companies to provide data upon request, even if the data is stored on servers located in Europe. The extraterritorial nature of these laws forces American companies to hand over data to U.S. authorities, including data stored in Europe. This creates a direct conflict with the GDPR. For European businesses using American cloud services, it opens the door to potential surveillance of their strategic and sensitive data. Beyond confidentiality concerns, this situation raises a real challenge to digital sovereignty, as it questions Europe’s ability to manage its own data independently and securely. SecNumCloud: Strengthening Digital Sovereignty In response to these challenges, France developed SecNumCloud, a cybersecurity certification issued by ANSSI (the National Cybersecurity Agency in France). It ensures that cloud providers adhere to strict security and data sovereignty standards. SecNumCloud-certified providers must meet strict requirements to safeguard data integrity and sovereignty against foreign interference. First, cloud infrastructure and operations must remain entirely under European control, ensuring no external influence — particularly from the United States or other third countries — can be exerted. Additionally, no American company can hold a stake or exert decision-making power over data management, preventing any legal obligation to transfer data to foreign authorities under the CLOUD Act. Just as importantly, clients retain full control over access to their data. They are guaranteed that their data cannot be used or transferred without their explicit consent. With these measures, SecNumCloud prevents foreign interference and ensures a sovereign cloud under European control, fully compliant with the GDPR. This allows European businesses and institutions to store and process their data securely, without the risk of being subject to extraterritorial laws like the CLOUD Act. SecNumCloud ensures strengthened digital sovereignty by keeping data under exclusive European jurisdiction, shielding it from extraterritorial laws like the CLOUD Act. This certification is essential for strategic sectors such as public services, healthcare, defense, and Operators of Vital Importance (OIVs), thanks to its compliance with the GDPR and European regulations. 
OIV (Operators of Vital Importance) OIVs refer to public or private entities in France deemed essential to the nation's functioning, such as energy infrastructure, healthcare systems, defense, and transportation. Their status is defined by the French Interministerial Security Framework for Vital Activities (SAIV), established in the Defense Code. OSE (Operators of Essential Services) Established under the EU NIS Directive (Network and Information Security), OSEs include companies providing critical services to society and the economy, such as banks, insurance providers, and telecommunications firms. Their reliance on information systems makes them particularly vulnerable to cyberattacks. Why It Matters OIVs and OSEs are central to national cybersecurity strategy in France. A successful attack on these entities could have major consequences for the country's infrastructure and economy. This is why strict regulations and regular monitoring are enforced to ensure their resilience against digital threats. GDPR and the AI Act: Safeguarding Digital Sovereignty The GDPR (General Data Protection Regulation) imposes strict obligations on businesses regarding data collection, storage, and processing, with heavy penalties for non-compliance. The AI Act, currently being adopted by the European Union, complements this framework by regulating the use of artificial intelligence to ensure ethical data processing and protect users. Together, these regulations play a key role in governing digital technologies and increase pressure on businesses to adopt cloud infrastructures that comply with European standards, further strengthening the continent's digital sovereignty. Part II - SecNumCloud: A Cornerstone of Digital Sovereignty Sovereign Cloud: Key Challenges and Considerations Cloud computing is a major strategic and economic issue. Dependence on American tech giants exposes European data to cybersecurity risks and foreign interference. To mitigate these risks, SecNumCloud ensures the protection of critical data and enforces strict security standards for cloud providers operating under European jurisdiction. SecNumCloud: Setting the Standard for Secure Cloud Services ANSSI designed SecNumCloud as a sovereign response to the CLOUD Act. Today, several French cloud providers, including Outscale, OVHcloud, and S3NS, have adopted this certification. SecNumCloud could serve as a blueprint for the EUCS (European Cybersecurity Certification Scheme for Cloud Services), which seeks to create a unified European standard for a sovereign and secure cloud. A Key Priority for the Public Sector and Critical Infrastructure Operators of Vital Importance (OIVs) and Operators of Essential Services (OSEs), which manage critical infrastructure (energy, telecommunications, healthcare, and transportation), are prime targets for cyberattacks. For example, in 2020, a cyberattack targeted a French hospital and paralyzed its IT infrastructure for several days. This attack jeopardized patient management. Using a sovereign cloud certified by SecNumCloud would have strengthened the hospital's protection against such an attack by providing better security guarantees and overall greater resilience against cyber threats. Building a European Sovereign Cloud As SecNumCloud establishes itself as a key framework in France, it could serve as a European model. Through the EUCS initiative, the European Union aims to set common standards for a secure and independent cloud, protecting sensitive data from foreign interference.
Within this framework, SecNumCloud goes beyond being just a technical certification. It aims to establish itself as a strategic pillar in strengthening Europe’s digital sovereignty and ensuring the resilience of its critical infrastructure. Conclusion The adoption of SecNumCloud is now a strategic priority for all organizations handling sensitive data. By ensuring protection against extraterritorial laws and full compliance with European regulations, SecNumCloud establishes itself as a key pillar of digital sovereignty. Thanks to key players like Outscale, OVH, and S3NS, France and Europe are laying the foundation for a sovereign, secure, and resilient cloud capable of withstanding foreign threats. One More Thing: A Delicate Balance Between Security and Sovereignty If digital sovereignty and data protection are priorities for Europe, it appears essential to place this debate within a broader context. U.S. Security Indeed, U.S. laws address legitimate security concerns. The United States implemented these laws in the context of counterterrorism and cybercrime prevention. The goal of the PATRIOT Act and the CLOUD Act is to enhance intelligence agency cooperation and ensure national security against transnational threats. In this context, American companies have little choice. Cloud giants like Microsoft, Google, and Amazon, to name a few, do not voluntarily enforce the CLOUD Act — they are legally required to comply. Even though they strive to ensure customer data confidentiality, they must adhere to U.S. government requests, even at the risk of conflicting with European laws such as the GDPR. EU Sovereignty Europe does not seek isolation but rather aims for self-reliance in security. The adoption of SecNumCloud and the GDPR is not about blocking American technologies, but about guaranteeing that European companies and institutions keep full authority over their sensitive data. This strategy ensures long-term technological independence while promoting collaboration that respects each region’s legal frameworks. This debate should not be seen as a confrontation between Europe and the United States, but rather as a global strategic challenge: how to balance international security and digital sovereignty in an increasingly interconnected world?