Code scanning to detect the exposure of security-sensitive parameters is a crucial practice in MuleSoft API development. Code scanning involves the systematic analysis of MuleSoft source code to identify vulnerabilities. These vulnerabilities can range from hardcoded secure parameters, such as a password or accessKey, to a password or accessKey exposed in plain text in property files. Such vulnerabilities might be exploited by malicious actors to compromise the confidentiality, integrity, or availability of the applications.

Lack of Vulnerability Auto-Detection

Neither MuleSoft Anypoint Studio nor the Anypoint Platform provides a feature to enforce governance over the vulnerabilities mentioned above. They can be handled through design-time governance, where a manual review of the code is needed. There are, however, many tools available that can scan the deployed code or the code repository to find such vulnerabilities. You can even write custom code or scripts in any language to perform the same task, although writing custom code adds another layer of complexity and manageability.

Using Generative AI To Review the Code for Detecting Vulnerabilities

In this article, I am going to present how Generative AI can be leveraged to detect such vulnerabilities. I have used the OpenAI foundation model "gpt-3.5-turbo" to demonstrate a code scan that finds the aforementioned vulnerabilities. However, any foundation model can be used to implement this use case, and the implementation can be written in Python or another language. The Python code can be used in the following ways:

Python code can be executed manually to scan the code repository.
It can be integrated into the CI/CD build pipeline, which can scan and report the vulnerabilities and fail the build if vulnerabilities are present.
It can be integrated into any other program, such as a Lambda function, which can run periodically, execute the Python code to scan the code repository, and report vulnerabilities.

High-Level Architecture

There are many ways to execute the Python code. A more appropriate and practical way is to integrate the Python code into the CI/CD build pipeline:

The CI/CD build pipeline executes the Python code.
The Python code reads the MuleSoft code XML files and property files.
The Python code sends the MuleSoft code content and a prompt to the OpenAI gpt-3.5-turbo model.
The OpenAI model returns the hardcoded and unencrypted values.
The Python code generates a report of the vulnerabilities found.

Implementation Details

The MuleSoft API project structure contains two major sections where security-sensitive parameters can be exposed as plain text:

The src/main/mule folder contains all the XML files, which hold process flows, connection details, and exception handling. A MuleSoft API project may also contain custom Java code; however, in this article, I have not considered custom Java code used in the MuleSoft API.
The src/main/resources folder contains environment property files. These can be .properties or .yaml files for development, quality, and production. They contain property key-value pairs, for example, user, password, host, port, accessKey, and secretAccessKey, which should be kept in an encrypted format.

Based on the MuleSoft project structure, implementation can be achieved in two steps.

MuleSoft XML File Scan

The actual code is defined as process flows in MuleSoft Anypoint Studio.
We can write Python code that uses the OpenAI foundation model with a prompt that scans the MuleSoft XML files containing the code implementation to find hardcoded parameter values. For example:

Global.xml/Config.xml file: This file contains all the connector configurations. This is the standard recommended by MuleSoft; however, it may vary depending on the standards and guidelines defined in your organization. A generative AI foundation model can use this content to find hardcoded values.
Other XML files: These files may contain custom code or process flows calling other systems through API calls, DB calls, or any other system call. These may have connection credentials hard-coded by mistake. A generative AI foundation model can use this content to find hardcoded values.

I have provided screenshots of a sample MuleSoft API project. This project has three XML files: api.xml, which contains the REST API flow; process.xml, which has a JMS-based asynchronous flow; and global.xml, which has all the connection configurations.

For demonstration purposes, I have used the global.xml file. The code snippet has many hardcoded values for demonstration; the hardcoded values are highlighted in red boxes.

Python Code

The Python code below uses the OpenAI foundation model to scan the above XML file and find the hard-coded values.

Python

import openai, os, glob
from dotenv import load_dotenv

load_dotenv()
APIKEY = os.getenv('API_KEY')
openai.api_key = APIKEY

file_path = "C:/Work/MuleWorkspace/test-api/src/main/mule/global.xml"
try:
    with open(file_path, 'r') as file:
        file_content = file.read()
        print(file_content)
except FileNotFoundError:
    print("File not found.")
except Exception as e:
    print("An error occurred:", e)

message = [
    {"role": "system", "content": "You will be provided with xml as input, and your task is to list the non-hard-coded values and hard-coded values separately. Example: a hard-coded value looks like this: name=\"value\". A non-hard-coded value looks like this: host=\"${host}\""},
    {"role": "user", "content": f"input: {file_content}"}
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=message,
    temperature=0,
    max_tokens=256
)
result = response["choices"][0]["message"]["content"]
print(result)

Once this code is executed, we get the result from the generative AI model.

Similarly, we can provide api.xml and process.xml to scan for hard-coded values. You can even modify the Python code to read all the XML files iteratively and get the results in sequence for all the files.

Scanning the Property Files

We can use the Python code to send another prompt to the AI model, which can find plain-text passwords kept in property files. In the dev-secure.yaml file shown in the screenshot, client_secret is an encrypted value, while db.password and jms.password are kept as plain text.

Python Code

The Python code below uses the OpenAI foundation model to scan the config files and find the unencrypted values.
Python

import openai, os, glob
from dotenv import load_dotenv

load_dotenv()
APIKEY = os.getenv('API_KEY')
openai.api_key = APIKEY

file_path = "C:/Work/MuleWorkspace/test-api/src/main/resources/config/secure/dev-secure.yaml"
try:
    with open(file_path, 'r') as file:
        file_content = file.read()
except FileNotFoundError:
    print("File not found.")
except Exception as e:
    print("An error occurred:", e)

message = [
    {"role": "system", "content": "You will be provided with a yaml file as input, and your task is to list the encrypted values and unencrypted values separately. Example: an encrypted value looks like this: \"![asdasdfadsf]\". An unencrypted value looks like this: \"sdhfsd\""},
    {"role": "user", "content": f"input: {file_content}"}
]

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=message,
    temperature=0,
    max_tokens=256
)
result = response["choices"][0]["message"]["content"]
print(result)

Once this code is executed, we get the result from the generative AI model.

Impact of Generative AI on the Development Life Cycle

We can see a significant impact on the development life cycle, and we can think of leveraging Generative AI for other use cases related to it.

Efficient and Comprehensive Analysis

Generative AI models like GPT-3.5 have the ability to comprehend and generate human-like text. When applied to code review, they can analyze code snippets, provide suggestions for improvements, and even identify patterns that might lead to bugs or vulnerabilities. This technology enables a comprehensive examination of code in a relatively short span of time.

Automated Issue Identification

Generative AI can assist in detecting potential issues such as syntax errors, logical flaws, and security vulnerabilities. By automating these aspects of code review, developers can allocate more time to higher-level design decisions and creative problem-solving.

Adherence To Best Practices

Through analysis of coding patterns and context, Generative AI can offer insights on adhering to coding standards and best practices.

Learning and Improvement

Generative AI models can "learn" from vast amounts of code examples and industry practices. This knowledge allows them to provide developers with contextually relevant recommendations. As a result, both the developers and the AI system benefit from a continuous learning cycle, refining their understanding of coding conventions and emerging trends.

Conclusion

In conclusion, conducting a code review to find security-sensitive parameters exposed as plain text using OpenAI's technology has proven to be a valuable and efficient process. Leveraging OpenAI for code review not only accelerates the review process but also contributes to producing more robust and maintainable code. However, it is important to note that while AI can greatly assist in the review process, human oversight and expertise remain crucial for making informed decisions and fully understanding the context of the code.
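To make the extensions mentioned earlier concrete (reading all the files iteratively and wiring the scan into a CI/CD build), here is a hedged sketch only: the repository path, the scan_file helper, the prompts, and the simple keyword check on the model's reply are hypothetical choices for illustration, not the article's actual pipeline code.

Python

import glob, os, sys
import openai
from dotenv import load_dotenv

load_dotenv()
openai.api_key = os.getenv('API_KEY')

# Hypothetical repository root; adjust to the project being scanned
repo_root = "C:/Work/MuleWorkspace/test-api"

def scan_file(path, system_prompt):
    """Send one file's content to the model and return its free-text findings."""
    with open(path, 'r') as f:
        content = f.read()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"input: {content}"},
        ],
        temperature=0,
        max_tokens=256,
    )
    return response["choices"][0]["message"]["content"]

xml_prompt = "List the hard-coded values and non-hard-coded values in this XML separately."
yaml_prompt = "List the encrypted values and unencrypted values in this file separately."

findings = {}
for path in glob.glob(f"{repo_root}/src/main/mule/*.xml"):
    findings[path] = scan_file(path, xml_prompt)
for path in glob.glob(f"{repo_root}/src/main/resources/**/*.yaml", recursive=True):
    findings[path] = scan_file(path, yaml_prompt)

for path, report in findings.items():
    print(f"--- {path} ---\n{report}\n")

# Naive gate: fail the CI build if any reply looks like it reported a finding.
# A real pipeline would want structured output and stricter parsing.
if any("hard-coded" in r.lower() or "unencrypted" in r.lower() for r in findings.values()):
    sys.exit(1)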
If you’re anything like me, you’ve noticed the massive boom in AI technology. It promises to disrupt not just software engineering but every industry. THEY’RE COMING FOR US!!! Just kidding ;P I’ve been bettering my understanding of what these tools are and how they work, and decided to create a tutorial series for web developers to learn how to incorporate AI technology into web apps. In this series, we’ll learn how to integrate OpenAI‘s AI services into an application built with Qwik, a JavaScript framework focused on the concept of resumability (this will be relevant to understand later). Here’s what the series outline looks like: Intro and Setup Your First AI Prompt Streaming Responses How Does AI Work Prompt Engineering AI-Generated Images Security and Reliability Deploying We’ll get into the specifics of OpenAI and Qwik where it makes sense, but I will mostly focus on general-purpose knowledge, tooling, and implementations that should apply to whatever framework or toolchain you are using. We’ll be working as closely to fundamentals as we can, and I’ll point out which parts are unique to this app. Here’s a little sneak preview. I thought it would be cool to build an app that takes two opponents and uses AI to determine who would win in a hypothetical fight. It provides some explanation and the option to create an AI-generated image. Sometimes the results come out a little wonky, but that’s what makes it fun. I hope you’re excited to get started because in this first post, we are mostly going to work on… Boilerplate :/ Prerequisites Before we start building anything, we have to cover a couple of prerequisites. Qwik is a JavaScript framework, so we will have to have Node.js (and NPM) installed. You can download the most recent version, but anything above version v16.8 should work. I’ll be using version 20. Next, we’ll also need an OpenAI account to have access to their API. At the end of the series, we will deploy our applications to a VPS (Virtual Private Server). The steps we follow should be the same regardless of what provider you choose. I’ll be using Akamai’s cloud computing services (formerly Linode). Setting Up the Qwik App Assuming we have the prerequisites out of the way, we can open a command line terminal and run the command: npm create qwik@latest. This will run the Qwik CLI that will help us bootstrap our application. It will ask you a series of configuration questions, and then generate the project for you. Here’s what my answers looked like: If everything works, open up the project and start exploring. Inside the project folder, you’ll notice some important files and folders: /src: Contains all application business logic /src/components: Contains reusable components to build our app with /src/routes: Responsible for Qwik’s file-based routing; Each folder represents a route (can be a page or API endpoint). To make a page, drop a index.{jsx|tsx} file in the route’s folder. /src/root.tsx: This file exports the root component responsible for generating the HTML document root. Start Development Qwik uses Vite as a bundler, which is convenient because Vite has a built-in development server. It supports running our application locally, and updating the browser when files change. To start the development server, we can open our project in a terminal and execute the command npm run dev. With the dev server running, you can open the browser and head to http://localhost:5173 and you should see a very basic app. 
Any time we make changes to our app, we should see those changes reflected almost immediately in the browser. Add Styling This project won’t focus too much on styling, so this section is totally optional if you want to do your own thing. To keep things simple, I’ll use Tailwind. The Qwik CLI makes it easy to add the necessary changes, by executing the terminal command, npm run qwik add. This will prompt you with several available Qwik plugins to choose from. You can use your arrow keys to move down to the Tailwind plugin and press Enter. Then it will show you the changes it will make to your codebase and ask for confirmation. As long as it looks good, you can hit Enter, once again. For my projects, I also like to have a consistent theme, so I keep a file in my GitHub to copy and paste styles from. Obviously, if you want your own theme, you can ignore this step, but if you want your project to look as amazing as mine, copy the styles from this file on GitHub into the /src/global.css file. You can replace the old styles, but leave the Tailwind directives in place. Prepare Homepage The last thing we’ll do today to get the project to a good starting point is make some changes to the homepage. This means making changes to /src/routes/index.tsx. By default, this file starts out with some very basic text and an example for modifying the HTML <head> by exporting a head variable. The changes I want to make include: Removing the head export Removing all text except the <h1>; Feel free to add your own page title text. Adding some Tailwind classes to center the content and make the <h1> larger Wrapping the content with a <main> tag to make it more semantic Adding Tailwind classes to the <main> tag to add some padding and center the contents These are all minor changes that aren’t strictly necessary, but I think they will provide a nice starting point for building out our app in the next post. Here’s what the file looks like after my changes. import { component$ } from "@builder.io/qwik"; export default component$(() => { return ( <main class="max-w-4xl mx-auto p-4"> <h1 class="text-6xl">Hi [wave emoji]</h1> </main> ); }); And in the browser, it looks like this: Conclusion That’s all we’ll cover today. Again, this post was mostly focused on getting the boilerplate stuff out of the way so that the next post can be dedicated to integrating OpenAI’s API into our project. With that in mind, I encourage you to take a moment to think about some AI app ideas that you might want to build. There will be a lot of flexibility for you to put your own spin on things. I’m excited to see what you come up with, and if you would like to explore the code in more detail, I’ll post it on my GitHub account.
Artificial Intelligence (AI), once just a notion from the realms of future prediction, has become an indispensable element of our day-to-day existence, significantly revolutionizing industries worldwide. A prime example of an arena thoroughly transformed by AI is software development. Currently, the inclusion of AI capabilities into software development endeavors isn't merely a fancy addition but a requisite that brings a plethora of advantages. By employing AI, software developers have the capacity to augment application functionality, automate repetitive tasks, enrich user experiences, and even foresee upcoming trends and patterns. This article endeavors to offer a deeper understanding of how AI can be assimilated into your existing software development projects, thereby fostering innovation, streamlining procedures, and in the grand scheme, forging more sturdy and intuitive software solutions. Demystifying AI and Unveiling Its Potentials Artificial Intelligence, often abbreviated as AI, isn't just a trendy buzzword. It's a distinct field within computer science that equips machines with abilities akin to human intelligence. The intention isn't to conjure up visions of a sci-fi landscape but to amplify the potential of your software. The canvas of AI is painted with various hues. Consider Machine Learning (ML), a segment of AI that allows your software to learn and enhance its performance based on experiences without explicit programming. It's akin to envisioning your software as a sentient being capable of self-improvement and adaptation. Next, we encounter Natural Language Processing (NLP), the element of AI that imparts your software with the ability to comprehend, process, and generate human language. The result? Your application can converse with users as effortlessly as if it were a human companion. Finally, we reach Deep Learning, a sophisticated type of machine learning that deploys neural networks to mimic human decision-making processes. It's akin to infusing your software with an additional layer of intellect. Is AI a Good Fit for Your Software Project? Deciding whether to weave AI into your project isn't a spur-of-the-moment decision. It's a strategic move that demands careful thinking. Start by evaluating the project's essence. What's it all about? What problems is it solving? Can AI really add value, or is it just an attractive add-on? For instance, AI could enhance its predictive capabilities if your software project involves data analysis. If it's about customer interaction, AI-powered chatbots might be a game-changer. The key is identifying whether AI can help your software deliver smarter, more efficient, and more personalized experiences. If it can, then that's your green light! Choosing the Right AI Tools and Platforms Equipping yourself with the right AI tools and platforms is like setting out on a treasure hunt. You need to find that perfect blend of utility and ease of use that fits just right with your project's needs and your team's skillset. Start by assessing your project requirements. What kind of AI functionality are you looking for? Then evaluate your team's expertise. Are they comfortable with high-level platforms or prefer working with more detailed, lower-level tools? There's a whole universe of AI platforms out there. From Google's TensorFlow, an open-source library for high-performance numerical computation, to IBM's Watson, known for its enterprise-grade AI services. 
There's also Azure's AI platform that comes with robust machine learning capabilities, and let's not forget about Amazon's SageMaker for developers who prefer a fully managed service. However, it's not about the brand name but what suits your project and team best. Steps for Integrating AI Into Your Software Development Project Bringing AI into your project may seem like a monumental task, but it becomes an intriguing journey when broken down into manageable steps. Here's a strategic roadmap: 1. Identify the Opportunities Start by figuring out where AI can make a difference. Perhaps it's automating a routine task, enhancing data analysis, or personalizing user interactions. 2. Prepare Your Data AI thrives on data. Gather your data, clean it, and structure it in a format that the AI tools can ingest. Below is an example of importing and preparing data for the AI software project using Python and Pandas. Let's assume the CSV file named 'your_data.csv': Python import pandas as pd # Load your data from a CSV file data = pd.read_csv('your_data.csv') # Display the first few rows of the dataframe print(data.head()) # Clean your data: remove or fill any NaN or missing values # This is a simple example, real-world data cleaning might involve more complex procedures data = data.dropna() # This line removes any rows containing missing values # Alternatively, you can replace missing values with a filler value. For instance, replacing missing values with the mean: # data = data.fillna(data.mean()) # Display the first 5 rows of the cleaned dataframe print(data.head()) This script reads data from a CSV file into a Pandas DataFrame, a 2-dimensional labeled data structure with columns potentially of different types. It then cleans the data by removing any rows with missing data. Real-world data cleaning could involve more complex procedures depending on the nature and structure of your data. Finally, the cleaned data is printed out for verification. The specific data preprocessing steps will depend on your dataset and the specific requirements of your AI model. Different models might require different types of preprocessing. 3. Train Your AI Models Use your data to train your AI models. This is where ML algorithms come into play. The selection of appropriate models for a given project is contingent upon specific requirements and objectives. Depending on the nature of the project, various machine learning methodologies, such as regression, classification, or clustering models, can be employed. Each of these techniques serves distinct purposes and caters to different data types and tasks, offering versatility and flexibility in addressing diverse challenges encountered during the project. Consequently, carefully analyzing the project's characteristics and goals is crucial in determining the most suitable model for optimal performance and effective outcomes. Below is an example of a strategic deep-learning model using TensorFlow and Keras in a hypothetical software project. This model will be a multi-layer perceptron that can be used for the binary classification problem. 
Here is a full code example: Python # import libraries import numpy as np from tensorflow.keras.models import Sequential from tensorflow.keras.layers import Dense # Assuming we have some data # Usually this data would be loaded or generated in a real-world scenario n_features = 10 X_train = np.random.random_sample((1000, n_features)) y_train = np.random.randint(2, size=(1000, 1)) X_test = np.random.random_sample((100, n_features)) y_test = np.random.randint(2, size=(100, 1)) # Define the model model = Sequential() model.add(Dense(64, input_dim=n_features, activation='relu')) model.add(Dense(64, activation='relu')) model.add(Dense(1, activation='sigmoid')) # Compile the model model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy']) # Train the model model.fit(X_train, y_train, epochs=100, batch_size=64) # Evaluate the model loss, accuracy = model.evaluate(X_test, y_test) print(f"Model Accuracy: {accuracy * 100}%") This code first imports necessary libraries and generates random training and testing data. After that, a model with two hidden layers is defined and compiled. The model is then trained with the available data and evaluated using cross-validation techniques using validation and test sets. This is a strategic example of integrating a deep-learning model into a software project. The choice of the model, its training methodology (batch vs. sequential), and how it's evaluated will depend largely on the nature of the specific project. 4. Testing: Ensuring AI Model Performance and Robustness The process of embedding AI into software development endeavors involves a crucial component: rigorous testing of the deployed models. This critical stage necessitates the verification of your AI models' capability to yield expected results with consistent reliability. Exhaustive testing offers indispensable insights into the model's performance, unveiling potential weaknesses that may not be apparent during developmental phases. Effective testing often hinges upon adopting proven methodologies from the sphere of data science. A keystone amongst these is the method of cross-validation. By leveraging varied subsets of the data for training and testing the model, cross-validation fortifies the credibility of the results. It offers a more intricate understanding of the model's performance when faced with unfamiliar data. It's also prudent to subject the AI models to a multitude of parameters and scenarios during testing. Deploy diverse data inputs to observe the model's reactions under disparate conditions, including outlier cases. This kind of testing can spotlight the regions where your model excels and those requiring additional calibration. It also serves to reinforce your model's resilience in delivering reliable results across varied real-world situations. A thoroughly tested AI model becomes vital in fabricating a software solution that extends beyond mere functionality, offering a trustworthy and user-friendly experience. 5. Deploy and Monitor The final stage involves integrating the AI model into the project and consistently monitoring its performance. It's vital to recognize that the integration of a machine learning model, such as the one shown earlier, within a software application necessitates a series of steps. Key to this understanding is that while the machine learning model forms a component of the application, often acting as the central decision-making entity, the broader software application remains the expansive system with which users interact. 
To illustrate, let's consider a simplified example of how a deep learning model might be integrated within a software application, such as a web-based application. The trained model described above is saved as follows: Python model.save('my_model.h5') # saves the model in HDF5 format 6. Integrate the AI Model With Your Software Application This would be an application written in a language of your choice. Let's say the situation of creating a Python-based web application using Flask. Below is the strategic structure of a Flask application: Python from flask import Flask, request from tensorflow.keras.models import load_model import numpy as np app = Flask(__name__) model = load_model('my_model.h5') @app.route('/predict', methods=['POST']) def predict(): data = request.get_json(force=True) prediction = model.predict([np.array(data['inputs'])]) output = prediction[0] return str(output) if __name__ == '__main__': app.run(port=8888, debug=True) To create the simple Flask application, the saved model is first loaded to start the application. The function `predict` is mapped to the `/predict` route, crafted to accept POST requests. This function pulls input data from the inbound request, introduces it to the model to generate a prediction, and subsequently returns this prediction as a response. Subsequently, a front-end web page could be developed, enabling users to provide inputs, triggering a POST request to your `/predict` endpoint upon a button click, and ultimately, displaying the prediction. While this is a straightforward example, real-world applications demand attention to facets such as data preprocessing, error management, and perhaps recurrent model retraining with fresh data over time. Additional security measures might be required for the application, along with the capacity to handle larger request volumes or scalability across multiple servers. Each of these considerations introduces an extra layer of complexity to the project. However, this journey is iterative. AI models undergo continuous evolution and refinement as data accumulates and user understanding deepens. Key Challenges Integrating AI into your software development project can present several challenges. Data Privacy and AI Data privacy is one of the major concerns when working with AI, especially in sectors like life science, healthcare, finance, fintech, retail, or any user-centric application. Ensuring that your AI solutions comply with regulations, such as the GDPR in Europe or the CCPA in California, and respecting user privacy is crucial. This challenge can be navigated by implementing robust data management strategies that prioritize security. These may include anonymizing data, implementing proper access controls, and conducting regular audits. The Need for Specialized Skills AI and machine learning are specialized fields requiring a distinct set of skills. The team needs to understand various AI algorithms, model training, testing, and optimization, and the resources may also need to handle large data sets efficiently. To overcome this challenge, consider investing in training for the team or bringing in AI specialists. Managing the Complexity of AI Models AI models, specifically those involving deep learning, can pose significant complexity and computation. This can complicate their management and integration into pre-existing software initiatives. Furthermore, the results derived from AI models aren't always comprehensible, which can be troublesome in sectors where interpretability is key. 
To counter this, it is advisable to commence with less complex, more comprehensible, and explicable models, then progressively transition to intricate models as required. In addition, consider employing model explainability aspects that can render the results of your AI models more decipherable. Despite the potential obstacles posed by these challenges, they should not dissuade researchers and practitioners from integrating AI into their endeavors. Rather, a methodical approach, continuous learning, and a diligent focus on data management can effectively overcome these obstacles and harness the full potential of AI in software development projects. Through strategic planning and an unwavering commitment to mastering AI technologies, researchers can navigate the complexities and attain successful AI integration, thereby driving innovation and realizing enhanced software solutions. Conclusion In conclusion, the undeniable potential of AI integration in software development presents transformative possibilities. The incorporation of AI capabilities into projects holds the promise of significantly enhancing functionality, streamlining processes, and fostering novel opportunities for innovation. Nevertheless, it is important to acknowledge that this endeavor is not without challenges. Attending to data privacy concerns, fostering essential skills, and adeptly managing the intricacies of AI models necessitate meticulous planning and execution. By embracing a comprehensive and strategic approach, practitioners can effectively navigate these challenges, harnessing AI's power to drive meaningful advancements in the field of software development.
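To round out the deployment example from the integration steps above, here is a small, hedged sketch of how a client could exercise the /predict endpoint exposed by the earlier Flask application. The sample feature values are made up and only need to match the shape the example model was trained on (ten features); the port matches the Flask snippet shown earlier.

Python

import requests

# Ten made-up feature values, matching the n_features=10 used when training the example model
payload = {"inputs": [0.42, 0.13, 0.77, 0.05, 0.91, 0.33, 0.58, 0.24, 0.66, 0.10]}

# The example Flask app listens on port 8888 and exposes /predict
response = requests.post("http://localhost:8888/predict", json=payload)
print("Predicted probability:", response.text)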
In my previous post, I wrote about finding your competitive advantage as a software developer in a world of encroaching AI. I believe there are few questions more relevant in the coming decade, so I would like to elaborate on the topic in this and my next posts. You can only maintain such a competitive advantage if you focus on tasks that humans, by their very nature, can do better than machines. There are two necessary properties of software development that illustrate and coincide with the relative strengths and weaknesses of humans versus computers. They are efficiency and effectiveness, and they will be the focus of today’s post. I give credit to Uwe Friedrichsen for pointing out this distinction in a recent blog. Chess and Submarines Who would be among the most esteemed persons alive, in science, art, and sports? A poll taken in 1985 might include physicist Richard Feynman, actor Meryl Streep, philosopher Susan Sontag, and probably also Garry Kasparov, the young Russian chess grandmaster and new world champion. Several years before, you could already practice chess against a variety of digital opponents. The algorithms in these toy devices must seem unsophisticated now, and their feeble computational horsepower laughable. But an early AI they certainly were, even if they were no match for serious human players, least of all a grandmaster. Moore’s Law quickly took care of that. In little over a decade, in 1997, the same world champion lost to IBM’s Deep Blue. This 25-year-old defeat of man by machine did not make us unduly worried that AI would soon take over in some Terminator-like dystopia. If anything, it confirmed that chess is hard for humans and easy for computers. “The question of whether a computer can think is no more interesting than the question of whether a submarine can swim," according to Professor Dijkstra. True, you cannot compare a computer winning at chess with the way Magnus Carlsen does it. The way a nuclear submarine displaces water is equally incomparable to how a shark does it. Machine intelligence, whatever way you define it, is only externally analogous to what goes on in a human brain. It’s the effect that counts. Yet I disagree with the suggestion that the question of a thinking computer is uninteresting. The Journey and the Destination According to the Longman dictionary, effectiveness is “the fact of producing the result that is wanted or intended; the fact of producing a successful result.” Checkmate is such a successful result. Efficiency is “the quality of doing something well with no waste of time or money.” Of the infinite number of possible matches, fool’s mate is the most efficient one. Apropos: note the purely economic thinking in the definition. Only a “waste of time or money” is considered. Had energy been a factor, no human invention would be efficient compared to its biological counterparts in terms of raw energy consumption. Once the computer can do a job effectively as well as efficiently, humans have made themselves redundant from a cold economic perspective. You’re welcome to dabble for your own enjoyment, but don’t expect a paycheck. People still enjoy chess because it was never a means of survival. We don’t need it to grow food or build houses, and only a small elite of the best players make money from it. Had chess been a production resource (like land and labor), human chess players would have no competitive advantage. Being effective means arriving at a satisfactory result. We either have it, or we don’t.
You can exceed expectations, or fail miserably to any degree, but it’s the binary thumbs-up or thumbs-down that counts. Efficiency is about how you perform the constituent tasks. It’s a quality that allows more leeway. Parts of the process can be less optimal than others and still contribute to an effective result. Other terms that coincide with the dichotomy are the what (and why) versus the how. It’s doing the right thing versus doing things right. It’s the destination versus the journey. Both are necessary. Being consistently inefficient will bankrupt you before long, but being ineffective won’t even get you a first sale. Therefore, no serious software project should be undertaken without a clear definition of the effectiveness we seek to achieve. Only improvisation genius Keith Jarrett could get behind a keyboard without a goal in mind and still produce a masterpiece – but that was a piano keyboard. Efficiency Is Relative, Effectiveness Subjective Effectiveness is fuzzy, unpredictable, and subjective. What do Seinfeld, Monty Python, Bohemian Rhapsody, and Star Trek have in common? They are all beloved popular classics that most people, including the critics, didn’t much notice or appreciate when they first came out. It’s no wonder many products fail, no matter how much market research you throw at it. There is no formula for creativity and no telling how people’s tastes can change. Efficiency is far less infuriating. The customer couldn’t care less what build tool or IDE the Spotify app was made with. It leaves no trace in the final product. Modern products are a complicated assemblage of parts from various suppliers. We choose parts for their efficiency in the hope they make an effective product that customers want to buy. But while each part is already effective from the supplier’s perspective when they make a sale, it only becomes efficient when it’s the right part. This extends to more than physical parts. Here's a real-world example of semi-conductor giant ASML, near my hometown. Morning traffic to the campus is hell, and public transport is only served by buses. The city council wants to lay a new bicycle lane to entice commuters living in a ten-mile radius to cycle to work. The goal is to get everybody to the office safely and fast by relieving the congestion for whom bicycles are not an option. The contractors building this road have no urgent stake in this efficiency drive, much less bicycle manufacturer Trek or Shimano, who supplies the gears and brakes to make the bikes run smoothly. But they all contribute to the effectiveness of the new road. Efficiency is relative, and effectiveness subjective. You get the point. Software is no different. The average enterprise product consists mostly of other people’s code you and your team can’t control (99.9% is a safe bet, certainly if you count the cloud stack you deploy to). Any component can be efficient in some places and useless elsewhere. A highly optimized caching mechanism backed by a dedicated Oracle enterprise is still a waste of money if you only need it to remember fifty numbers for an hour. Efficiency is about experimentation, making small tweaks, and swapping out slow/expensive components for better-performing ones. It’s a complicated, data-driven domain where computers feel right at home. Judging the effectiveness of software on the other hand is complex: it comes down to whether it ultimately makes the recipients happy. Who else but other people are qualified to answer that question and make the decisions? 
In the next post, I will focus on requirements/specification versus implementation and how they coincide with the efficiency/effectiveness distinction. In part three, I’ll discuss the famous alignment problem. Even humans fail at aligning software goals with their own interests and build expensive failures. How can we expect machines to do better? Then in part four, I will address our love of specialization, coding for coding’s sake, and why that will no longer be a competitive advantage soon.
There are many libraries out there that can be used in machine learning projects. Of course, some of them have gained considerable reputations through the years. Such libraries are the straight-away picks for anyone starting a new project that utilizes machine learning algorithms. However, choosing the correct set (or stack) may be quite challenging.

The Why

In this post, I would like to give you a general overview of the machine learning libraries landscape and share some of my thoughts about working with them. If you are starting your journey with machine learning libraries, my text can give you some general knowledge of them and provide a better starting point for learning more. The libraries described here will be divided by the role they can play in your project. The categories are as follows:

Model creation - Libraries that can be used to create machine learning models
Working with data - Libraries that can be used for feature engineering, feature extraction, and all other operations that involve working with features
Hyperparameter optimization - Libraries and tools that can be used for optimizing model hyperparameters
Experiment tracking - Libraries and tools used for experiment tracking
Problem-specific libraries - Libraries that can be used for tasks like time series forecasting, computer vision, and working with spatial data
Utils - Not strictly machine learning libraries, but nevertheless ones I found useful in my projects

Model Creation

PyTorch

Developed by people from Facebook and open-sourced in 2017, it is one of the most famous machine learning libraries on the market, based on the open-source Torch package. The PyTorch ecosystem can be used for all types of machine learning problems and has a great variety of purpose-built libraries like torchvision or torchaudio. The basic data structure of PyTorch is the Tensor object, which is used to hold the multidimensional data utilized by our model. It is similar in its conception to the NumPy ndarray. PyTorch can also use computation accelerators; it supports CUDA-capable NVIDIA GPUs, ROCm, Metal API, and TPUs. The most important part of the core PyTorch library is the nn module, which contains layers and tools to easily build complex models layer by layer.

class NeuralNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

An example of a simple neural network with 3 linear layers in PyTorch

Additionally, PyTorch 2.0 is already released, and it makes PyTorch even better. Moreover, PyTorch is used by a variety of companies like Uber, Tesla, and Facebook, just to name a few.

PyTorch Lightning

It is a sort of “extension” for PyTorch which aims to greatly reduce the amount of boilerplate code needed to utilize our models. Lightning is based on the concept of hooks: functions called at specific phases of the model train/eval loop. Such an approach allows us to pass callback functions executed at a specific time, like the end of a training step. The Lightning Trainer automates many functions that one has to take care of in PyTorch; for example, the training loop, hardware calls, or zeroing gradients. Below are roughly equivalent code fragments of PyTorch and PyTorch Lightning.
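The original side-by-side image is not reproduced here. As a rough stand-in (a hedged sketch of my own, not the article's figure), a minimal LightningModule wrapping the network shown above might look like this; the optimizer and loss choices are illustrative assumptions.

import torch
from torch import nn
import pytorch_lightning as pl

class LitNeuralNetwork(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(28 * 28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10),
        )

    def forward(self, x):
        return self.linear_relu_stack(self.flatten(x))

    def training_step(self, batch, batch_idx):
        # Lightning calls this hook for every batch; no manual loop, device moves, or zero_grad needed
        x, y = batch
        loss = nn.functional.cross_entropy(self(x), y)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)

# Training is then driven by the Trainer instead of a hand-written loop, e.g.:
# pl.Trainer(max_epochs=3).fit(LitNeuralNetwork(), train_dataloader)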
TensorFlow

A library developed by the Google Brain team, originally released in 2015 under the Apache 2.0 License; version 2.0 was released in 2019. It provides clients in Java, C++, Python, and even JavaScript. Similar to PyTorch, it is widely adopted throughout the market and used by companies like Google (surprise), Airbnb, and Intel. TensorFlow also has quite an extensive ecosystem built around it by Google. It contains tools and libraries such as an optimization toolkit, TensorBoard (more about it in the "Experiment Tracking" section below), or recommenders. The TensorFlow ecosystem also includes a web-based sandbox to play around with your model’s visualization. Again, the tf.nn module plays the most vital part, providing all the building blocks required to build machine learning models. TensorFlow uses its own Tensor (flow ;p) object for holding the data utilized by deep learning models. It also supports all the common computation accelerators like CUDA or ROCm (community), Metal API, and TPU.

Keras

It is a library similar in meaning and inception to PyTorch Lightning, but for TensorFlow. It offers a more high-level interface over TensorFlow. Developed by François Chollet and released in 2015, it provides only a Python client. Keras also has its own set of problem-specific libraries like KerasCV or KerasNLP for more specialized use cases. Before version 2.4, Keras supported more backends than just TensorFlow, but after that release, TensorFlow became the only supported backend. As Keras is just an interface for TensorFlow, it shares the same base concepts as its underlying backend. The same holds true for supported computation accelerators. Keras is used by companies like IBM, PayPal, and Netflix.

class NeuralNetwork(models.Model):
    def __init__(self):
        super().__init__()
        self.flatten = layers.Flatten()
        self.linear_relu_stack = models.Sequential([
            layers.Dense(512, activation='relu'),
            layers.Dense(512, activation='relu'),
            layers.Dense(10)
        ])

    def call(self, x):
        x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

Note that in TensorFlow and Keras, we use the Dense layer instead of the Linear layer used in PyTorch. We also use the call method instead of the forward method to define the forward pass of the model.

PyTorch vs. TensorFlow

I wouldn’t be fully honest if I did not introduce some comparison between these two. As you could read a moment before, both of them are quite similar in their offered features and the ecosystem around them. Of course, there are some minor differences and quirks in how both work or the features they provide. In my opinion, these are more or less insignificant. The real difference between them comes from their approach to defining and executing the computational graphs of machine and deep learning models. PyTorch uses dynamic computational graphs, which means that the graph is defined on the fly during execution.
This allows for more flexibility and intuitive debugging, as developers can modify the graph at runtime and easily inspect intermediate outputs. On the other hand, this approach may be less efficient than static graphs, particularly for complex models. However, PyTorch 2.0 attempts to address these issues via torch.compile and FX graphs. TensorFlow uses static computational graphs, which are compiled before execution. This allows for more efficient execution, as the graph can be optimized and parallelized for the target hardware. However, it can also make debugging more difficult, as intermediate outputs are not readily accessible. Another noticeable difference is that PyTorch seems to be more low-level than Keras while being more high-level than pure TensorFlow. Such a setting makes PyTorch more elastic and easier to use for making tailored models with many customizations. As a side note, I would like to add that both libraries are on equal terms in terms of market share. Additionally, despite the fact that TensorFlow uses the call method and PyTorch uses the forward method, both libraries support call semantics as a shorthand for model(x).

Working With Data

pandas

A library that you must have heard of if you are using Python; it is probably the most famous Python library for working with data of any type. It was originally released in 2008, and version 1.0 arrived in 2020. It provides functions for filtering, aggregating, and transforming data, as well as merging multiple datasets. The cornerstone of this library is the DataFrame object, which represents a multidimensional table of any type of data. The library heavily focuses on performance, with some parts written in pure C to boost it. Besides being performance-focused, pandas provides a lot of features related to:

Data cleaning and preprocessing, such as removing duplicates and filling null or NaN values
Time series analysis, such as resampling, windowing, and time shifts

Additionally, it can perform a variety of input/output operations:

Reading from/to .csv or .xlsx files
Performing database queries
Loading data from GCP BigQuery (with the help of pandas-gbq)

NumPy

It is yet another famous library for working with data, mostly of a numeric, scientific kind. The most famous part of NumPy is the ndarray, a structure representing a multidimensional array of numbers. Besides ndarray, NumPy provides a lot of high-level mathematical functions and operations for working with this data. It is also probably the oldest library in this set, as the first version was released in 2005. It was implemented by Travis Oliphant, based on an even older library called Numeric (released in 1996). NumPy is extremely focused on performance, with contributors trying to optimize more and more of the implemented algorithms to reduce the execution time of NumPy functions even further. Of course, as with all libraries described here, NumPy is also open source and uses a BSD license.

SciPy

It is a library focused on supporting scientific computations. It is even older than NumPy (released in 2005), having been released in 2001. It is built atop NumPy, with ndarray being the basic data structure used throughout SciPy. Among other things, the library adds functions for optimization, linear algebra, signal processing, interpolation, and sparse matrix support. In general, it is more high-level than NumPy and thus can provide more complex functionality.
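Before moving on, here is a small, hedged sketch tying together a few of the pandas features listed above (cleaning, resampling, and time shifting). The file name and column names are hypothetical and only serve as an illustration.

import pandas as pd

# Hypothetical CSV with a timestamp column and a numeric reading
df = pd.read_csv("sensor_readings.csv", parse_dates=["timestamp"])

# Data cleaning: drop duplicates and fill missing values
df = df.drop_duplicates()
df["value"] = df["value"].fillna(df["value"].mean())

# Time series analysis: index by time, resample to hourly means, and shift by one period
df = df.set_index("timestamp")
hourly = df["value"].resample("1H").mean()
shifted = hourly.shift(1)

print(hourly.head())
print(shifted.head())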
Hyperparameter Optimization

Ray Tune

It is part of the Ray toolset, a bundle of related libraries for building distributed applications with a focus on machine learning and Python. The Tune part of the library is focused on providing hyperparameter optimization features through a variety of search algorithms; for example, grid search, Hyperband, or Bayesian optimization. Ray Tune can work with models created in most programming languages and libraries available on the market. All of the libraries described in the paragraph about model creation are supported by Ray Tune. The key concepts of Ray Tune are:

Trainables - Objects passed to Tune runs; they are the model whose parameters we want to optimize
Search space - Contains all the values of hyperparameters we want to check in the current trial
Tuner - An object responsible for executing runs; calling tuner.fit() starts the process of searching for the optimal hyperparameter set. It requires passing at least a trainable object and a search space
Trial - Each trial represents a particular run of a trainable object with an exact set of parameters from a search space. Trials are generated by the Ray Tune Tuner. As it represents the output of running the tuner, a Trial contains a ton of information, such as the config used for that particular trial, the trial ID, and many others
Search algorithms - The algorithm used for a particular execution of Tuner.fit; if not provided, Ray Tune will use RandomSearch as the default
Schedulers - Objects that are in charge of managing runs. They can pause, stop, and rerun trials within the run, which can result in increased efficiency and a reduced run time. If none is selected, Tune will pick FIFO as the default, and runs will be executed one by one, as in a classic queue
Run analysis - The object containing the results of a Tuner.fit execution in the form of a ResultGrid object. It contains all the data related to the run, like the best result among all trials or the data from all trials

BoTorch

BoTorch is a library built atop PyTorch and is part of the PyTorch ecosystem. It focuses solely on providing hyperparameter optimization through Bayesian optimization. As the only library in this section designed to work with a specific model library, it may be problematic to use BoTorch with libraries other than PyTorch. It is also the only library here that is currently in beta and under intensive development, so some unexpected problems may occur. The key feature of BoTorch is its integration with PyTorch, which greatly improves the ease of interaction between the two.

Experiment Tracking

Neptune.ai

It is a web-based tool that serves both as an experiment tracking tool and a model registry. The tool is cloud-based in the classic SaaS model, but if you are determined enough, there is the possibility of using a self-hosted variant. It provides a dashboard where you can view and compare the results of training your models. It can also be used to store the parameters used for particular runs. Additionally, you can easily version the dataset used for particular runs and all the metadata you think may come in handy. Moreover, it enables easy version control for your models. Obviously, the tool is library agnostic and can host models created using any library. To make integration possible, Neptune exposes a REST-style API with its own client. The client can be downloaded and installed via pip or any other Python dependency tool. The API is decently documented and quite easy to grasp.
The tool is paid and has a simple pricing plan divided into three tiers. Yet if you need it just for a personal project, or you are in a research or academic unit, you can apply to use the tool for free. Neptune.ai is a new tool, so some features known from other experiment tracking tools may not be present. Despite that fact, Neptune.ai support is keen to react to user feedback and implement missing functionality - at least, that was the case for us, as we use Neptune.ai extensively in The Codos Project.

Weights & Biases

Also known as WandB or W&B, this is a web-based tool that exposes all the functionality needed to be used as an experiment tracking tool and model registry. It exposes a more or less similar set of functionalities as Neptune.ai. However, Weights & Biases seems to have better visualizations and, in general, is a more mature tool than Neptune. Additionally, WandB seems to be more focused on individual projects and researchers, with less emphasis on collaboration. It also has a simple pricing plan divided into three tiers, with a free tier for private use. Yet W&B has the same approach as Neptune to researchers and academic units - they can always use Weights & Biases for free. Weights & Biases also exposes a REST-like API to ease the integration process. It seems to be better documented and offers more features than the one exposed by Neptune.ai. What is curious is that they also expose a client library written in Java - if, for some reason, you wrote your machine learning model in Java instead of Python.

TensorBoard

It is a dedicated visualization toolkit for the TensorFlow ecosystem. It is designed mostly to work as an experiment tracking tool with a focus on metrics visualization. Despite being a dedicated TensorFlow tool, it can also be used with Keras (not surprisingly) and PyTorch. Additionally, it is the only free tool of the three described in this section. Here you can host and track your experiments. However, TensorBoard is missing the functionality of a model registry, which can be quite problematic and force you to use some third-party tool to cover this missing feature. In any case, there is surely a tool for this in the TensorFlow ecosystem. As it is directly part of the TensorFlow ecosystem, its integration with Keras or TensorFlow is much smoother than for either of the previous two tools.

Problem-Specific Libraries

tsaug

One of only a few libraries for the augmentation of time series, it is an open-source library created and maintained by a single person under the GitHub name nick tailaiw, released in 2019, and currently at version 0.2.1. It provides a set of 9 augmentations, Crop, AddNoise, and TimeWarp among them. The library is reasonably well documented for such a project and easy to use from a user perspective. Unfortunately, for unknown reasons (at least to me), the library seems dead and has not been updated for 3 years. There are many open issues, but they are not getting any attention. Such a situation is quite sad, in my opinion, as there are not many other libraries that provide augmentation for time series data. However, if you are looking for time series data mining or augmentation and want to use a more up-to-date library, Tsfresh may be a good choice.

OpenCV

It is a library focused on providing functions for image processing and computer vision. Developed by Intel, it is now open source under the Apache 2 license.
OpenCV provides a set of functions related to image and video processing, image classification, data analysis, and tracking, alongside ready-made machine learning models for working with images and video. If you want to read more about OpenCV, my coworker, Kamil Rzechowski, wrote an article that describes the topic quite extensively.

GeoPandas

It is a library built atop pandas that aims to provide functions for working with spatial data and structures. It allows easy reading and writing of data in GeoJSON and shapefile formats, or reading data from PostGIS systems. Besides pandas, it has a lot of other dependencies on spatial data libraries like PyGEOS, GeoPy, or Shapely. The library's basic structures are:

GeoSeries - A column of geospatial data, such as a series of points, lines, or polygons
GeoDataFrame - A tabular structure holding a set of GeoSeries

Utils

Matplotlib

As the name suggests, it is a library for creating various types of plots ;p. Besides basic plots like lines or histograms, matplotlib allows us to create more complex plots: 3D shapes or polar plots. Of course, it also allows us to customize things like plot colors or labels. Despite being somewhat old (released in 2003, so 20 years old at the time of writing), it is actively maintained and developed. With around 17k stars on GitHub, it has quite a community around it and is probably the second pick for anyone needing a data visualization tool. It is well documented and reasonably easy to grasp for a newcomer. It is also worth noting that matplotlib is used as a base for more high-level visualization libraries.

Seaborn

Speaking of which, we have Seaborn as an example of such a library. Thus, the set of functionalities provided by Seaborn is similar to that provided by Matplotlib. However, the API is more high-level and requires less boilerplate code to achieve similar results. As for other minor differences, the color palette provided by Seaborn is softer, and the design of the plots is more modern and nicer looking. Additionally, Seaborn is easier to integrate with pandas, which may be a significant advantage. Below you can find the code used to create a heatmap in Matplotlib and Seaborn, alongside their output plots. The imports are common.

import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

data = np.random.rand(5, 5)

fig, ax = plt.subplots()
heatmap = ax.pcolor(data, cmap=plt.cm.Blues)
ax.set_xticks(np.arange(data.shape[0])+0.5, minor=False)
ax.set_yticks(np.arange(data.shape[1])+0.5, minor=False)
ax.set_xticklabels(np.arange(1, data.shape[0]+1), minor=False)
ax.set_yticklabels(np.arange(1, data.shape[1]+1), minor=False)
plt.title("Heatmap")
plt.xlabel("X axis")
plt.ylabel("Y axis")
cbar = plt.colorbar(heatmap)
plt.show()

sns.heatmap(data, cmap="Blues", annot=True)
# Set plot title and axis labels
plt.title("Heatmap")
plt.xlabel("X axis")
plt.ylabel("Y axis")
# Show plot
plt.show()

Hydra

In every project, sooner or later, there is a need to make some value configurable. Of course, if you are using a tool like Jupyter, the matter is pretty straightforward: you can just move the desired value to an .env file - et voilà, it is configurable. However, if you are building a more standard application, things are not that simple. Here Hydra shows its ugly (but quite useful) head. It is an open-source tool for managing and running the configuration of Python-based applications.
Hydra is based on the OmegaConf library, and quoting from their main page: “The key feature is the ability to dynamically create a hierarchical configuration by composition and override it through config files and the command line.” What proved quite useful for me is the ability described in the quote above: hierarchical configuration. In my case, it worked pretty well and allowed a clearer separation of config files. coolname Having a unique identifier for your training runs is always a good idea. If, for various reasons, you do not like UUIDs or just want your ids to be humanly understandable, coolname is the answer. It generates unique, alphabetical, word-based identifiers of various lengths, from 2 to 4 words. As for the number of combinations, it looks more or less like this: a 4-word identifier has 10^10 combinations, a 3-word identifier has 10^8 combinations, and a 2-word identifier has 10^5 combinations. The number is significantly lower than in the case of UUIDs, so the probability of collision is also higher. However, comparing the two is not the point of this text. The vocabulary is hand-picked by the creators. Yet they described it as positive and neutral (more about this here), so you will not see an identifier like you-ugly-unwise-human-being. Of course, the library is fully open source. tqdm This library provides progress bar functionality for your application. Having such information displayed is maybe not the most important thing you need, but it is still nice to look at and check the progress made by your application during the execution of an important task. Tqdm also uses complex algorithms to estimate the remaining time of a particular task, which may be a game changer and help you organize your time around it. Additionally, tqdm states that it has barely noticeable performance overhead - in nanoseconds. What is more, it is totally standalone and needs only Python to run. Thus it will not download half of the internet to your disk. Jupyter Notebook (+JupyterLab) Notebooks are a great way to share results and work on the project. Through the concept of cells, it is easy to separate different fragments and responsibilities of your code. Additionally, the fact that a single notebook file can contain code, images, and complex text outputs (tables) together only adds to its existing advantages. Moreover, notebooks allow running pip install inside the cells and using .env files for configuration. Such an approach moves a lot of software engineering complexity out of the way. Summary These are all the various libraries for machine learning that I wanted to describe for you. I aimed to provide a general overview of all the libraries alongside their possible use cases, enriched by a quick note on my own experience with using them. I hope that my goal was achieved and this article will deepen your knowledge of the machine learning libraries landscape.
VARMAX-As-A-Service is an MLOps approach for the unification and reuse of statistical models and machine learning model deployment pipelines. It is the first of a series of articles that will be built on top of that project, representing experiments with various statistical and machine learning models, data pipelines implemented using existing DAG tools, and storage services, both cloud-based and alternative on-premises solutions. But what are VARMAX and statistical models in general, and how are they different from machine learning models? Statistical Models Statistical models are mathematical models, and so are machine learning models. A statistical model is usually specified as a mathematical relationship between one or more random variables and other non-random variables. As such, a statistical model is "a formal representation of a theory" and belongs to the field of statistical inference. Some statistical models can be used to make predictions using predefined mathematical formulas and coefficients estimated from historical data. Statistical models explicitly specify a probabilistic model for the data and identify variables that are usually interpretable and of special interest, such as the effects of predictor variables. In addition to identifying relationships between variables, statistical models establish both the scale and significance of the relationship. Machine Learning Models Machine learning models can be interpreted as mathematical models, too. ML models can also build predictions based on historical data without explicit programming. However, those models are more empirical. ML usually does not impose relationships between predictors and outcomes, nor does it isolate the effect of any single variable. The relationships between variables might not be understandable, but what we get in return are predictions. Now, let us focus on predictions, or forecasts, applied to time series data (successive measurements made from the same source over a fixed time interval) using statistical modeling and machine learning. Both approaches are represented by an underlying mathematical model; they need data to be trained/parameterized, and they produce new time series representing the forecasted values. Given those similarities, we will apply the approach used to expose machine learning models as services to a statistical model used for time series forecasting called VARMAX. VARMAX VARMAX is a statistical model or, generally speaking, a procedure that estimates the model parameters and generates forecasts. Often, economic or financial variables are not only contemporaneously correlated with each other but also correlated with each other’s past values. The VARMAX procedure can be used to model these types of time relationships. This article is based on an application called VARMAX-As-A-Service that can be found in a dedicated GitHub repository. It comprises two main components: Runtime component — a dockerized, deployable REST service Preprocessing component — a set of Python functions responsible for data loading, model optimization, model instantiation, and model serialization, enabling its future reuse The architecture of the runtime component comprising the application is depicted in the following picture: The user sends a request via a browser to an Apache web server hosting the model. Behind the scenes, this is a Python Flask application that calls a previously configured and serialized model stored as a pickle file.
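To make the runtime component more tangible, the following minimal sketch shows how such a Flask endpoint might load a pickled VARMAX result and return a forecast. The file name, route, and query parameter are assumptions for illustration, not the actual code from the repository.

# Minimal sketch of a Flask service serving a pickled forecasting model.
# File name, route, and parameter names are illustrative assumptions.
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# Load the serialized model once, at startup
with open("varmax_model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/forecast", methods=["GET"])
def forecast():
    # Number of future periods to forecast, passed as a query parameter
    steps = int(request.args.get("steps", 10))
    # Assumes the model was fit without exogenous regressors; otherwise they must be supplied here
    prediction = model.forecast(steps=steps)
    return jsonify(prediction.to_dict(orient="list"))

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=80)

In the actual project, this application is not run with Flask's built-in development server but is served through Apache httpd and WSGI, as described below.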
Note: Pickle is a serialization module in Python's standard library. Flask is a lightweight WSGI web application framework. It is designed to make getting started quick and easy, with the ability to scale up to complex applications. Flask is a good choice as a framework for implementing a web service that exposes the statistical model API, and it also provides a web server for testing. However, for deployment in production, we need a proper web server and gateway interface. The Docker image created in that project deploys the Flask application using Apache httpd and WSGI (Web Server Gateway Interface) on a Linux-based system. Apache is a powerful and widely used web server, while WSGI is a standard interface between web servers and Python applications. Apache httpd is a fast, production-level HTTP server. Serving as a “reverse proxy,” it can handle incoming requests, TLS, and other security and performance concerns better than a WSGI server. The Docker image, as well as the model code, can be found in a dedicated GitHub repository. REST services can be easily integrated into existing web applications as part of an algorithm or as a step in a DAG of a prediction data pipeline (see Apache Airflow, Apache Beam, AWS Step Functions, Azure ML Pipeline). The integration as a part of a pipeline will be the focus of an upcoming article, while this one exposes the service as a Swagger-documented endpoint with a Swagger UI for testing and experimenting with various input datasets. After deploying the project, the Swagger API is accessible via <host>:<port>/apidocs (e.g., 127.0.0.1:80/apidocs). There are two endpoints implemented: one that takes the user's input parameters and one that accepts an input file. Internally, the service uses the deserialized model pickle file, and incoming requests are forwarded to the initialized model. Prior to the implementation of the REST service and its deployment, the actual model needs to be prepared. In the picture below, the step needed to prepare the model for deployment is called a preprocessing step. It should not be confused with the term data preprocessing as used by data analysts. In the example project, the data set used to optimize the model parameters is called United States Macroeconomic data and is provided by the Python library statsmodels without the need to apply additional data processing.
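As a rough illustration of that preprocessing step, which is enumerated in detail right after this sketch, the following code loads the statsmodels macroeconomic dataset, fits a VARMAX model, and pickles it. The selected columns, the split, and the fixed (p, q) order are simplifying assumptions; the actual project searches for the optimal parameters instead of hardcoding them.

# Hedged sketch of the model preparation (preprocessing) step.
# Column names, the split, and the (p, q) order are illustrative assumptions.
import pickle

import statsmodels.api as sm
from statsmodels.tsa.statespace.varmax import VARMAX

# Load the United States Macroeconomic dataset shipped with statsmodels
data = sm.datasets.macrodata.load_pandas().data
endog = data[["realgdp", "realcons"]]  # endogenous time series

# Keep the last observations aside as a simple test set
train = endog.iloc[:-8]

# Instantiate and fit VARMAX with a fixed order; the real project
# searches for the optimal (p, q) instead of hardcoding it
model = VARMAX(train, order=(1, 1))
fitted = model.fit(disp=False)

# Serialize the fitted model so the REST service can reuse it
with open("varmax_model.pkl", "wb") as f:
    pickle.dump(fitted, f)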
The preprocessing algorithm comprises the following steps: Load data Divide data into train and test data sets Prepare exogenous variables Find optimal model parameters (p, q) Instantiate the model with the optimal parameters identified Serialize the instantiated model to a pickle file And finally, the steps needed to run the application are: Execute the model preparation algorithm: python varmax_model.py Deploy the application: docker-compose up -d Test the model: http://127.0.0.1:80/apidocs The presented project is a simplified workflow that can be extended step by step with additional functionalities like: Store model files in a remote repository (e.g., Relational Database, MinIO Service, S3 Storage) Explore standard serialization formats and replace the pickle with an alternative solution Integrate time series data visualization tools like Kibana or Apache Superset Store time series data in a time series database like Prometheus, TimescaleDB, InfluxDB Extend the pipeline with data loading and data preprocessing steps Track model versions Incorporate metric reports as part of the pipeline Implement pipelines using specific tools like Apache Airflow or AWS Step Functions or more standard tools like GitLab or GitHub Compare statistical models' performance and accuracy with machine learning models Implement end-to-end cloud-integrated solutions, including Infrastructure-as-Code Expose other statistical and ML models as services Some of these future improvements will be the focus of the next articles and projects. The goal of this article is to build the basic project structure and a simple processing workflow that can be extended and improved over time. However, it represents an end-to-end infrastructural solution that can be deployed in production and improved as part of a CI/CD process over time.
What Is Generative AI? Generative AI is a category of artificial intelligence (AI) techniques and models designed to create novel content. Unlike simple replication, these models produce data — such as text, images, music, and more from scratch by leveraging patterns and insights gleaned from a training dataset. How Does Generative AI Work? Generative AI employs diverse machine learning techniques, particularly neural networks, to decipher patterns within a given dataset. Subsequently, this knowledge is harnessed to generate new and authentic content that mirrors the patterns present in the training data. While the precise mechanism varies based on the specific architecture, the following offers a general overview of common generative AI models: Generative Adversarial Networks (GANs): GANs consist of two principal components: a generator and a discriminator. The generator's role involves crafting fresh data instances, such as images, by converting random noise into data that echoes the training data. The discriminator strives to differentiate between genuine data from the training set and fabricated data produced by the generator. Both components are concurrently trained in a competitive process, with the generator evolving by learning from the discriminator's feedback. Over time, the generator becomes adept at crafting data that increasingly resembles authentic information. Variational Autoencoders (VAEs): VAEs belong to the autoencoder neural network category, comprising an encoder network and a decoder network. The encoder maps an input data point (e.g., an image) to a reduced-dimensional latent space representation. The decoder, conversely, generates a reconstruction of the original data based on a point in the latent space. VAEs focus on acquiring a probabilistic distribution over the latent space during training, facilitating the generation of fresh data points by sampling from this distribution. These models ensure the generated data closely resembles the input data while adhering to a specific distribution, usually a Gaussian distribution. Autoregressive Models: For instance, in text generation, the model may predict the subsequent word based on preceding words within a sentence. These models undergo training via maximum likelihood estimation, where the aim is to maximize the likelihood of producing the actual training data. Transformer-Based Models: Models like the Generative Pre-trained Transformer (GPT) utilize a transformer architecture to generate text and other sequential data. Transformers process data in parallel, enhancing efficiency for generating extensive sequences. The model assimilates relationships among different elements within the data, enabling the creation of coherent and contextually relevant sequences. In all instances, generative AI models are trained using a dataset containing examples of the desired output. Training involves tuning the model's parameters to minimize differences between generated and actual data. Once trained, these models can craft new data by drawing on learned patterns and distributions, with the quality of output improving through exposure to more varied and representative training data. How To Develop Generative AI Models Developing generative AI models entails a structured process encompassing data preparation, model selection, training, evaluation, and deployment. 
The ensuing guide outlines key stages in developing generative AI models: Define the task and collect data: Clearly define the intended generative task and type of content (e.g., text, images, music). Curate a diverse and high-quality dataset representative of the target domain. Select a Generative Model Architecture: Choose an architecture tailored to the task, such as Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), autoregressive models, or transformer-based models like GPT. Preprocess and prepare data: Clean, preprocess, and format the dataset to suit training requirements. This may involve text tokenization, image resizing, normalization, and data augmentation. Split data for training and validation: Divide the dataset into training and validation subsets. Validation data aids in monitoring and preventing overfitting. Design the model Architecture: Architect the neural network model, specifying layers, connections, and parameters based on the chosen framework. Define Loss Functions and metrics: Select suitable loss functions and evaluation metrics, tailored to the generative task. GANs may employ adversarial loss, while language models might use language modeling metrics. Train the model: Train the model using prepared training data, adjusting hyperparameters like learning rate and batch size. Monitor performance on the validation set, iteratively refining training parameters. Evaluate model performance: Employ various evaluation metrics—quantitative and qualitative—to assess output quality, diversity, and novelty. Fine-tune and iterate: Based on the evaluation results, refine the model architecture and training process. Experiment with variations to optimize performance. Address bias and ethical considerations: Mitigate biases, stereotypes, or ethical concerns in the generated content, prioritizing responsible AI development. Generate and test new content: Upon achieving satisfactory performance, deploy the model to generate new content. Test in real-world scenarios and gather user feedback. Deploy the model: If the model meets the requirements, integrate it into the desired application, system, or platform. Continuously monitor and update: Maintain model performance over time by monitoring and updating in response to evolving needs and data. Generative AI model development involves iterative experimentation, emphasizing technical and ethical considerations. Collaboration with domain experts, data scientists, and AI researchers enhances the creation of effective and responsible generative AI models. What Are the Use Cases for Generative AI? Generative AI has permeated numerous domains, facilitating the creation of original content in various forms. The following outlines some of the most prevalent applications of generative AI: Text generation and language modeling: Prominent in article and creative writing, chatbots, language translation, code generation, and other text-based tasks. Image generation and style transfer: Utilized for realistic image creation, artistic style modification, and the generation of photorealistic portraits. Music composition and generation: Applied to compose music, devise melodies, harmonies, and entire compositions spanning diverse genres. Content recommendation: Employs generative techniques to offer personalized content recommendations, spanning movies, music, books, and products. 
Natural Language Generation (NLG): Generates human-readable text from structured data, enabling automated report creation, personalized messages, and product descriptions. Fake content detection and authentication: Develops tools to detect and counteract fake news, deepfakes, and other manipulated or synthetic content. Healthcare and medical imaging: Enhances medical imaging with image resolution enhancement, synthesis, and 3D model generation for diagnosis and treatment planning. These applications exemplify the diverse and far-reaching impact of generative AI across industries and creative domains. As AI progresses, innovative applications are likely to emerge, further expanding the horizons of generative AI technology. What Are the Challenges of Generative AI? Generative AI has made remarkable strides in generating novel and creative content, but it also faces several challenges that researchers and practitioners need to address. Some of the key challenges of generative AI include: Mode collapse and lack of diversity: In some cases, generative models like GANs can suffer from "mode collapse," where the model generates a limited variety of outputs or gets stuck in a subset of the possible modes in the data distribution. Ensuring diverse and varied outputs remains a challenge. Training instability: Training generative models, especially GANs, can be unstable and sensitive to hyperparameters. Finding the right balance between generator and discriminator and maintaining stable training can be challenging. Evaluation metrics: Defining appropriate metrics to evaluate the quality of generated content is challenging, especially for subjective tasks like art and music generation. Metrics may not always capture the full spectrum of quality, novelty, and creativity. Data quality and bias: The quality of training data significantly affects the performance of generative models. Biases and inaccuracies in the training data can lead to biased or undesirable outputs. Addressing data quality and biases is crucial. Ethical concerns: Generative AI can be misused for creating fake content, deepfakes, or spreading misinformation. Computational resources: Training complex generative models requires significant computational resources, including powerful GPUs or TPUs and substantial memory. This can limit accessibility and scalability. Interpretable and controllable generation: Understanding and controlling the output of generative models is challenging. Ensuring that the generated content aligns with user intentions and preferences is an ongoing research area. Long-range dependencies: Some generative models struggle with capturing long-range dependencies in sequential data, leading to issues like unrealistic text generation or a lack of coherence. Transfer learning and fine-tuning: Adapting pre-trained generative models to specific tasks or domains while retaining their learned knowledge is a complex process that requires careful fine-tuning. Resource-intensive training: Training large-scale generative models can consume a significant amount of time and energy, making it important to explore more energy-efficient training techniques. Real-time generation: Achieving real-time or interactive generative AI applications, such as live music composition or video game content generation, poses challenges in terms of speed and responsiveness. Generalization and creativity: Ensuring that generative models generalize well to diverse inputs and produce truly creative and innovative outputs remains a challenge. 
Addressing these challenges involves ongoing research, innovation, and collaboration among AI practitioners, researchers, and ethicists. As generative AI continues to evolve, advancements in these areas will contribute to safer, more reliable, and ethically responsible AI systems. Conclusion Generative AI pioneers the forefront of AI, ushering in a creative era. This technique crafts original content by learning intricate patterns from data, spanning text, images, and music. Through diverse machine learning methods, particularly neural networks, generative AI spawns novel expressions. In the grand AI tapestry, generative AI emerges as a dynamic thread, illuminating a path where machines partner in human expression's symphony.
AI is under a lot of hype now, and some products overuse the AI topic a lot; however, many companies and products are automating their processes using this technology. In this article, we will discover AI products and build an AI landing zone. Let's look into the top 3 companies that benefit from using AI. GitHub Copilot GitHub Copilot's primary objective is to aid programmers by providing code suggestions and auto-completing lines or blocks of code while they write. By intelligently analyzing the context and existing code, it accelerates the coding process and enhances developer productivity. It becomes an invaluable companion for developers throughout their coding journey, capable of supporting various programming languages and comprehending code patterns. Neuraltext Neuraltext strives to cover the entire content workflow, from generating ideas to executing them, all powered by AI. It is an AI-driven copywriter, SEO content, and keyword research tool. By leveraging its AI copywriting capabilities, you can effortlessly produce compelling copy for your campaigns, generating numerous variations. With a vast collection of over 50 pre-designed templates for various purposes, such as Facebook ads, slogan ideas, blog sections, and more, Neuraltext simplifies the content creation process. Motum Motum is the intelligent operating system for operational fleet management. It has damage recognition that uses computer vision and machine learning algorithms to detect and assess damage to vehicles automatically. By analyzing images of vehicles, the AI system can accurately identify dents, scratches, cracks, and other types of damage. This technology streamlines the inspection process for insurance claims, auto body shops, and vehicle appraisals, saving time and improving accuracy in assessing the extent of damages. What Is a Cloud Landing Zone? An AI cloud landing zone is a framework that includes the fundamental cloud services, tools, and infrastructure that form the basis for developing and deploying artificial intelligence (AI) solutions. What AI Services Are Included in the Landing Zone? The Azure AI landing zone includes the following AI services: Azure Open AI - Provides pre-built AI models and APIs for tasks like image recognition, natural language processing, and sentiment analysis, making it easier for developers to incorporate AI functionalities; Azure AI services also include machine learning tools and frameworks for building custom models and conducting data analysis. Azure AI Services - A service that enables organizations to create more immersive, personalized, and intelligent experiences for their users, driving innovation and efficiency in various industries; developers can leverage these pre-built APIs to add intelligent features to their applications, such as face recognition, language understanding, and sentiment analysis, without extensive AI expertise. Azure Bot Services - A platform provided by Microsoft Azure as part of the AI services. It enables developers to create chatbots and conversational agents that interact with users across various channels, such as web chat, Microsoft Teams, Skype, Telegram, and other platforms. Architecture We started integrating and deploying the Azure AI landing zone into our environment. The AI landing zone is separated into three logical boxes: Azure DevOps pipelines, Terraform modules and environments, and resources that are deployed to Azure subscriptions. We can see it in the diagram below.
Figure 1: AI Landing Zone Architecture (author: Boris Zaikin)

The architecture contains CI/CD YAML pipelines and Terraform modules for each Azure subscription. It contains two YAML files: tf-provision-ci.yaml is the main pipeline that is based on stages; it reuses the tf-provision-ci.jobs.yaml pipeline for each environment. tf-provision-ci.jobs.yaml contains the workflow to deploy the Terraform modules.

YAML
trigger:
- none

pool:
  vmImage: 'ubuntu-latest'

variables:
  devTerraformDirectory: "$(System.DefaultWorkingDirectory)/src/tf/dev"
  testTerraformDirectory: "$(System.DefaultWorkingDirectory)/src/tf/test"
  prodTerraformDirectory: "$(System.DefaultWorkingDirectory)/src/tf/prod"

stages:
- stage: Dev
  jobs:
  - template: tf-provision-ci-jobs.yaml
    parameters:
      environment: dev
      subscription: 'terraform-spn'
      workingTerraformDirectory: $(devTerraformDirectory)
      backendAzureRmResourceGroupName: '<tfstate-rg>'
      backendAzureRmStorageAccountName: '<tfaccountname>'
      backendAzureRmContainerName: '<tf-container-name>'
      backendAzureRmKey: 'terraform.tfstate'

- stage: Test
  jobs:
  - template: tf-provision-ci-jobs.yaml
    parameters:
      environment: test
      subscription: 'terraform-spn'
      workingTerraformDirectory: $(testTerraformDirectory)
      backendAzureRmResourceGroupName: '<tfstate-rg>'
      backendAzureRmStorageAccountName: '<tfaccountname>'
      backendAzureRmContainerName: '<tf-container-name>'
      backendAzureRmKey: 'terraform.tfstate'

- stage: Prod
  jobs:
  - template: tf-provision-ci-jobs.yaml
    parameters:
      environment: prod
      subscription: 'terraform-spn'
      workingTerraformDirectory: $(prodTerraformDirectory)
      backendAzureRmResourceGroupName: '<tfstate-rg>'
      backendAzureRmStorageAccountName: '<tfaccountname>'
      backendAzureRmContainerName: '<tf-container-name>'
      backendAzureRmKey: 'terraform.tfstate'

tf-provision-ci.yaml contains the main configuration, variables, and stages: Dev, Test, and Prod. The pipeline re-uses tf-provision-ci.jobs.yaml in each stage by providing different parameters. After we've added and executed the pipeline in Azure DevOps, we can see the following staging structure.

Figure 2: Azure DevOps Stages UI

Azure DevOps automatically recognizes the stages in the main YAML pipeline and provides a proper UI. Let's look into tf-provision-ci.jobs.yaml.
YAML
jobs:
- deployment: deploy
  displayName: AI LZ Deployments
  pool:
    vmImage: 'ubuntu-latest'
  environment: ${{ parameters.environment }}
  strategy:
    runOnce:
      deploy:
        steps:
        - checkout: self

        # Prepare working directory for other commands
        - task: TerraformTaskV3@3
          displayName: Initialise Terraform Configuration
          inputs:
            provider: 'azurerm'
            command: 'init'
            workingDirectory: ${{ parameters.workingTerraformDirectory }}
            backendServiceArm: ${{ parameters.subscription }}
            backendAzureRmResourceGroupName: ${{ parameters.backendAzureRmResourceGroupName }}
            backendAzureRmStorageAccountName: ${{ parameters.backendAzureRmStorageAccountName }}
            backendAzureRmContainerName: ${{ parameters.backendAzureRmContainerName }}
            backendAzureRmKey: ${{ parameters.backendAzureRmKey }}

        # Show the current state or a saved plan
        - task: TerraformTaskV3@3
          displayName: Show the current state or a saved plan
          inputs:
            provider: 'azurerm'
            command: 'show'
            outputTo: 'console'
            outputFormat: 'default'
            workingDirectory: ${{ parameters.workingTerraformDirectory }}
            environmentServiceNameAzureRM: ${{ parameters.subscription }}

        # Validate Terraform Configuration
        - task: TerraformTaskV3@3
          displayName: Validate Terraform Configuration
          inputs:
            provider: 'azurerm'
            command: 'validate'
            workingDirectory: ${{ parameters.workingTerraformDirectory }}

        # Show changes required by the current configuration
        - task: TerraformTaskV3@3
          displayName: Build Terraform Plan
          inputs:
            provider: 'azurerm'
            command: 'plan'
            workingDirectory: ${{ parameters.workingTerraformDirectory }}
            environmentServiceNameAzureRM: ${{ parameters.subscription }}

        # Create or update infrastructure
        - task: TerraformTaskV3@3
          displayName: Apply Terraform Plan
          continueOnError: true
          inputs:
            provider: 'azurerm'
            command: 'apply'
            environmentServiceNameAzureRM: ${{ parameters.subscription }}
            workingDirectory: ${{ parameters.workingTerraformDirectory }}

tf-provision-ci.jobs.yaml contains the Terraform tasks, including init, show, validate, plan, and apply. Below, we can see the execution process.

Figure 3: Azure DevOps Landing Zone Deployment UI

As we can see, the execution of all pipelines completes successfully, and each job provides detailed information about state, configuration, and validation errors. Also, we must not forget to fill out the Request Access Form. It takes a couple of days to get a response back. Otherwise, the pipeline will fail with a quota error message. Terraform Scripts and Modules By utilizing Terraform, we can encapsulate the code within a Terraform module, allowing for its reuse across various sections of our codebase. This eliminates the need for duplicating and replicating the same code in multiple environments, such as staging and production. Instead, both environments can leverage code from a shared module, promoting code reusability and reducing redundancy. A Terraform module can be defined as a collection of Terraform configuration files organized within a folder. Technically, all the configurations you have written thus far can be considered modules, although they may not be complex or reusable. When you directly deploy a module by running “apply” on it, it is called a root module. However, to truly explore the capabilities of modules, you need to create reusable modules intended for use within other modules. These reusable modules offer greater flexibility and can significantly enhance your Terraform infrastructure deployments. Let's look at the project structure below.
Figure 4: Terraform Project Structure

Modules The image above shows that all resources are placed in one module directory. Each environment has its own directory, index Terraform file, and variables; all resources are reused in an index.tf file with different parameters that are kept in the variable files. We will place each resource in a separate file in the module, and all values will be put into Terraform variables. This allows managing the code quickly and reduces hardcoded values. Also, resource granularity allows organized teamwork with Git or other source control (fewer merge conflicts). Let's have a look into the open-ai tf module.

HCL
resource "azurerm_cognitive_account" "openai" {
  name                          = var.name
  location                      = var.location
  resource_group_name           = var.resource_group_name
  kind                          = "OpenAI"
  custom_subdomain_name         = var.custom_subdomain_name
  sku_name                      = var.sku_name
  public_network_access_enabled = var.public_network_access_enabled
  tags                          = var.tags

  identity {
    type = "SystemAssigned"
  }

  lifecycle {
    ignore_changes = [
      tags
    ]
  }
}

The essential Open AI module parameters are:
prefix: Sets a prefix for all Azure resources
domain: Specifies the domain part of the hostname used to expose the chatbot through the Ingress Controller
subdomain: Defines the subdomain part of the hostname used for exposing the chatbot via the Ingress Controller
namespace: Specifies the namespace of the workload application that accesses the Azure OpenAI Service
service_account_name: Specifies the name of the service account used by the workload application to access the Azure OpenAI Service
vm_enabled: A boolean value determining whether to deploy a virtual machine in the same virtual network as the AKS cluster
location: Specifies the region (e.g., westeurope) for deploying the Azure resources
admin_group_object_ids: An array parameter containing the list of Azure AD group object IDs with admin role access to the cluster

We need to pay attention to the subdomain parameters. Azure Cognitive Services utilize custom subdomain names for each resource created through Azure tools such as the Azure portal, Azure Cloud Shell, Azure CLI, Bicep, Azure Resource Manager (ARM), or Terraform. These custom subdomain names are unique to each resource and differ from the regional endpoints previously shared among customers in a specific Azure region. Custom subdomain names are necessary for enabling authentication features like Azure Active Directory (Azure AD). Specifying a custom subdomain for our Azure OpenAI Service is essential in some cases. Other parameters can be found in “Create a resource and deploy a model using Azure OpenAI.” In the Next Article Add an Azure Private Endpoint into the configuration: A significant aspect of Azure Open AI is its utilization of a private endpoint, enabling precise control over access to your Azure Open AI services. With a private endpoint, you can limit access to your services to only the necessary resources within your virtual network. This ensures the safety and security of your services while still permitting authorized resources to access them as required. Integrate OpenAI with Azure Kubernetes Services: Integrating OpenAI services with a Kubernetes cluster enables efficient management, scalability, and high availability of AI applications, making it an ideal choice for running AI workloads in a production environment. Describe and compare our lightweight landing zone and the OpenAI landing zone from Microsoft.
Project Repository GitHub - Boriszn/Azure-AI-LandingZone Conclusion This article explores AI products and the creation of an AI landing zone. We highlight three key products benefiting from AI: GitHub Copilot for coding help, Neuraltext for AI-driven content, and Motum for intelligent fleet management. Moving to AI landing zones, we focus on Azure AI services, like Open AI, with pre-built models and APIs. We delve into the architecture using Terraform and CI/CD pipelines, where Terraform's modular approach is vital, emphasizing reusability. We also examine the Open AI module parameters, especially custom subdomains for Azure Cognitive Services. In this AI-driven era, automation and intelligent decisions are revolutionizing technology.
I recently discussed how we use Copilot and ChatGPT for programming with some of my senior colleagues. We discussed our experiences, how and when it helps, and what to expect from it in the future. In this article, I will shortly write about what I imagine the future of programming with AI will be. This is not about what AI will do in programming in the future. I have no idea about that, and based on my mood, I either look forward amazed or in fear. This article is more about how we, programmers, will do our work in the future. The Past To predict the future, we have to understand the past. If you know only the current state, you cannot reliably extrapolate. Extrapolation needs at least two points and knowledge about the speed of change in the future (maximum and minimum of the derivative function). So, here we go, looking a bit at the past, focusing on the aspects that I feel are mainly important to predict how we will work in the future with AI in programming. Machine Code When computers were first introduced, we programmed them in machine code. This sentence should read, "Your father/mother programmed them in machine code," for most of you. I had the luck to program a Polish clone of the PDP-11 in machine code. To create a program, we used assembly language. We wrote that on a piece of checkerboard paper(and then we typed it in). No. Note: As I write this article, Copilot is switched on, suggesting the sentences' ends. In 10% of the cases, I accept the suggestion. Copilot suggested the part between (and) in the last sentence. It's funny that even Copilot cannot imagine a system where we did not have an assembler. We wrote the assembly on the left side of the paper and the machine code on the right. We used printed code tables to look up the machine codes, and we had to calculate the addresses. After that, we entered the code. This was also a cumbersome process. There were switches on the front panel of the computer. There were 11 switches for the address (as far as I remember) and eight switches for the data. A switch flipped up meant a bit with a value of 1, and a switch flipped down meant a bit with a value of 0. We set the address and the desired value, and then we had to push a button to write the value into the memory. The memory consisted of ferrite rings that kept their value even after the power was switched off. Assembly It was a relief when we got the assembler. It was already on a different machine and a different processor. We looked at the machine code that the assembler generated a few times, but not many times. The mapping between the assembly and the machine code was strictly one-to-one mapping. The next step was higher-level languages, like C. I wrote a lot of C code as a hobby before I started my second professional career as a programmer at 40. Close to the Metal The mapping from C to machine code is not one-to-one. There is room for optimization, and different compiler versions may create different code. Still, the functionality of the generated code is very much guaranteed. You do not need to look at the machine code to understand what the program does. I can recall that I only did it a few times. One of those times, we found a bug in the Sun C compiler (1987, while I was on a summer program at TU Delft). It was my mistake the other time, and I had to modify my C code. The compiler knew better than I did what the C construct I wrote meant. I do not have a recollection of the specifics. 
We do not need to look at the generated code; we write on a high level and debug on a high level. High-Level As we advance in time, we have Java. Java uses two-level compilation. It compiles the Java code to byte code, which the Java Virtual Machine interprets using JIT technology. I looked at the generated byte code only once, to learn the intricacies of the ternary operator type casting rules, and I never looked at the generated machine code. The first case could be avoided by reading the language spec, but who reads manuals? The same is true here: we step to higher levels of abstraction and do not need to look at the generated code. DSL and Generated Code Even as we advance towards higher levels, we can have Domain Specific Languages (DSLs). DSLs are interpreted, generate high-level code, or generate byte code and machine code. The third case is rare because generating low-level code is expensive, requires much work, and is not worth the effort. Generating high-level code is more common. As an example, we can take the Java::Geci fluent API generator. It reads a regular-expression-like definition of the fluent API, creates a finite state machine from it, and generates the Java code containing all the interfaces and classes that implement the fluent API. The Java compiler then compiles the generated code, and the JVM interprets the resulting byte code. Should we look at the generated code? Usually not. I actually did a lot because I wrote the generator, and so I had to debug it, but that is an exception. The generated code should perform as the definition says. The Present and the Future The next step is AI languages. This is where we are now, and it starts now. We use AI to write code based on some natural language description. The code is generated, and we have to look at it. This is different from any earlier step in the evolution of programming languages. The reason is that the language the AI interprets is not well-defined the way Java, C, or any DSL is. It can be ambiguous. It is a human language, usually English. Or something resembling English when non-native speakers like me write it. Syntax-Free This is the advantage of AI programming. I do not need to remember the actual syntax. I can program in a language I rarely use and forget the exact syntax. I vaguely remember it, but it is not in my muscle memory. Library-Free It can also help me with my usual programming tasks. Something that was written by other people many times before. It has it in its memory, and it can help me. Conventional programming languages have this, too, but with a limited scope. There are language constructs for the usual data structures and algorithms. There are libraries for the usual tasks. The problem is that you have to remember a library to use it. Sometimes, writing a few lines is easier than finding the library and the function that does it. It is the same philosophy as the Unix command line versus VMS. (You may not know VMS. It was the OS of the VAX and Alpha machines from DEC.) If you needed to do something in VMS, there was a command for it. In Unix, you had simpler commands, but you could combine them. With AI programming, you can write down what you want using natural language, and the AI will find the code fragments in its memory that fit best and adapt them. AI-Language Today, AI is generating and helping to write the code. In the future, we will tell the AI what to do, and it will execute it for us. We may not need to care about the data structures it uses to store the data or the algorithms it applies to manage them.
Today, we think of databases when we talk about structured data. That is because databases are the tools that support the limited functionality a computer can manage. Before computers, we just told the accountant to calculate last year's profit, balance sheet, whatnot, and they did. The data was on paper, and the managers did not care how it was organized. It was expensive because accountants are expensive. The intelligence they applied, extracting data from the different paper-based documents, was their strong point; calculation was just a mechanical task. Computers came, and they were strong at doing the calculations. They were weak at extracting data from the documents. The solution was to organize the data into databases. It needed more processing on the input side, but it was still cheaper than having accountants do the calculations. With AI, computers can do calculations and extract data from documents. If that can be done cheaply, there is no reason anymore to keep the data in a structured way. It can get structured on the fly when we need it for a calculation. The advantage is that we can do any calculation, and we may not face the issue that the data structure is unsuitable for the calculation we need. We just tell the AI program using natural language. Is there a new patient coming to the practice? Just tell the program all the data, and it will remember, like an assistant with unlimited memory who never forgets. Do you want to know when a patient last visited? Just ask the program. You do not need to care how the artificial simulated neurons store the information. It certainly will use more computing power and energy than a well-tuned database, but on the other hand, it will have higher flexibility, and the development cost will be significantly lower. This is when we will talk to the computers, and they will help us universally. I am not shy about predicting this future because it will come when I am not around anymore. But what should we expect in the near future? The Near Future Now, AI tools are interactive. We write some comments or code, and the AI generates the code for us, which is the end of the story. From that point on, our "source code" is the generated code. You can feel the contradiction in the previous sentence. It is as if we wrote the code in Java once, compiled it into byte code, and then used the byte code to maintain the application. We do not do that. Source code is what we write. Generated code is never source code. I expect meta-programming tools for various existing languages to extend them. You insert some meta-code (presumably into comments) into your application, and the tool will generate the code for you. However, the generated code is generated and not the source. You do not touch it. If you need to maintain the application, you modify the comment, and the tool will generate the code again. It will be similar to what Java::Geci is doing. You insert some comments into your code, and the code generator inserts the generated code into the editor-fold block following the comment. Java::Geci currently does not have an AI-based code generator, or at least I do not know about any. It is an open-source framework for code generators; anyone could write a code generator utilizing AI tools. Later languages will include the possibility from the start. These languages will be some kind of hybrid solution.
There will be some code described in human language, probably the business logic, and some technical parts written more like a conventional programming language. It is similar to how we apply DSLs today, with the difference that the DSL will be AI-processed. As time goes forward, the AI part will grow, and the conventional programming part will shrink to the point where it disappears from the application code. However, it will remain in the frameworks and AI tools, just like today's machine code and assembly. Nobody codes in assembly anymore. But wait: there are still people who do - those who write the code generators. And those who, 200 years from now, will still maintain the IBM mainframe assembly and COBOL programs. Conclusion and Takeaway I usually write a conclusion and a takeaway at the end of the article. So I do it now. That is all, folks.
What if we wanted to analyze a small piece of text with no additional information or context and be able to get the most reasonable label that we wish to define for our own data? This can feed the more deterministic policy engines and rule engines, and even be a part of a larger context-driven analysis as required. OpenAI does provide a means to "content moderate" with preset classifications that can determine if your text belongs to one or more of the more vile categories. However, this analysis is more about how we can get more custom to defining our own labels against a given sentence or phrase. We will look at 4 categories: viz. politics, PHI/PII, legal matters, and company performance. Given that we don't have the option of gathering probability scores from Open AI on such custom labels (at this point in time), we will try the more user-oriented prompt engineering route in Option 1 while Option 2 evaluates other pre-trained models from Hugging Face for the same. We will also go with some sample sentences that have been wontedly twisted to align with more than one category. For example, our CSV input file has the following lines as "payload": The issue between ministers took a tangent when they started making it personal. I tried to negotiate data privacy with my cat but he just ignored me and hacked my keyboard for a nap. The senate hearing was about whether a drug in trials could be used for this patient alone. He has a specific condition with his blood that does not have a medicine as yet. What started as a political debate ended up discussing company priorities for 2023 and beyond in terms of who has a better story with hyperscalers. The court's landmark decision on free speech ignited discussions on the fine line between expression and harmful content in online platforms- intertwining legal considerations with debates over online governance. I told my doctor a political joke during my PHI checkup now my medical record reads: Patient's sense of humor: dangerously bipartisan. User-managed access gives you the so-called benefit of controlling your identity; but then how many people scrutinize the app permissions on your phone that leverage first name-email-phone numbers? 
Option 1: Prompt Engineering With OpenAI

Python
from langchain.chat_models import ChatOpenAI
import pandas as pd
from langchain.prompts import PromptTemplate
from langchain.chains.llm import LLMChain
from IPython.display import HTML

model_name = 'gpt-4'
llm = ChatOpenAI(model_name=model_name, temperature=0)

# The prompt template has to be defined before the chain that uses it
moderationPrompt = PromptTemplate(
    template="""
    Please rate the article below in a continuous scale 0.00-100.00 based on the presence and applicability of each category:
    [ politics | PHI | legal | about company | none of these ]
    Definitions:
    phi: protected health information or personally identifiable information present
    politics: political decisions, governance, parties, elections, policies
    legal: agreement or contract language, judgements
    about company: company strategy or earnings or reports or predictions
    article:{payload}
    Output: python floats in square bracket list
    """,
    input_variables=["payload"]
)
payload_chain = LLMChain(llm=llm, prompt=moderationPrompt)

def read_csv_file(file_path):
    df = pd.read_csv(file_path)
    lines = df["payload"].dropna().tolist()
    return lines

def perform_OpenAIclassification(lines, model_name):
    classifications = []
    for idx, sentence in enumerate(lines, start=1):  # Start line numbers from 1 and increment
        if pd.notna(sentence):
            result = payload_chain.run(payload=sentence)
            result = result.strip('][').split(', ')
            result.insert(0, idx)
            result.insert(1, model_name)
            classifications.append(result)
    return classifications

if __name__ == "__main__":
    input_csv_file = "./input.csv"  # Replace with your CSV file path
    lines = read_csv_file(input_csv_file)
    result = perform_OpenAIclassification(lines, model_name)
    dfr = pd.DataFrame(result, columns=['line#', 'model', 'Politics', 'PHI/PII', 'Legal', 'About Company', 'None of these'])
    output_csv_file = "gptOutput.csv"
    dfr.to_csv(output_csv_file, index=False)

GPT-4 seems to be slightly better than its 3.5 turbo cousin at these twisted sentences. The output data frame would look like this. It does get the larger probability right most times, except for sentences like #3, where we would have expected some "%" to be associated with PHI/PII. It also makes a case for us to request OpenAI to provide some customization convenience to tag our own labels and leverage the faster and more "well-read" capability of such models.

line# model Politics PHI/PII Legal About Company None of these
1 gpt-4 100.00 0.00 0.00 0.00 0.00
2 gpt-4 0.00 0.00 0.00 0.00 100.00
3 gpt-4 100.00 0.00 0.00 0.00 0.00
4 gpt-4 70.00 0.00 0.00 30.00 0.00
5 gpt-4 70.00 0.00 85.00 0.00 0.00
6 gpt-4 10.00 20.00 0.00 0.00 70.00
7 gpt-4 0.00 50.00 0.00 0.00 50.00

Option 2: Zero Shot Classification With Models From Hugging Face Moving on, next, we try the same with pre-trained models from Hugging Face - in some ways purpose-driven for this task in particular.
Python
import pandas as pd
from transformers import pipeline
from IPython.display import HTML

# Function for zero-shot classification
def classify_with_model(text_to_classify, candidate_labels, model_name_or_path, multi_label=True):
    classifier = pipeline("zero-shot-classification", model=model_name_or_path)
    output = classifier(text_to_classify, candidate_labels, multi_label=multi_label)
    return output

def read_csv_file(file_path):
    df = pd.read_csv(file_path)
    lines = df["payload"].dropna().tolist()
    return lines

# Iterate through sentences and perform classification with multiple models
def perform_classification(lines, candidate_labels, model_options):
    classifications = []
    for model_name_or_path in model_options:
        model_classifications = []
        for idx, sentence in enumerate(lines, start=1):
            if pd.notna(sentence):
                result = classify_with_model(sentence, candidate_labels, model_name_or_path)
                model_used = model_name_or_path.split("/")[-1]
                result['scores'] = [round(i * 100, 2) for i in result['scores']]
                model_classifications.append(result['scores'])
                tempList = [idx, model_used, result['scores'][0], result['scores'][1],
                            result['scores'][2], result['scores'][3], result['scores'][4]]
                classifications.append(tempList)
    return classifications

if __name__ == "__main__":
    input_csv_file = "./input.csv"
    candidate_labels = ["Politics", "PHI/PII", "Legal", "Company performance", "None of these"]
    model_options = ["facebook/bart-large-mnli", "valhalla/distilbart-mnli-12-3",
                     "MoritzLaurer/DeBERTa-v3-large-mnli-fever-anli-ling-wanli"]
    lines = read_csv_file(input_csv_file)
    model_results = perform_classification(lines, candidate_labels, model_options)
    dfr = pd.DataFrame(model_results, columns=['line#', 'model', 'Politics', 'PHI/PII', 'Legal', 'About Company', 'None of these'])
    output_csv_file = "output_classifications.csv"
    dfr.to_csv(output_csv_file, index=False)
    display(dfr)

Note: the multi_label value is set to True. You could play around with it being False as well. Let us also use our own human expertise to review this output (last column).
We could use a simple index like this:
Reasonable - Stands for the engine picking the multiple labels accurately
Partially accurate - One of the 2 labels is accurate
Inaccurate - Obviously not as good

line# model Politics PHI/PII Legal About Company None of these Review
1 bart-large-mnli 69.47 34.74 0.85 0.21 0.03 Reasonable
2 bart-large-mnli 81.92 1.23 0.22 0.14 0.06 Inaccurate
3 bart-large-mnli 72.47 40.36 30.79 5.25 0.04 Reasonable
4 bart-large-mnli 86.27 28.26 14.43 0.39 0.03 Partially accurate
5 bart-large-mnli 68.21 35.23 18.78 13.24 0.02 Partially accurate
6 bart-large-mnli 98.53 90.45 6.31 0.73 0.02 Reasonable
7 bart-large-mnli 81.23 6.79 2.17 1.55 0.04 Inaccurate
1 distilbart-mnli-12-3 88.65 9.08 5.91 4.1 1.7 Partially accurate
2 distilbart-mnli-12-3 64.87 7.77 2.72 2.38 0.26 Inaccurate
3 distilbart-mnli-12-3 76.79 42.79 36.2 20.3 1.98 Reasonable
4 distilbart-mnli-12-3 60.8 49.22 9.91 6.68 0.45 Partially accurate
5 distilbart-mnli-12-3 82.97 55.31 41.59 15 0.99 Reasonable
6 distilbart-mnli-12-3 87.11 85.6 11.07 7.74 0.12 Reasonable
7 distilbart-mnli-12-3 79.02 6.58 3.31 1.18 0.95 Inaccurate
1 DeBERTa-v3-large-mnli-fever-anli-ling-wanli 36.51 1.27 0.15 0.14 0.02 Partially accurate
2 DeBERTa-v3-large-mnli-fever-anli-ling-wanli 17.58 0.72 0.4 0.05 0.03 Inaccurate
3 DeBERTa-v3-large-mnli-fever-anli-ling-wanli 95.69 59.7 26.89 0.45 0.07 Reasonable
4 DeBERTa-v3-large-mnli-fever-anli-ling-wanli 95.07 79.32 17.91 0.07 0.05 Partially accurate
5 DeBERTa-v3-large-mnli-fever-anli-ling-wanli 61.88 28.35 8.16 0.06 0.03 Partially accurate
6 DeBERTa-v3-large-mnli-fever-anli-ling-wanli 99.64 93.95 0.83 0.07 0.03 Reasonable
7 DeBERTa-v3-large-mnli-fever-anli-ling-wanli 2.48 1.41 0.08 0.06 0.04 Inaccurate

Too small a dataset to derive a concrete outcome, but all three models seem to be in a relatively comparable space for this task.

model Reasonable Partially accurate Inaccurate
bart-large-mnli 3 2 2
distilbart-mnli-12-3 3 2 2
DeBERTa-v3-large-mnli-fever-anli-ling-wanli 2 3 2

Summary Large language models are like a one-size-fits-all solution for many purposes. For scenarios where we have very little context to lean on and custom labels are required for zero-shot classification, we still have the option of going for alternatives trained on more special-purpose NLI (natural language inference) models such as those given above. The final choice for a given requirement could be based on performance (when used in real-time transactions), the extent of additional context that can effectively make this more deterministic, and the ease of integration for a given ecosystem. Note: A special word of thanks to those in forums who have corrected my code or shared suggestions on how to use these models better. Specifically, the Open AI forum had someone who shared this intuition on how best to query GPT to get at results that are not otherwise available through API calls.
Tuhin Chattopadhyay, CEO and Professor, Tuhin AI Advisory
Thomas Jardinet, IT Architect, Rhapsodies Conseil
Sibanjan Das, Zone Leader, DZone
Tim Spann, Principal Developer Advocate, Cloudera