DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • AI Agents vs LLMs: Choosing the Right Tool for AI Tasks
  • Small Language Models as Control Planes to Steer Complex Systems
  • LangGraph Beginner to Advanced: Part 2 — Hello World Graph in LangGraph
  • DSPy Framework: A Comprehensive Technical Guide With Executable Examples

Trending

  • Stop Using the ATM-Didn’t-Kill-Jobs Story to Reassure Developers About AI
  • Java Backend Development in the Era of Kubernetes and Docker
  • Java in a Container: Efficient Development and Deployment With Docker
  • Improving Java Application Reliability with Dynatrace AI Engine
  1. DZone
  2. Data Engineering
  3. AI/ML
  4. Beyond the Prompt: Unmasking Prompt Injections in Large Language Models

Beyond the Prompt: Unmasking Prompt Injections in Large Language Models

Prompt Injections in Large Language Models - Uncovering the essence of prompt injections within LLMs, unraveling their execution, and examining strategies for prevention.

By 
Tushar Chugh user avatar
Tushar Chugh
·
Rolly Seth user avatar
Rolly Seth
·
SAMBANDH BHUSAN DHAL user avatar
SAMBANDH BHUSAN DHAL
·
Sep. 15, 23 · Review
Likes (3)
Comment
Save
Tweet
Share
3.5K Views

Join the DZone community and get the full member experience.

Join For Free

Before diving into the specifics of prompt injections, let's first grasp an overview of LLM training and the essence of prompt engineering:

Training

Training Large Language Models (LLMs) is a nuanced, multi-stage endeavor. Two vital phases involved in LLM training are:

Unsupervised Pre-Training

LLMs are exposed to vast amounts of web-based data. They attempt to learn by predicting subsequent words in given sequences. This stored knowledge is encapsulated in the model's billions of parameters, similar to how smart keyboards forecast users' next words.

Refinement via RLHF

Despite pre-training providing LLMs with vast knowledge, there are shortcomings:

  1. They can't always apply or reason with their knowledge effectively.
  2. They might unintentionally disclose sensitive or private information.

Addressing these issues involves training models with Reinforcement Learning using Human Feedback (RLHF). Here, models tackle diverse tasks across varied domains and generate answers. They're rewarded based on the alignment of their answers with human expectations — akin to giving a thumbs up or down for their generated content.

This process ensures models' outputs align better with human preferences and reduces the risk of them divulging sensitive data.

Inference and Prompt Engineering

Users engage with LLMs by posing questions, and in turn, the model generates answers. These questions can range from simple to intricate. Within the domain of LLMs, such questions are termed "prompts." A prompt is essentially a directive or query presented to the model, prompting it to provide a specific response.

Even a minor alteration in the prompt can lead to a significantly varied response from the model. Hence, perfecting the prompt often requires iterative experimentation. Prompt engineering encompasses the techniques and methods used to craft impactful prompts, ensuring LLMs yield the most pertinent, precise, and valuable outputs. This practice is crucial for harnessing the full capabilities of LLMs in real-world applications.

How Are Prompts Used With LLM Applications

In many applications, while the task remains consistent, the input for LLM varies. Typically, a foundational prompt is given to set the task's context before supplying the specific input for a response.

For instance, imagine a web application designed to summarize the text in one line.

Behind the scenes, a model might receive a prompt like: "Provide a one-line summary that encapsulates the essence of the input." 

Though this prompt remains unseen by users, it helps guide the model's response. Regardless of the domain of text users wish to summarize, this constant, unseen prompt ensures consistent output. Users interact with the application without needing knowledge of this backend prompt.

Frequently, prompts can be intricate sets of instructions that require significant time and resources for companies to develop. Companies aim to keep these proprietary, as any disclosure could erode their competitive advantage. 

What Is Prompt Injection and Why Do We Care About It?

It's crucial to understand from the earlier discussion that LLMs derive information from two primary avenues:

  1. The insights gained from the data during their training.
  2. The details provided within the prompt might include user-specific data like date of birth, location, or passwords. 

LLMs are usually expected not to disclose sensitive or private information or any content they're instructed against sharing. This includes not sharing the proprietary prompt as well. Considerable efforts have been made to ensure this, though it remains an ongoing challenge.

Users often experiment with prompts, attempting to persuade the model into revealing information it's not supposed to disclose. This tactic of using prompt engineering to extract the outcome of what the developer unintended is termed "prompt injection." 

Here are some instances where prompt injections have gained attention:

  1. ChatGPT grandma hack became popular where a user could persuade the system to share Windows activation keys: Chat GPT Grandma Has FREE Windows Keys! — YouTube. 
  2. Recently the Wall Street Journal has published an article on the same: With AI, Hackers Can Simply Talk Computers Into Misbehaving - WSJ. 
  3. Example of indirect prompt injection through a web page: [Bring Sydney Back]
  4. Securing LLM Systems Against Prompt Injection | NVIDIA Technical Blog
  5. Google AI red team lead talks real-world attacks on ML • The Register
  6. Turning BingChat into Pirate: Prompt Injections are bad, mkay? (greshake.github.io)

Points on Prompt Injection Attacks

A prompt typically comprises several components: context, instruction, input data, and output format. Among these, user input emerges as a critical point susceptible to vulnerabilities and potential breaches. Hackers might exploit this by feeding the model with unstructured data or by subtly manipulating the prompt, employing key principles of prompting like inference, classification, extraction, transformation, expansion, and conversion. Even the provided input or output formats and model, particularly when used with certain plugins, can be tweaked using formats such as JSON, XML, HTML, or plain text. Such manipulations can compel the LLM to divulge confidential data or execute unintended actions.

AI prompt

Depending on how the model is exploited, the AI industry classifies this spectrum of attacks as:

  1. Prompt Injection
  2. Prompt Leaking
  3. Token Smuggling
  4. Jailbreaking

The image below highlights these prevalent security risk categories and illustrates that prompt hacking encompasses a broad spectrum of potential vulnerabilities and threats.

security risks with LLM prompting

Ways To Protect Yourself

This domain is in a state of continuous evolution, and it is anticipated that within a few months, we will likely see more advanced methods for safeguarding against prompt hacking. In the interim, I've identified three prevalent approaches that individuals are employing for the same purpose.

examples of learning


Learn Through Gamification  

Currently, Gandalf stands out as a popular tool that transforms the testing and learning of prompting skills into an engaging game. In this chat-based system, the objective is to uncover a secret password hidden within the system by interacting with it through prompts and posing questions.

Twin LLM Approach 

The approach involves categorizing the model into two distinct groups: trusted or privileged models and untrusted or quarantined models. Learning is then restricted to occur exclusively from the trusted models, with no direct influence from prompts provided by untrusted users.

Track and Authenticate the Accuracy of Prompts

This requires detecting attacks happening in the Input and Output of the prompt. Rebuff provides an open-source framework for projection injection detection.

AI Language model

Opinions expressed by DZone contributors are their own.

Related

  • AI Agents vs LLMs: Choosing the Right Tool for AI Tasks
  • Small Language Models as Control Planes to Steer Complex Systems
  • LangGraph Beginner to Advanced: Part 2 — Hello World Graph in LangGraph
  • DSPy Framework: A Comprehensive Technical Guide With Executable Examples

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook