The AI Shield: Configuring and leveraging GCP Model Armor for Robust LLM and Agentic AI Security

Rise of Agentic AI
The emergence of Agentic AI, that is, Large Language Model (LLM)-based systems capable of autonomous decision-making and action with minimal human intervention, marks a major paradigm shift across diverse sectors, including customer service, supply chain management, and financial services.
Although these systems, characterized by their autonomy and dynamic interaction capabilities, hold immense potential, they also present critical security and privacy challenges that leave them vulnerable to:
- Prompt Injection: Malicious actors can manipulate the agent’s behavior by inserting crafted prompts into its input stream.
- Data Poisoning: Corrupting the training data to introduce biases or vulnerabilities into the model.
- Model Extraction: Stealing the model’s parameters or architecture to create a copy or to understand its weaknesses.
- Unintended Consequences: The agent’s actions might have unforeseen and potentially harmful effects due to its complex decision-making process.
- Ethical Concerns: The agent’s behavior might violate ethical principles or societal norms.
How do we protect the AI agent? Well, just as physical armor protects a soldier, a concept called “Model Armor” aims to safeguard these intelligent agents from malicious attacks and unintended consequences.
This post explores the concept of Model Armor, specifically Google Cloud’s recently added Model Armor service, and outlines what it is and how to configure it to protect your AI workloads, including Agentic AI agents.
What is “Model Armor”?
Model Armor is a conceptual framework encompassing a suite of techniques and strategies designed to enhance the robustness, security, and resilience of AI models, particularly those deployed in agentic contexts. It’s not a single tool but rather a layered approach, addressing vulnerabilities at various stages of an agent’s lifecycle.
Key aspects include:
- Input Validation and Sanitization: Filtering and cleaning input data to prevent injection attacks and adversarial prompts.
- Output Monitoring and Control: Scrutinizing the agent’s actions and outputs for anomalies or potentially harmful behaviors.
- Model Hardening: Strengthening the underlying AI model against adversarial attacks, such as prompt injection, data poisoning, and model extraction.
- Runtime Monitoring and Anomaly Detection: Continuously observing the agent’s behavior and detecting deviations from expected patterns.
- Explainability and Interpretability: Providing insights into the agent’s decision-making process to identify and mitigate potential biases or errors.
- Policy Enforcement and Guardrails: Defining clear boundaries and rules for the agent’s actions, ensuring adherence to ethical and safety standards.
- Feedback Loops and Reinforcement Learning: Allowing the agent to learn from its mistakes and improve its safety over time.
Model Armor mitigates the risks listed previously by providing a multi-layered defense, ensuring that the agent operates safely and reliably.
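To make the layered approach concrete, here is a minimal, purely illustrative sketch (the helper names and check objects are hypothetical, not tied to any specific product) of how independent defensive layers can be chained around a model call:
# Illustrative only: chaining "model armor" layers around an LLM call (hypothetical helpers).
def guarded_call(prompt, llm, input_checks, output_checks):
    # Layer 1: input validation and sanitization
    for check in input_checks:
        verdict = check(prompt)            # e.g., regex filter, prompt-injection classifier
        if not verdict.allowed:
            return verdict.message         # refuse before the model ever sees the prompt
    # Layer 2: invoke the (hardened) model itself
    response = llm(prompt)
    # Layer 3: output monitoring and policy guardrails
    for check in output_checks:
        verdict = check(response)          # e.g., DLP scan, harmful-content filter
        if not verdict.allowed:
            return verdict.message         # block the response and log the event
    return response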
There are several solutions on the market that leverage this type of framework, such as the following:
- Prompt Security
- Meta’s Prompt Guard
- Kong’s AI Prompt Guard
- Various open-source options
However, Google Cloud recently introduced its own managed solution called Model Armor, and the remainder of this post focuses on this new service.
Google Cloud - Model Armor Service
Model Armor is a fully managed service provided by Google Cloud Platform which offers capabilities to enhance the safety and security of AI applications. By leveraging the concepts described previously in this post, Model Armor screens LLM prompts and responses for various types of security and safety risks.
GCP’s Model Armor offers the following core features:
- Universal Model and Cloud Compatibility: Operates independently of specific AI models or cloud platforms, enabling seamless integration across multi-cloud and multi-model environments.
- Centralized Policy Management: Provides a unified platform for managing and enforcing security and safety policies across all deployed AI models.
- API-Driven Integration: Offers a public REST API for direct integration of prompt and response screening into applications, supporting diverse deployment architectures.
- Granular Access Control: Implements Role-Based Access Control (RBAC) to precisely manage user permissions and access levels.
- Low-Latency Regional Endpoints: Delivers API access through regional endpoints to minimize latency and optimize performance.
- Global Availability: Deployed across multiple regions in the United States and Europe for broad accessibility.
- Security Command Center Integration: Seamlessly integrates with Security Command Center, allowing for centralized visibility, violation detection, and remediation.
- Enhanced Safety and Security, including:
- Comprehensive Content Safety Filters: Detects and mitigates harmful content, such as sexually explicit material, dangerous content, harassment, and hate speech.
- Advanced Threat Detection: Detects and prevents prompt injection and jailbreak attacks, safeguarding AI models from manipulation.
- Malicious URL Detection: Detects malicious URLs within prompts and responses.
- Integrated Data Loss Prevention (DLP): Leverages Google Cloud’s Sensitive Data Protection to discover, classify, and protect sensitive data (e.g., PII, intellectual property), preventing unauthorized disclosure.
- PDF Content Screening: Screens the text within PDF documents for malicious content.
Currently, Model Armor is available through the global endpoint modelarmor.googleapis.com or through regional endpoints in the following supported regions:
- United States
- Iowa (us-central1 region): modelarmor.us-central1.rep.googleapis.com
- Northern Virginia (us-east4 region): modelarmor.us-east4.rep.googleapis.com
- Oregon (us-west1 region): modelarmor.us-west1.rep.googleapis.com
- Europe
- Netherlands (europe-west4 region): modelarmor.europe-west4.rep.googleapis.com
Model Armor can be purchased as a standalone service or as part of Security Command Center; pricing for Model Armor can be found here.
Configuring GCP’s Model Armor
The following image (taken from Google’s Model Armor documentation) shows the standard reference architecture for Model Armor, in which an application uses Model Armor to protect both the LLM and the user.
IAM Requirements
Access to Model Armor can be controlled using robust IAM Roles, as shown below:
- modelarmor.admin and modelarmor.floorSettingsAdmin: used for administrators and owners
- modelarmor.user: used for users and applications that screen prompts and responses
- modelarmor.viewer: used for template viewers
- modelarmor.floorSettingsViewer: used for floor settings viewers
Enabling Model Armor
Model Armor can be configured either through the GCP Console or with the gcloud command-line tool. To point gcloud at a regional Model Armor endpoint, use the following command:
gcloud config set api_endpoint_overrides/modelarmor "https://modelarmor.LOCATION.rep.googleapis.com/"
Replace LOCATION in the above command with one of the supported regions where you would like to use Model Armor; otherwise, the default global endpoint modelarmor.googleapis.com is used.
Finally, use the following command to enable the Model Armor API:
gcloud services enable modelarmor.googleapis.com --project=PROJECT_ID
Templates
To screen prompts and responses with Model Armor, you must first create Model Armor Templates: sets of customized safety and security filters and thresholds that give you control over which content is flagged. You can configure the confidence thresholds and triggers for the following:
- Prompt Injection & Jailbreak Attacks — Detects and blocks manipulative inputs.
- Sensitive Data Leakage — Protects personally identifiable information (PII) and intellectual property.
- Malicious URLs — Identifies phishing links embedded in prompts or responses.
- Harmful Content — Filters explicit, violent, or biased outputs.
- PDF Content Scanning — Inspects text within PDFs for security risks.
These thresholds represent confidence levels, which indicate how confident the service is that the prompt and/or response contains the offending content.
The following example JSON shows confidence levels for the responsible AI content filters (this is the content of the responsible-ai-settings.json file referenced by the command below):
[
{ "filterType": "HATE_SPEECH", "confidenceLevel": "MEDIUM_AND_ABOVE" },
{ "filterType": "DANGEROUS", "confidenceLevel": "MEDIUM_AND_ABOVE"},
{ "filterType": "HARASSMENT", "confidenceLevel": "MEDIUM_AND_ABOVE" },
{ "filterType": "SEXUALLY_EXPLICIT", "confidenceLevel": "MEDIUM_AND_ABOVE" }
]
The following gcloud command creates a template in Model Armor; it also includes custom error codes and messages that will be returned if any of the security and safety filters report a match:
# create the model armor template
gcloud model-armor templates create demo-mdl-armor --location us-central1 \
--project $GCP_Project_ID --malicious-uri-filter-settings-enforcement=enabled \
--rai-settings-filters=./responsible-ai-settings.json \
--basic-config-filter-enforcement=enabled --pi-and-jailbreak-filter-settings-enforcement=enabled \
--pi-and-jailbreak-filter-settings-confidence-level=medium-and-above \
--template-metadata-custom-llm-response-safety-error-code=798 \
--template-metadata-custom-llm-response-safety-error-message="The content returned from the LLM has been reviewed for harmful or explicit content. Unfortunately, we cannot display this content." \
--template-metadata-custom-prompt-safety-error-code=799 \
--template-metadata-custom-prompt-safety-error-message="Unfortunately I cannot process that question, please refine your request and avoid any explicit content." \
--template-metadata-ignore-partial-invocation-failures --template-metadata-log-operations \
--template-metadata-log-sanitize-operations
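When referencing this template from application code (as in the examples below), the template is addressed by its full resource name. A minimal sketch, where the project ID is a placeholder and the location and template ID match the gcloud command above:
# Build the template's full resource name (placeholder values; adjust to your environment).
GCP_PROJECT_ID = "your-project-id"      # replace with your project ID
LOCATION = "us-central1"                # region the template was created in
TEMPLATE_ID = "demo-mdl-armor"          # template ID from the gcloud command above

gcp_model_armor_template = (
    f"projects/{GCP_PROJECT_ID}/locations/{LOCATION}/templates/{TEMPLATE_ID}"
)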
Model Armor Integration
Once the Model Armor template is created and ready to use, you have two options for validation and/or sanitization, which are described in detail below.
User Prompt Inspection/Sanitization
This sanitization process leverages the Model Armor service to validate and sanitize the user prompt before it is sent to the LLM, inspecting the prompt content against the template’s configured safety and security filters. The following Python code demonstrates how to integrate Model Armor:
# imports
from google.cloud import modelarmor_v1
...
# Create the Model Armor client against the regional endpoint
ml_armor_client = modelarmor_v1.ModelArmorClient(
    transport="rest",
    client_options={"api_endpoint": "https://modelarmor.us-central1.rep.googleapis.com"},
    credentials=creds)

# Inspect the user prompt against the Model Armor template
user_prompt_data = modelarmor_v1.DataItem()
user_prompt_data.text = prompt
request = modelarmor_v1.SanitizeUserPromptRequest(
    name=gcp_model_armor_template,
    user_prompt_data=user_prompt_data
)
response = ml_armor_client.sanitize_user_prompt(request)
If a match is found, you can stop the prompt from being sent to the LLM and display either your own custom message or the one configured in the Model Armor template, which is returned in the response from the Model Armor service.
The following error message is returned in the response if a match is discovered during the inspection of the user prompt:
"Unfortunately I cannot process that question, please refine your request and avoid any explicit content."
LLM Model Response Inspection/Sanitization
Similar to the inspection of the user prompt prior to LLM submission, you can inspect and sanitize the response returned from the LLM, as shown in the following Python code:
# imports
from google.cloud import modelarmor_v1
...
# Create the Model Armor client against the regional endpoint
ml_armor_client = modelarmor_v1.ModelArmorClient(
    transport="rest",
    client_options={"api_endpoint": "https://modelarmor.us-central1.rep.googleapis.com"},
    credentials=creds)

# Inspect the LLM response against the Model Armor template
llm_resp_data = modelarmor_v1.DataItem()
llm_resp_data.text = llm_response
request = modelarmor_v1.SanitizeModelResponseRequest(
    name=gcp_model_armor_template,
    model_response_data=llm_resp_data
)
response = ml_armor_client.sanitize_model_response(request)
As with the prompt inspection results, if a match is found you can prevent the LLM’s response from being displayed to the end user and instead show either your own custom message or the one configured in the Model Armor template, which is returned in the response from the Model Armor service.
The following error message is returned in the response if a match is discovered during the inspection of the LLM response:
"The content returned from the LLM has been reviewed for harmful or explicit content. Unfortunate we cannot display this content."
Sample Sanitization Output
The following is sample output from the Model Armor sanitization process; the same output format is used for both prompt inspection and LLM response inspection.
Sample response output:
filter_match_state: MATCH_FOUND
filter_results {
key: "sdp"
value {
sdp_filter_result {
inspect_result {
execution_state: EXECUTION_SUCCESS
match_state: NO_MATCH_FOUND
}
}
}
}
filter_results {
key: "rai"
value {
rai_filter_result {
execution_state: EXECUTION_SUCCESS
match_state: NO_MATCH_FOUND
rai_filter_type_results {
key: "sexually_explicit"
value {
match_state: NO_MATCH_FOUND
}
}
rai_filter_type_results {
key: "hate_speech"
value {
match_state: NO_MATCH_FOUND
}
}
rai_filter_type_results {
key: "harassment"
value {
match_state: NO_MATCH_FOUND
}
}
rai_filter_type_results {
key: "dangerous"
value {
match_state: NO_MATCH_FOUND
}
}
}
}
}
filter_results {
key: "pi_and_jailbreak"
value {
pi_and_jailbreak_filter_result {
execution_state: EXECUTION_SUCCESS
match_state: MATCH_FOUND
confidence_level: MEDIUM_AND_ABOVE
}
}
}
filter_results {
key: "malicious_uris"
value {
malicious_uri_filter_result {
execution_state: EXECUTION_SUCCESS
match_state: NO_MATCH_FOUND
}
}
}
filter_results {
key: "csam"
value {
csam_filter_filter_result {
execution_state: EXECUTION_SUCCESS
match_state: NO_MATCH_FOUND
}
}
}
sanitization_metadata {
error_code: 799
error_message: "Unfortunately I cannot process that question, please refine your request and avoid any explicit content."
}
invocation_result: SUCCESS
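When a match is found, it is often useful to log which filter fired. A minimal sketch that walks the filter_results map from the output above is shown below; each value is a per-filter result message (rai_filter_result, sdp_filter_result, and so on), so the sketch simply prints each one, and the exact attribute names should be confirmed against your installed client library.
# Minimal sketch: report the overall verdict and the per-filter results.
result = response.sanitization_result
print("Overall match state:", result.filter_match_state)

for filter_key, filter_result in result.filter_results.items():
    # filter_key is "rai", "sdp", "pi_and_jailbreak", "malicious_uris", or "csam";
    # printing the value shows its execution_state and match_state fields.
    print(f"--- {filter_key} ---")
    print(filter_result)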
Floor Settings
Floor settings in Model Armor are a mechanism for security architects and CISOs to control the minimum security posture for all Model Armor templates within a Google Cloud resource hierarchy. They define rules that dictate minimum requirements for all Model Armor templates created at a specific point in the hierarchy (that is, at the organization, folder, or project level). Using floor settings prevents individual developers from accidentally or intentionally lowering security standards below acceptable levels.
In cases of conflicting floor settings, the setting defined at the lower level of the resource hierarchy is applied. For instance, a project-level setting overrides a folder-level setting.
To illustrate Model Armor floor settings using a real example:
- A folder has policy X enabled, which activates malicious URL filtering.
- A project inside that folder has policy Y, requiring prompt injection and jailbreak detection with medium confidence.
- Any Model Armor template created within the project will enforce policy Y.
- Templates outside that project’s parent folder will not enforce policy X.
If Security Command Center Premium tier or Enterprise tier is used, floor setting violations trigger security findings. To ensure compliance with newly established floor settings, Security Command Center will highlight any pre-existing templates with less stringent security settings, prompting you to take corrective action.
Enabling and Configuring Model Armor Floor Settings
To enable and update floor settings run the following gcloud command:
gcloud model-armor floorsettings update \
--malicious-uri-filter-settings-enforcement=ENABLED \
--pi-and-jailbreak-filter-settings-enforcement=DISABLED \
--pi-and-jailbreak-filter-settings-confidence-level=LOW_AND_ABOVE \
--basic-config-filter-enforcement=ENABLED \
--add-rai-settings-filters='[{"confidenceLevel": "low_and_above", "filterType": "HARASSMENT"}, {"confidenceLevel": "high", "filterType": "SEXUALLY_EXPLICIT"}]' \
--full-uri='folders/FOLDER_ID/locations/global/floorSetting' \
--enable-floor-setting-enforcement=true
The --full-uri flag can reference the project, folder, or organization level, as shown below:
- Project:
projects/PROJECT_ID/locations/global/floorSetting
- Folder:
folders/FOLDER_ID/locations/global/floorSetting
- Organization:
organizations/ORG_ID/locations/global/floorSetting
Example Violation
Model Armor detects high-severity security violations when templates don’t meet the minimum floor settings for their place in the resource hierarchy. This occurs when templates lack required filters or fail to meet minimum confidence levels, triggering alerts/findings in Security Command Center. The following example shows the source_properties field of a finding for a floor settings violation.
{
"filterConfig": {
"raiSettings": {
"raiFilters": [
{
"filterType": "HATE_SPEECH",
"confidenceLevel": {
"floorSettings": "LOW_AND_ABOVE",
"template": "MEDIUM_AND_ABOVE"
}
},
{
"filterType": "HARASSMENT",
"confidenceLevel": {
"floorSettings": "MEDIUM_AND_ABOVE",
"template": "HIGH"
}
}
]
},
"piAndJailbreakFilterSettings": {
"confidenceLevel": {
"floorSettings": "LOW_AND_ABOVE",
"template": "HIGH"
}
},
"maliciousUriFilterSettings": {
"floorSettings": "ENABLED",
"template": "DISABLED"
}
}
}
Logging and Auditing
Model Armor is an auditable resource in GCP, so all actions are logged to Cloud Logging, and the entries can be filtered on the service name: protoPayload.serviceName="modelarmor.googleapis.com".
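For example, a minimal sketch using the google-cloud-logging client library to pull recent Model Armor audit entries (the project ID and result limit are placeholders):
# Minimal sketch: read recent Model Armor audit log entries with google-cloud-logging.
from google.cloud import logging as cloud_logging

logging_client = cloud_logging.Client(project="your-project-id")  # replace with your project ID
log_filter = 'protoPayload.serviceName="modelarmor.googleapis.com"'

for entry in logging_client.list_entries(filter_=log_filter, max_results=20):
    # Each entry is an audit record for a Model Armor operation.
    print(entry.timestamp, entry.log_name)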
For more details see the official GCP documentation.
Live Demo
The following video shows a live demonstration of a simple chat application built on Ollama and LangChain that leverages GCP’s Model Armor for security and responsible AI protection.
Conclusion
As LLMs and Agentic AI become more sophisticated, the need for robust security, safety, and responsibility will only increase. Tools like GCP’s Model Armor are only a starting point for this level of protection, so I will leave you with some final guiding principles to increase the security and robustness of your LLM and Agentic AI workloads:
- Input Validation and Sanitization:
- Always implement robust input filtering to block potentially harmful characters or code snippets, which can be covered by tools like Model Armor (a minimal local pre-filter example follows this list).
- Use regular expressions or natural language processing techniques to detect and neutralize adversarial prompts.
- Employ input validation libraries to enforce data type and format constraints.
- Output Monitoring and Control:
- Establish a monitoring system to track the agent’s actions and outputs.
- Implement rule-based or machine learning-based anomaly detection to identify unusual behavior.
- Use output filtering to block potentially harmful or sensitive information (covered by tools like Model Armor)
- Implement a “kill switch” that allows a human operator to interrupt the agent if needed.
- Model Hardening:
- Train the model on diverse and robust datasets to improve its resilience to adversarial attacks.
- Implement adversarial training techniques to expose the model to potential attacks and improve its robustness.
- Use techniques like differential privacy to protect against model extraction.
- Utilize techniques like prompt hardening and prompt engineering to limit the effectiveness of prompt injection attacks (covered by tools like Model Armor).
- Runtime Monitoring and Anomaly Detection:
- Continuously monitor the agent’s resource usage, network activity, and behavior patterns.
- Use machine learning models to detect anomalies and trigger alerts.
- Implement logging and auditing mechanisms to track the agent’s actions and decisions.
- Explainability and Interpretability:
- Use techniques like LIME or SHAP to explain the agent’s decision-making process.
- Visualize the agent’s internal states and activations to gain insights into its behavior.
- Implement methods to trace the agent’s reasoning back to its input data.
- Policy Enforcement and Guardrails:
- Define clear rules and constraints for the agent’s actions.
- Use formal verification techniques to ensure that the agent adheres to these rules.
- Implement a policy engine to enforce access control and data privacy.
- Employ techniques like Reinforcement Learning from Human Feedback (RLHF) to align agent behavior with human values.
- Feedback Loops and Reinforcement Learning:
- Create a system that allows for human feedback to be entered into the system.
- Utilize reinforcement learning to reinforce safe behaviors, and penalize unsafe ones.
- Create a system that allows the agent to learn from its past actions and improve its safety over time.
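As a lightweight, local complement to a managed service like Model Armor, the sketch below illustrates the kind of regex-based input pre-filter mentioned under Input Validation and Sanitization above. The patterns are examples only and are in no way a complete defense:
import re

# Illustrative pre-filter: a few example patterns only, not a complete defense.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all|any|previous) instructions", re.IGNORECASE),
    re.compile(r"\bsystem prompt\b", re.IGNORECASE),
    re.compile(r"<script\b", re.IGNORECASE),  # crude markup/code-injection check
]

def prefilter_prompt(prompt: str, max_length: int = 4000) -> bool:
    """Return True if the prompt passes basic local checks before deeper screening."""
    if len(prompt) > max_length:
        return False
    return not any(p.search(prompt) for p in SUSPICIOUS_PATTERNS)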
Future research should focus on:
- Developing more effective adversarial training techniques.
- Creating more robust anomaly detection systems.
- Improving the explainability and interpretability of complex AI models.
- Establishing ethical guidelines and standards for Agentic AI development.
By prioritizing security and safety, we can ensure that complex LLM workloads and Agentic AI benefit society while mitigating their potential risks.
All the code presented in this post can be found on my GitHub repository for LLM Security.