AI Guardrails

Arvind Saraf
8 min read · Aug 31, 2023


Introduction

Artificial intelligence is becoming more prevalent in daily life. As a new yet powerful technology, it risks being deliberately or inadvertently misused to harm an individual or a group. Rights could easily be challenged should some guardrails not be implemented in these systems. This document proposes some ideas on guidelines/guardrails that a conscientious organization can put around its AI systems, or that can become part of a government regulatory framework for AI. We also take a quick look at some existing standards & tools.

Understanding AI

AI systems can be broadly classified into multiple categories. A typical release of an AI system deploys a new ML model, which usually follows certain standard processes. The kinds of AI issues depend on the type of AI system, and the lifecycle offers intervention points to prevent these issues.

Below are some broad categories, perhaps non-exhaustive, that most AI systems can be classified into. Please note that this is not a standard published classification, but a set of categories based on the kinds of prevalent AI outputs, inputs & risks, chosen to better align with the objective of managing risks. An AI system may be, and is even likely to be, a combination of the categories below rather than falling into just one.

Discriminative AI

These systems use the organization's internal data to either build further value or expose that data to customers. No data from outside the system, or publicly available data, is used. Examples:

  1. E-commerce product recommendation system: Uses product metadata and/or users’ prior behavior on the e-commerce site to recommend newer products of interest.
  2. Customer support chatbot, say for order status and product details. It uses the order & product catalog as the source of truth. Even if the language inference & responses are trained from a public large language model (LLM), the information being exposed is still company-internal.

Summarization AI

These systems expose existing publicly available information, possibly combined, in a structured or summarized format. Primarily, no new content is created. Unlike Discriminative AI, these systems may use publicly available information. Examples:

  1. An LLM falls into this category because it presents information from the internet, though LLMs also go a fair bit beyond information compilation & do generation too.
  2. Any encoding system, e.g. CLIP text-embedding models, will fall into this category.

Generative AI

Generative artificial intelligence, or generative AI, is a type of artificial intelligence (AI) system capable of generating text, images, or other media in response to prompts. Generative AI models learn the patterns and structure of their input training data and then generate new data that has similar characteristics. Examples:

  1. Image & video diffusion models such as DALL-E, Midjourney, and Runway fall into this category.
  2. ChatGPT. Beyond a point, combining existing information starts to resemble creative generation, introducing additional risks around copyright etc., hence the rationale to differentiate it from Summarization AI.

Agent

Most of the AI systems described above may show information, and even recommend a course of action to a user, but do not necessarily act on another system. An AI agent actually acts on the decision, on another set of systems, usually via the external system's API. Acting without human vetting may introduce additional risks, hence the differentiation.

For the purpose of this classification, we call the Agent the logic acting on the external system, excluding any AI that goes into the decision-making. Examples:

  1. A chatbot that, based on the conversation, allows users to place or cancel orders.
  2. A financial-instrument trading agent that executes buy or sell transactions on equities.
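Since the added risk comes from acting without human vetting, one common guardrail is a human-approval gate between the agent's decision and the external API call. Below is a minimal sketch in Python; the OrdersApiClient class and ask_human_for_approval hook are hypothetical placeholders, not a real order-management library:

```python
# Minimal sketch of a human-approval guardrail around an agent's external action.
# `OrdersApiClient` and `ask_human_for_approval` are hypothetical placeholders;
# any real order-management API would be wrapped the same way.

class OrdersApiClient:
    def cancel_order(self, order_id: str) -> None:
        print(f"Order {order_id} cancelled via external API")

def ask_human_for_approval(action_description: str) -> bool:
    """Stand-in for a UI prompt, a ticketing step, or a reviewer queue."""
    answer = input(f"Approve action? {action_description} [y/N] ")
    return answer.strip().lower() == "y"

def guarded_cancel(client: OrdersApiClient, order_id: str) -> None:
    """The agent proposes the action; a human vets it before execution."""
    if ask_human_for_approval(f"Cancel order {order_id}"):
        client.cancel_order(order_id)
    else:
        print("Action rejected; nothing was sent to the external system.")

guarded_cancel(OrdersApiClient(), "ORDER-123")
```

For low-risk, high-volume actions the approval hook could be replaced by automated policy checks, but the structure of the gate stays the same.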

Embodied AI (robots etc)

Embodied AI systems have a physical form & move in the environment, bringing in a different set of risks around safety, etc. As with an Agent, we exclude the AI logic from this definition, to let us segregate the risks of the physical form.

It's easy to see how real physical systems combine multiple of these categories:

  1. A customer-support chatbot is an Agent using Discriminative AI.
  2. Trading bot: Agent, Discriminative, Summarization.
  3. Siri: Summarization, Generative, and, if it can pull up information on your behalf, say from your Excel sheets or email, also Discriminative & Agent.

An AI system may output text, images, video, or software code, or any combination of these. The control mechanisms for an AI system depend on the format of its output too.

AI lifecycle: ML ModelOps

Unlike a traditional software system, where the decision logic is coded by developers, an AI system learns its logic as a model from a sufficient number of examples (ground truth) fed to it. Any updates, corrections, and improvements typically involve updating the model, possibly even changing the type of model, which requires code changes.

This model development, release, monitoring & update process is fast becoming a mature workflow & is called MLOps (drawing a parallel with DevOps, the similar process for software). The following MLOps stages indicate possible points at which AI regulatory interventions can be brought in.

  1. Design: The AI problem definition — i.e. inputs to the model, the output, constraints, checks & requirements are laid out. This is like the software requirements document.
  2. Data collection/labeling: Machine learning learns from the data fed to it. This data could be fetched from public sources or live software systems, or could be manually captured ("labeled") by human annotators in annotation software. Sufficient, correct & representative data is vital for the right AI model to be learned.
  3. Model selection & coding: Different AI models work for different kinds of problems. The right model must be selected: either a publicly available/open-source model, possibly modified, or a fresh model implemented from scratch.
  4. Training: The implemented model must be trained on the collected data to be usable. Training samples from the collected data, and the sampling process offers some intervention points to handle possible model issues.
  5. Evaluation: AI models are probabilistic, not deterministic. Various metrics of model performance, such as precision and recall on a held-out subset of the collected data, sliced further into categories, are used to establish the statistical soundness of the model before it is put into production.
  6. Post-release: Any software system needs constant monitoring. The real-world problem that the software or ML model addresses may change, or some cases/situations may not be handled well. Constant monitoring, the ability to report an issue & quickly repair it, and periodic improvement/retraining of the models remain important after the initial model deployment.

A good MLOps system should follow the above processes, with provision for reviews/evaluations at each stage, and the ability to roll back a model quickly should a bug or an issue be detected.
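To make the Evaluation stage and the rollback provision concrete, here is a minimal sketch of a pre-deployment gate that checks precision and recall on every data slice before a candidate model is promoted. It assumes a pandas DataFrame with a label column and a user_segment slice column, a scikit-learn-style model, and illustrative thresholds; none of these details come from the article itself:

```python
# Minimal sketch of a pre-deployment evaluation gate.
# Thresholds, the slice column name, and the model interface are assumptions.
from sklearn.metrics import precision_score, recall_score

MIN_PRECISION = 0.90   # assumed acceptance thresholds
MIN_RECALL = 0.80

def evaluation_gate(model, eval_df, slice_column="user_segment"):
    """Return True only if every data slice meets the metric thresholds."""
    for slice_value, subset in eval_df.groupby(slice_column):
        features = subset.drop(columns=["label", slice_column])
        preds = model.predict(features)
        precision = precision_score(subset["label"], preds)
        recall = recall_score(subset["label"], preds)
        if precision < MIN_PRECISION or recall < MIN_RECALL:
            # Fail the release for this candidate; keep serving the previous model.
            print(f"Slice '{slice_value}' failed: "
                  f"precision={precision:.2f}, recall={recall:.2f}")
            return False
    return True

# Example use in a release pipeline (hypothetical rollback hook):
# if not evaluation_gate(candidate_model, holdout_data):
#     rollback_to_previous_model()
```

Slicing the evaluation set by user segment is what turns an overall accuracy check into a basic bias check as well.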

Model deployment scenarios

Depending on the use case, the AI model may be trained on, or run inference on, the server or the customer's device. These choices matter for risks such as those around privacy.

AI system requirements

An AI system, like any software, must solve a certain use case. It should do so while respecting basic human principles & rights, often enshrined in the law of the land. Regulatory frameworks bring in structure to ensure these principles are not violated, and that there are mechanisms to fix violations promptly if & when they occur.

An AI system should:

  1. Be non-misleading: An AI system should be correct (sufficiently, in a statistical sense) for all users as well as for individual groups of people ("unbiased"). Even unintentional inaccuracies or biases can mislead users.
  2. Be harmless: An AI system should be safe for humans, physically and mentally (non-toxic). This requires ensuring it is physically safe & unbiased. Even an inadvertent flaw could lead to an agent hoarding resources and depriving humans or other systems of them (non-resource hoarding).
  3. Respect individual choices: An AI system often captures individual information that must be securely managed. The individual's choice to control both how this information is revealed or used (privacy) and what information the user should see (censorship controls) should be respected by the AI system.
  4. Encourage creativity & innovation, i.e. protect copyright & intellectual property.
  5. Have adequate processes & controls: No system is perfect, but processes & controls reduce the likelihood of errors & the time to fix any errors/issues seen. Auditable and accountable systems ensure mandated visibility/transparency of the system's status, logic & process. A formalized customer redressal process with SLAs allows unexpected issues to be addressed.

Thus, an AI system must be the following:

  1. Correct
  2. Unbiased
  3. Safe
  4. Non-toxic
  5. Non-resource hoarding
  6. With Censorship controls
  7. Privacy-aware
  8. Auditable & accountable
  9. Redressable for Customer complaints

Proposed AI guardrails

  1. Any organization building AI systems begins by deciding the kinds of AI it is implementing, and the associated risks. One classification framework is the one in this document.
  2. Based on the risk assessment, some or all of the controls/reporting mechanisms must be implemented, at the stated frequency where required.
  3. For many controls, existing industry standards, or companies offering ready-made controls, can be reviewed & a subset of those recommended for use, along with best practices for using them. A preliminary review of existing industry standards is below.
  4. Additionally, the org should implement:
  • Periodic risk assessment (at least quarterly?)
  • Exposing API endpoints to test the system, as described below in the Auditability/Accountability control, to an internal or external authority, if there is one.

Exceptions can be made but must be called out explicitly in the design, e.g. if using an attribute is core to the AI's functioning & correctness and doesn't violate any principles (for instance, a medical diagnostic AI will evaluate differently based on regional propensities for certain conditions, and this is acceptable).

We detail the interventions for the individual system requirements below. The issues & mechanisms to control them are active research areas. One should treat this document as a snapshot of the current state of the art, to be constantly updated as technology advances.

Requirements & interventions

This document lists the issues & interventions in each type of AI.

Existing standards/tools to implement the AI goals

  • Opting out from being used/scanned by (Gen) AI systems (a minimal crawler-side check is sketched below):
  1. Consent-respecting models: e.g. Spawning's APIs, whose consent signals AI systems have to respect
  2. ai.txt (inspired by robots.txt)
  3. Other web mechanisms: robots.txt to prevent scanning, the noindex metadata tag on pages, Terms & Conditions (though these are typically not scanned)
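To illustrate the crawler side of these opt-out mechanisms, here is a minimal sketch using Python's standard urllib.robotparser to check robots.txt before a page is fetched for training data; the crawler name and URLs are hypothetical, and an ai.txt or Spawning consent check would slot into the same place:

```python
# Minimal sketch: respect robots.txt before collecting a page as AI training data.
# The user-agent string and URLs below are hypothetical placeholders.
from urllib import robotparser

CRAWLER_NAME = "ExampleAITrainingBot"

def may_collect(page_url: str, robots_url: str) -> bool:
    """Return True only if robots.txt permits this crawler to fetch the page."""
    parser = robotparser.RobotFileParser()
    parser.set_url(robots_url)
    parser.read()  # fetches and parses the site's robots.txt
    return parser.can_fetch(CRAWLER_NAME, page_url)

if __name__ == "__main__":
    url = "https://example.com/gallery/artwork-123"
    if may_collect(url, "https://example.com/robots.txt"):
        print("Allowed to fetch", url)
    else:
        print("Skipping", url, "- disallowed by robots.txt")
```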
  • Standard options for media provenance:
  1. Coalition for Content Provenance and Authenticity (C2PA): an open technical standard (created by Adobe, Arm, Intel, Microsoft, etc.) that gives publishers, creators, and consumers the ability to trace the origin of different types of media. Handles privacy and provenance (attribution/copyright), even for AI-generated images.
  2. Google's IPTC photo metadata:
  • AI generation: possibly use the "Special Instructions" field or custom fields
  • Copyright Notice and Rights Usage Terms fields
  • Confidence score: possibly use custom fields
  • Source and Credit fields can be used for the original sources used in AI training
  3. As a non-owning platform, allow companies to at least implement either of the two, with specific fields to mention.

  • Toxicity:
  1. Text:
  • Perspective API (by Google's Jigsaw) to rate text content, and filter & score it on toxicity.
  • Detoxify: an open-source model for toxic comment classification (a minimal usage sketch follows this list).
  2. Image and video: toxicity detection tools for these formats are yet to be identified, or requirements created for them.

  • Bias:
  1. ISO/IEC TR 24027:2021 Information technology — Artificial intelligence (AI) — Bias in AI systems and AI aided decision making.
  • Red team datasets (useful for Toxicity, Bias):
  1. Meta’s adversarial dialogue dataset
  2. Anthropic red team attempts
  3. AI2's RealToxicityPrompts dataset
  • New training approaches to reduce Toxicity:
  1. Anthropic's Constitutional AI approach: create a written standard (a constitution) & additionally a model for harmfulness evaluation, as proposed by that approach, alongside reinforcement learning from human feedback (RLHF)
  2. Meta Research’s Dynabench
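As a concrete example of the text-toxicity tooling above, here is a minimal sketch that screens generated text with the open-source Detoxify model before it reaches a user. The threshold value is an assumption to be tuned per product & policy, not a recommendation from any standard:

```python
# Minimal sketch: screen generated text for toxicity with the open-source Detoxify model.
# pip install detoxify   (downloads a pretrained toxic-comment classifier)
from detoxify import Detoxify

TOXICITY_THRESHOLD = 0.7  # assumed cut-off; tune per product & policy

detector = Detoxify("original")  # pretrained English toxic-comment model

def is_safe_to_show(text: str) -> bool:
    """Return True if the text scores below the toxicity threshold."""
    scores = detector.predict(text)  # dict: toxicity, severe_toxicity, insult, threat, ...
    return scores["toxicity"] < TOXICITY_THRESHOLD

candidate_reply = "Example model output to be screened."
if is_safe_to_show(candidate_reply):
    print(candidate_reply)
else:
    print("[response withheld by toxicity guardrail]")
```

The same gate could call the Perspective API instead of a local model; the trade-off is an external dependency and per-request quota versus hosting the classifier yourself.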


Written by Arvind Saraf

Arvind (http://www.linkedin.com/in/arvind-saraf/) is a Computer engineer (IIT, MIT, Google) turned technology/impact entrepreneur.