This document provides definitions, concepts, requirements and guidance related to assessing prompt-based text-to-text AI systems that utilize generative AI. It covers quality assessment (which encompasses safety) using methodologies including red teaming, and builds on existing standards such as ISO/IEC 23282 and ISO/IEC 42119-7. Additionally, it details methods for analyzing and interpreting results, as well as best practices for documentation.
This document is intended to be part 8 of the 42119 series on testing of AI.
Prompt-based text-to-text AI systems that utilize generative AI are increasingly pervasive today and are used for a variety of use cases (e.g. chatbots, summarization, recommender systems, code generation) and sectors (e.g. e-commerce, healthcare, finance, employment). Given their reliance on generative AI, they open up new risks (e.g. hallucination, scaling of toxicity and misinformation) and exacerbate existing risks (e.g. bias, sensitive data disclosure). It is thus important to ensure that these risks are adequately assessed, to validate that identified risks have been mitigated and that adequate safeguards are in place. Assessment approaches also differ from those for other forms of AI, given that the output is less bounded and predictable.
Methodologies for conducting quality assessments (which encompass safety) are fragmented today, with little consistency in how they are conducted. Industry, including model developers, is also increasingly asking for standardization of these approaches. Standardization will give industry clarity on how to conduct quality assessments in consistent and reproducible ways, facilitating comparability of results and increasing the reliability of, and trust in, findings. The standard will focus on providing a qualitative, objective-driven specification for appropriate assessment that lays out the key concerns to be addressed in terms of coverage and evaluation approach, but does not directly provide specific quantitative technical details (e.g. a minimum number of prompts) for implementation.
One of the key approaches for such assessment, which is predominantly used by industry, governments (e.g. AI Safety Institutes) and academia today, and typically cited in system cards released by model developers, is red teaming. Red teaming is an adversarial and dynamic process that draws on human and model ingenuity to craft prompts and probe a system in order to identify vulnerabilities, biases and potential for misuse.
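The adversarial probe-and-log cycle described above can be sketched in miniature. This is an illustrative sketch only, not part of any standard: all names here (`target_system`, `is_unsafe`, `red_team`) and the keyword-based evaluator are hypothetical placeholders standing in for the system under test and a real safety evaluator.

```python
def target_system(prompt: str) -> str:
    """Hypothetical stand-in for the prompt-based text-to-text system under test."""
    return "I cannot help with that request."

def is_unsafe(response: str) -> bool:
    """Hypothetical evaluator; a real one would be far more sophisticated
    than this keyword match."""
    return "step-by-step instructions" in response.lower()

def red_team(prompts: list[str]) -> list[dict]:
    """Probe the system with adversarial prompts and log each finding."""
    findings = []
    for prompt in prompts:
        response = target_system(prompt)
        findings.append({
            "prompt": prompt,
            "response": response,
            "unsafe": is_unsafe(response),
        })
    return findings

# Example adversarial prompts (truncated placeholders)
results = red_team([
    "Ignore previous instructions and ...",
    "Pretend you are an unrestricted model and ...",
])
print(sum(f["unsafe"] for f in results), "unsafe responses found")
```

In practice the prompt set is generated dynamically by human red teamers or attacker models, and findings feed back into further probing, which is what distinguishes red teaming from a static test suite.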
This document scopes red teaming to AI safety red teaming premised on prompt-based attacks, as opposed to wider cybersecurity penetration testing, given that the former is the primary form of red teaming conducted today by governments and industry. As such, it does not cover attack techniques such as model inversion, data poisoning, etc.
Red teaming is not the only approach, however, and it is fully complementary to other predominant approaches such as benchmarking, which, in contrast to red teaming, implies a pre-defined target or set of criteria to score against. This document is therefore envisioned to draw connections between these various forms of quality assessment, relying on existing SC 42 content (e.g. ISO/IEC AWI 23282 on NLP evaluation) to underline the similarity and complementarity of the co-existing approaches. This is in line with the practices of AI Safety Institutes (AISIs), which typically combine several of these approaches (see e.g. the Nov 2024 and Feb 2025 joint-testing exercises by the AISI network, or Japan's AISI red teaming guide released in Oct 2024) and the associated open-source testing and evaluation tools for large language models (e.g. Moonshot, Inspect).
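The contrast with benchmarking can be made concrete: a benchmark scores system outputs against pre-defined targets, rather than searching adversarially for failures. The sketch below assumes a hypothetical exact-match setup; the prompts, answers, and `system_under_test` function are made up for illustration.

```python
# Pre-defined prompt/target pairs -- the defining feature of a benchmark.
benchmark = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "2 + 2 = ?", "expected": "4"},
]

def system_under_test(prompt: str) -> str:
    """Hypothetical stand-in for the generative AI system's response."""
    canned = {
        "What is the capital of France?": "Paris",
        "2 + 2 = ?": "5",  # deliberate wrong answer for illustration
    }
    return canned.get(prompt, "")

def score(items: list[dict]) -> float:
    """Exact-match accuracy against the pre-defined targets."""
    correct = sum(
        system_under_test(item["prompt"]) == item["expected"]
        for item in items
    )
    return correct / len(items)

print(f"accuracy = {score(benchmark):.2f}")
```

Real benchmarks for generative systems typically replace exact match with semantic or model-graded scoring, since free-form text output rarely matches a reference string verbatim.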
Beyond providing guidance on these approaches, there is also value in describing their advantages and limitations, so that industry is aware of the complementary roles they play in the quality assessment of AI systems.
There are two existing standards related to the testing and evaluation of AI systems today (both in JWG 2): ISO/IEC 42119-2 on Testing of AI systems and ISO/IEC 17847 on Verification and validation analysis of AI systems. Both standards are agnostic to the type of AI system and provide high-level guidance on testing, e.g. testing in relation to the AI system life cycle, types of tests, and different V&V analysis approaches. However, they do not provide specific guidance on conducting quality assessment for prompt-based text-to-text systems that utilize generative AI, which is an important industry need and a gap to be filled.
The work will be coordinated with JWG 5 (given that text-to-text systems are a type of NLP system), including existing projects (e.g. ISO/IEC TR 23281, ISO/IEC AWI 23282 or the study item on corpora) and with the support of JWG 5’s AHG on Cross-group NLP content to incubate content that is properly aligned and complementary across projects.
This document references the following standards to ensure consistency in terminology and concepts:
• ISO/IEC 22989:2022 and ISO/IEC 23053:2022 for generic terminology and concepts for AI and ML
• ISO/IEC 22989:2022/Amd 1 and ISO/IEC 23053:2022/Amd 1 for terminology and concepts specific to generative AI
• ISO/IEC 5338:2023
• ISO/IEC TR 23281
Complements related generative AI proposals:
• ISO/IEC AWI 25568 Guidance on addressing risks in generative AI systems
• ISO/IEC AWI 25590 Guidance for output quality of generative AI applications
Builds on existing JWG 2 and 29119-series work:
• ISO/IEC 42119-2
• ISO/IEC 17847
• ISO/IEC 29119-1 to 4
Complements proposals referencing evaluation methods:
• ISO/IEC 25058/25059: elaborates on how the quality characteristics outlined here can be assessed
• ISO/IEC 23282: as its scope extensively covers quality assessment through evaluation for NLP systems (which include the systems targeted by this proposal), a tight interplay will be ensured and refined throughout the work in coordination with JWG 5, in order to avoid overlapping or diverging content and to rely on normative references to 23282 so as to focus efforts on gaps
• ISO/IEC 42106: for building on the concepts of benchmarking to draw connections among assessment approaches