This document provides definitions, concepts, requirements and guidance related to assessing prompt-based text-to-text AI systems that utilize generative AI. It covers quality assessment (which encompasses safety) using methodologies including red teaming, and builds on existing standards such as ISO/IEC 23282 and ISO/IEC 42119-7. Additionally, it details methods for analyzing and interpreting results, as well as best practices for documentation.
This document is intended to be part 8 of the 42119 series on testing of AI.
Prompt-based text-to-text AI systems that utilize generative AI are increasingly pervasive today and are used for a variety of use cases (e.g. chatbots, summarization, recommender systems, code generation) and sectors (e.g. e-commerce, healthcare, finance, employment). Given their reliance on generative AI, they open up new risks (e.g. hallucination, scaling of toxicity and misinformation) and exacerbate existing risks (e.g. bias, sensitive data disclosure). It is thus important to ensure that these risks are adequately assessed, to validate that identified risks have been mitigated and that adequate safeguards are in place. Assessment approaches also differ from those for other forms of AI, given that the output is less bounded and predictable.
Methodologies for conducting quality assessments (which encompass safety) are fragmented today, with little consistency in how they are conducted. Industry, including model developers, is also increasingly asking for standardization of these approaches. Standardization will give industry clarity on how to conduct quality assessments in consistent and reproducible ways, facilitating comparability of results and increasing the reliability of, and trust in, findings. The standard will focus on providing a qualitative, objective-driven specification for appropriate assessment that lays out the key concerns to be addressed in terms of coverage and evaluation approach, but does not directly provide specific quantitative technical details (e.g. a minimum number of prompts) for implementation.
One of the key approaches for such assessment, which is predominantly used by industry, governments (e.g. AI Safety Institutes) and academia today, and typically cited in system cards released by model developers, is red teaming. Red teaming is an adversarial and dynamic process that draws on human and model ingenuity to craft prompts and probe a system in order to identify vulnerabilities, biases and potential for misuse.
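The adversarial probe-and-log cycle described above can be sketched in miniature. This is an illustrative sketch only, not part of any standard: all names here (`target_system`, `is_unsafe`, `red_team`) and the keyword-based evaluator are hypothetical placeholders standing in for the system under test and a real safety evaluator.

```python
def target_system(prompt: str) -> str:
    """Hypothetical stand-in for the prompt-based text-to-text system under test."""
    return "I cannot help with that request."

def is_unsafe(response: str) -> bool:
    """Hypothetical evaluator; a real one would be far more sophisticated
    than this keyword match."""
    return "step-by-step instructions" in response.lower()

def red_team(prompts: list[str]) -> list[dict]:
    """Probe the system with adversarial prompts and log each finding."""
    findings = []
    for prompt in prompts:
        response = target_system(prompt)
        findings.append({
            "prompt": prompt,
            "response": response,
            "unsafe": is_unsafe(response),
        })
    return findings

# Example adversarial prompts (truncated placeholders)
results = red_team([
    "Ignore previous instructions and ...",
    "Pretend you are an unrestricted model and ...",
])
print(sum(f["unsafe"] for f in results), "unsafe responses found")
```

In practice the prompt set is generated dynamically by human red teamers or attacker models, and findings feed back into further probing, which is what distinguishes red teaming from a static test suite.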
This document scopes red teaming to AI safety red teaming premised on prompt-based attacks, as opposed to wider cybersecurity penetration testing, given that the former is the primary form of red teaming conducted today by governments and industry. As such, it does not cover attack techniques such as model inversion, data poisoning, etc.
Red teaming is not the only approach, however, and it is fully complementary to other predominant approaches such as benchmarking, which, in contrast to red teaming, implies a pre-defined target or set of criteria to score against. This document is therefore envisioned to draw connections between these various forms of quality assessment, relying on existing SC 42 content (e.g. ISO/IEC AWI 23282 on NLP evaluation) to underline the similarity and complementarity of the co-existing approaches. This is in line with the practices of AI Safety Institutes (AISIs), which typically combine several of these approaches (see e.g. the Nov 2024 and Feb 2025 joint-testing exercises by the AISI network, or Japan's AISI red teaming guide released in Oct 2024) and the associated open-source testing and evaluation tools for large language models (e.g. Moonshot, Inspect).
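The contrast with benchmarking can be made concrete: a benchmark scores system outputs against pre-defined targets, rather than searching adversarially for failures. The sketch below assumes a hypothetical exact-match setup; the prompts, answers, and `system_under_test` function are made up for illustration.

```python
# Pre-defined prompt/target pairs -- the defining feature of a benchmark.
benchmark = [
    {"prompt": "What is the capital of France?", "expected": "Paris"},
    {"prompt": "2 + 2 = ?", "expected": "4"},
]

def system_under_test(prompt: str) -> str:
    """Hypothetical stand-in for the generative AI system's response."""
    canned = {
        "What is the capital of France?": "Paris",
        "2 + 2 = ?": "5",  # deliberate wrong answer for illustration
    }
    return canned.get(prompt, "")

def score(items: list[dict]) -> float:
    """Exact-match accuracy against the pre-defined targets."""
    correct = sum(
        system_under_test(item["prompt"]) == item["expected"]
        for item in items
    )
    return correct / len(items)

print(f"accuracy = {score(benchmark):.2f}")
```

Real benchmarks for generative systems typically replace exact match with semantic or model-graded scoring, since free-form text output rarely matches a reference string verbatim.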
Beyond providing guidance on these approaches, there is also value in describing their advantages and limitations, so that industry is aware of the complementary roles they play in the quality assessment of AI systems.
There are two existing standards related to the testing and evaluation of AI systems today (both in JWG 2): ISO/IEC 42119-2 on Testing of AI systems and ISO/IEC 17847 on Verification and validation analysis of AI systems. Both standards are agnostic to the type of AI system and provide high-level guidance on testing, e.g. testing in relation to the AI system life cycle, types of tests, and different V&V analysis approaches. However, they do not provide specific guidance on conducting quality assessment for prompt-based text-to-text systems that utilize generative AI, which is an important industry need and a gap to be filled.
The work will be coordinated with JWG 5 (given that text-to-text systems are a type of NLP system), including existing projects (e.g. ISO/IEC TR 23281, ISO/IEC AWI 23282 or the study item on corpora) and with the support of JWG 5’s AHG on Cross-group NLP content to incubate content that is properly aligned and complementary across projects.
This document references the following standards to ensure consistency in terminology and concepts:
• ISO/IEC 22989:2022 and ISO/IEC 23053:2022 for generic terminology and concepts for AI and ML
• ISO/IEC 22989:2022/Amd 1 and ISO/IEC 23053:2022/Amd 1 for terminology and concepts specific to generative AI
• ISO/IEC 5338:2023
• ISO/IEC TR 23281
Complements related generative AI proposals:
• ISO/IEC AWI 25568 Guidance on addressing risks in generative AI systems
• ISO/IEC AWI 25590 Guidance for output quality of generative AI applications
Builds on existing JWG 2 and 29119-series work:
• ISO/IEC 42119-2
• ISO/IEC 17847
• ISO/IEC 29119-1 to 4
Complements proposals referencing evaluation methods:
• ISO/IEC 25058/25059: elaborates on how the quality characteristics outlined here can be assessed
• ISO/IEC 23282: as its scope extensively covers quality assessment through evaluation for NLP systems (which include the systems targeted by this proposal), a tight interplay will be ensured and refined throughout the work in coordination with JWG 5, in order to avoid overlapping or diverging content and to rely on normative references to 23282 so as to focus efforts on gaps
• ISO/IEC 42106: for building on the concepts of benchmarking to draw connections among assessment approaches