Testing the AI
/Testing the AI
AI development is moving quickly, so rather than focus on where they stand at this point in time, we'll introduce a few methods you and your team can use to test the efficacy of AI models to your specific use case.
Co-design
Co-design is an important design methodology where you bring your community in early in the process of building or deploying a tech tool. By inviting others in early, you can greatly reduce the risk of building or buying something that just won't work for your community. For example, you might learn that an AI tool you were planning to use doesn't work with the language your community speaks.
The Situate AI Guidebook: Co-Designing a Toolkit to Support Multi-Stakeholder Early-stage Deliberations Around Public Sector AI Proposals
Intersectional Spaces of Participation: Inclusive, Resilient, Embedded (INSPIRE): "INSPIRE will test co-design of participatory spaces in seven pilots involving different groups historically marginalised in policymaking and public life."
OECD case studies of open co-design processes led by government agencies, including NASA.
User testing
Like co-design, user testing tools and methodologies help you see how your intended users interact with a given platform. The Citizen User Testing Group methodology pays people from your intended community to participate in the user testing program to ensure representative results.
User testing resources:
The free CUT Group book lays out how to start and run a Citizen User Testing Group.
A collection of topic- and place-based user testing groups, primarily across North America.
Benchmarks
Benchmarks are a fairly empirical way to test technical systems by a standardized rubric. This way evaluators can test a variety of AI models, for example, using the same scoring system. Rather than become an expert in how AI works, you can refer to trustworthy and up-to-date benchmarks.
AI benchmarking resources:
Humanity's Last Exam benchmarks large language models with 2,500 questions covering a wide range of subjects.
MLCommons's AI Risk & Reliability working group to "support community development of AI risk and reliability tests and organize definition of research- and industry-standard AI safety benchmarks based on those tests."
MLCommons Agentic Reliability Evaluation Standard.
MLCommons AILuminate benchmark "to measure AI reliability across new models, languages, and tools."
In LM Arena, humans vote on chatbots' performance against one another in categories like text generation and coding.
Red-teaming
Red-teaming is a cybersecurity practice where your team or external collaborators (like "ethical hackers") try attacking your technical system to identify problems, before someone less aligned with your interests can do the same. It can be deeply technical or as simple as a social attack vector, where you look for security weaknesses in human parts of the process, as well.
Red-teaming resources:
Red Teaming Artificial Intelligence for Social Good: UNESCO's playbook
AI Vulnerability Database: An open-source, extensible knowledge base of AI failures
Red teaming large language models (LLMs) for resilience to scientific disinformation
Humane Intelligence runs "Bias Bounty" events and red-teaming competitions to evaluate AI models against standards
AI Audits
Rather than rely on AI principles or take AI developers' word for their trustworthiness, AI auditors look at the outputs of an AI model to determine whether it meets standards of values like fairness. By comparing the AI's outputs with other sources, such as demographic data, data scientists can begin to determine whether an AI model is treating everyone fairly, for example.
AI auditing resources:
Over 20 AI auditors in the Civic Tech Field Guide
Like AI itself, AI auditing is very much an emerging practice. There aren't clear standards for AI audits, and auditors themselves "can operate without quality assurance or accreditation".
Next: Running a Process on a Digital Participation Platform
Previous: Integrating in-person engagement with hybrid platforms