Testing the AI

 

Testing the AI

AI development is moving quickly, so rather than focus on where they stand at this point in time, we'll introduce a few methods you and your team can use to test the efficacy of AI models to your specific use case.

Co-design

Co-design is an important design methodology where you bring your community in early in the process of building or deploying a tech tool. By inviting others in early, you can greatly reduce the risk of building or buying something that just won't work for your community. For example, you might learn that an AI tool you were planning to use doesn't work with the language your community speaks.

Co-design Resources: 

  • The Situate AI Guidebook: Co-Designing a Toolkit to Support Multi-Stakeholder Early-stage Deliberations Around Public Sector AI Proposals

  • Intersectional Spaces of Participation: Inclusive, Resilient, Embedded (INSPIRE): "INSPIRE will test co-design of participatory spaces in seven pilots involving different groups historically marginalised in policymaking and public life."

  • OECD case studies of open co-design processes led by government agencies, including NASA.

User testing

Like co-design, user testing tools and methodologies help you see how your intended users interact with a given platform. The Citizen User Testing Group methodology pays people from your intended community to participate in the user testing program to ensure representative results.

User testing resources:

Benchmarks

Benchmarks are a fairly empirical way to test technical systems by a standardized rubric. This way evaluators can test a variety of AI models, for example, using the same scoring system. Rather than become an expert in how AI works, you can refer to trustworthy and up-to-date benchmarks.

AI benchmarking resources:

  • Humanity's Last Exam benchmarks large language models with 2,500 questions covering a wide range of subjects.

  • MLCommons's AI Risk & Reliability working group to "support community development of AI risk and reliability tests and organize definition of research- and industry-standard AI safety benchmarks based on those tests."

  • MLCommons Agentic Reliability Evaluation Standard.

  • MLCommons AILuminate benchmark "to measure AI reliability across new models, languages, and tools."

  • In LM Arena, humans vote on chatbots' performance against one another in categories like text generation and coding.

Red-teaming

Red-teaming is a cybersecurity practice where your team or external collaborators (like "ethical hackers") try attacking your technical system to identify problems, before someone less aligned with your interests can do the same. It can be deeply technical or as simple as a social attack vector, where you look for security weaknesses in human parts of the process, as well.

Red-teaming resources:

AI Audits

Rather than rely on AI principles or take AI developers' word for their trustworthiness, AI auditors look at the outputs of an AI model to determine whether it meets standards of values like fairness. By comparing the AI's outputs with other sources, such as demographic data, data scientists can begin to determine whether an AI model is treating everyone fairly, for example.

AI auditing resources:

Like AI itself, AI auditing is very much an emerging practice. There aren't clear standards for AI audits, and auditors themselves "can operate without quality assurance or accreditation".

Next: Running a Process on a Digital Participation Platform
Previous:
Integrating in-person engagement with hybrid platforms