Generative AI is becoming the new norm: it is widely used and increasingly accessible to the public through platforms like ChatGPT and Meta AI, the latter appearing inside social media apps such as WhatsApp and Instagram Messenger.
Although these models are fundamentally transformers that break text into tokens and predict the next token, their implications and applications are vast. However, current GPT models lack human-like understanding, which can cause reliability issues among other problems. At the same time, agentic AI is on the rise, and together these trends highlight the importance of a well-defined testing approach.
I wanted to ask:
- What patterns or testing strategies are you following beyond the basic ones?
- What’s your approach to identifying and fixing the following issues? Do you follow any checklists?
- AI Hallucination
- Fairness and Bias
- Security & Ethical Issue
- Coherence and relevance
- Robustness and Reliability
- Explainability and Interpretability
- Others you have identified
Here are some of my observations:
Example 1: AI Hallucination
Issue: The model generates factually incorrect or nonsensical outputs; the response contains unreliable data, yet it sounds plausible or true.
Solution: Fact-checking, Human-in-the-loop, Prompt engineering, Training data quality, Model fine-tuning, Post-processing
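As a minimal sketch of what automated fact-checking can look like, the snippet below runs a small regression suite of prompts with known ground-truth facts. `FACT_CASES`, the `call_model` stub, and the naive substring matching are all illustrative assumptions, not a real harness; in practice you would plug in your actual model call and a more robust verifier, or route failures to a human-in-the-loop review.

```python
# Hypothetical hallucination regression check: prompts paired with known
# ground-truth facts, flagging outputs where an expected fact is missing.

FACT_CASES = [
    {"prompt": "What year did Apollo 11 land on the Moon?",
     "must_contain": ["1969"]},
    {"prompt": "What is the chemical symbol for gold?",
     "must_contain": ["Au"]},
]

def call_model(prompt: str) -> str:
    # Stub; replace with your real model/API call.
    return "Apollo 11 landed in 1969."

def run_fact_checks() -> list[str]:
    failures = []
    for case in FACT_CASES:
        answer = call_model(case["prompt"])
        # Naive case-insensitive substring check; a real suite would use a
        # stronger verifier (entailment model, retrieval, or human review).
        if not all(f.lower() in answer.lower() for f in case["must_contain"]):
            failures.append(f"Possible hallucination for prompt: {case['prompt']!r}")
    return failures

if __name__ == "__main__":
    for failure in run_fact_checks():
        print(failure)
```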
Example 2: Bias and Fairness
Issue: Because of patterns in the training data, the model generates outputs that unfairly favor certain groups.
Solution: Bias audits, Fairness metrics, Diverse training data
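One lightweight way to probe for this, as a sketch rather than a full bias audit, is a counterfactual test: run the same prompt template with only the demographic term changed and compare a simple output statistic across groups. The template, group list, statistic, and `call_model` stub below are assumptions for illustration; a large gap between groups is a signal to investigate with proper fairness metrics, not proof of bias.

```python
# Minimal counterfactual bias probe: identical prompts except for one
# demographic term, comparing a crude statistic (average response length).

TEMPLATE = "Write a short performance review for Alex, a {group} engineer."
GROUPS = ["male", "female", "non-binary"]

def call_model(prompt: str) -> str:
    return "Stub response."  # Replace with your real model/API call.

def probe_bias(samples_per_group: int = 5) -> dict[str, float]:
    stats = {}
    for group in GROUPS:
        outputs = [call_model(TEMPLATE.format(group=group))
                   for _ in range(samples_per_group)]
        # Average word count per group; swap in sentiment or refusal rate
        # for a more meaningful signal.
        stats[group] = sum(len(o.split()) for o in outputs) / len(outputs)
    return stats

print(probe_bias())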
Example 3: Adherence to Instructions
Issue: With tools like Meta AI agents and similar offerings in Salesforce, we need to check whether the response adheres to the given instructions; it sometimes fails to follow the guidelines and guardrails.
Solution: It might be an issue with the instructions themselves, but we need to go back to basics and test against each instruction to check whether it is followed. Doing this manually can become tedious, so I would welcome alternatives; one option is sketched below.
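One alternative to fully manual checking is to encode each instruction as a small programmatic check and run every response through all of them. The rules here (word limit, mandatory disclaimer, banned competitor name) are invented examples, not rules from any real agent:

```python
# Encode each guardrail as a predicate; a response must satisfy all of them.
import re

RULES = {
    "stays_under_100_words": lambda text: len(text.split()) <= 100,
    "includes_disclaimer": lambda text: "not financial advice" in text.lower(),
    "avoids_competitor_names": lambda text: not re.search(r"\bAcmeCorp\b", text),
}

def check_adherence(response: str) -> list[str]:
    """Return the names of all instructions the response violates."""
    return [name for name, rule in RULES.items() if not rule(response)]

violations = check_adherence("Buy now! AcmeCorp is worse than us.")
print(violations)  # ['includes_disclaimer', 'avoids_competitor_names']
```

Fuzzier instructions (tone, politeness, topic boundaries) will not reduce to simple predicates like these; for those, an LLM-as-judge or human review step is a common fallback.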
Example 4: Responses Outside Knowledge Article Boundaries
Issue: GPT models used as chatbots over a fixed set of knowledge articles sometimes produce answers, or cite references, that fall outside that set.
Solution: Coherence metrics, Prompt design, Feedback
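A minimal groundedness check, assuming the chatbot can be made to return the IDs of the knowledge articles it relied on, is to reject any answer that cites a source outside the approved set. `ALLOWED_ARTICLES` and the citation format are assumptions for illustration:

```python
# Reject answers that cite articles outside the approved knowledge set,
# or that cite nothing at all.

ALLOWED_ARTICLES = {"KB-101", "KB-102", "KB-205"}

def is_grounded(cited_ids: set[str]) -> bool:
    """True only if every cited article is in the approved knowledge set."""
    return bool(cited_ids) and cited_ids <= ALLOWED_ARTICLES

print(is_grounded({"KB-101"}))            # True
print(is_grounded({"KB-101", "KB-999"}))  # False: cites an unknown source
print(is_grounded(set()))                 # False: no citation at all
```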
Example 5: Chain of Thought
Issue: In some cases, the model assumes continuity with earlier conversations inside the context window, which can introduce unnecessary or misleading references.
Solution: Instructions should tell the model to cross-verify the current context and add a note whenever it relies on assumed continuity; a sketch of a test for this follows.
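As a rough probe for this kind of unwanted carry-over, you can ask a question in a brand-new session and flag any phrasing that refers back to nonexistent earlier turns. The phrase list and `call_model` stub are assumptions; a phrase match is only a heuristic signal to review the transcript:

```python
# Heuristic carry-over probe: a fresh session should never reference
# "earlier" conversation turns, because there are none.

CARRYOVER_PHRASES = ["as mentioned earlier", "as we discussed", "previously you said"]

def call_model(prompt: str, history: list[str] | None = None) -> str:
    return "Stub response."  # Replace with your real model/API call.

def flags_carryover(prompt: str) -> bool:
    """Check whether a fresh-session answer refers to prior turns that never happened."""
    fresh_answer = call_model(prompt, history=None)
    return any(p in fresh_answer.lower() for p in CARRYOVER_PHRASES)

print(flags_carryover("Summarise our refund policy."))
```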
Most of these issues can be mitigated with effective prompt engineering. However, I am curious about your methods for surfacing these issues and any other observations you have made.