aj_kotval

Meta: GenAI for Advertisers


Roles: Conversational AI design · Product design · Content design · Cross-functional collaboration · System prompt design · Personality engineering

This role involved working on a GenAI product aimed at large business advertisers.


I worked on multiple aspects of AI product development, including system prompt engineering, synthetic data creation, evals, and response voice-and-tone styling.


I'll discuss three key challenges we faced in improving the model, how we overcame them, and my role in each.


Note: Pardon the lack of visuals and metrics. This product has been released to a limited set of early users; because of that limited release and the NDA I signed, I'm discussing the work at a high level, without visuals, at this time.

The team

Content Designers · Product Designers · UX Researchers · Product Managers · Engineering · Legal

Voice & Tone

Professional · Semi-formal · Efficient

Issues

— Issue: Language

Problem: the model's existing responses had several issues:

  • Verbosity
  • Clutter
  • Overly formal language
  • Non-conversational language

These language issues made the existing model a difficult experience for users. Clutter and non-conversational language meant that responses were stilted, overly formal (e.g., "…consider changing the required amount to satisfy…"), difficult to comprehend, and lengthy for no real return on the user's attention.


Essentially, the original model was not easy to use; it tried to sound intelligent by using complex words and formal language without providing the information that users required. 


The key was to first decide on the model's personality. As a professional tool, it needed to be professional, confident, and precise in its language; it was not designed for idle chitchat, so it also did not need to be overly friendly.


Once that was decided, I, along with the other two members of my content design team, created golden-path conversations based on real responses from the model. These sample conversations were tested through UXR with real-world users to learn what style, grade level, length, and tone were preferred. That feedback was then used to train the model on what to communicate, how much to communicate, and how to communicate it.
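To make this concrete, here's a minimal, hypothetical sketch of how a golden-path example might be structured for review and later training. The field names, sample responses, and annotations are illustrative, not the actual data we used.

```python
# Hypothetical golden-path example: a real (verbose) model response paired
# with the preferred rewrite, annotated with the style attributes tested in UXR.
golden_path_example = {
    "user_prompt": "Why did my campaign stop delivering?",
    "original_response": (
        "Upon review of the parameters associated with your campaign, "
        "you may wish to consider changing the required amount to satisfy "
        "the minimum budget threshold stipulated for delivery."
    ),
    "preferred_response": (
        "Your campaign stopped delivering because its budget is below the "
        "minimum required. Raising the daily budget to the minimum will "
        "restart delivery."
    ),
    "annotations": {
        "tone": "professional, semi-formal",
        "grade_level": 8,     # reading level preferred in UXR testing
        "length": "short",    # two sentences, no filler
    },
}
```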

— Issue: Inaccuracy

The product needed to be accurate at all times. It was not a product designed to make people feel good, nor was it designed to keep users engaged. Its reason for existing was to provide accurate information and help users accomplish their advertising goals. Inaccurate information was a dealbreaker and would cause users to lose confidence in the product, which would affect usage.


The path to accuracy was strewn with starts and stops for our team until we were able to drill down to what was required. This involved a lot of reviewing, a lot of labeling of responses (lots and lots of labeling), and the LLM-as-a-judge method: creating a judge model to identify claims and gauge their accuracy. Working with an engineer, I created system prompts that allowed the judge to discern the accuracy of model responses, as well as synthetic data composed of both accurate and inaccurate entries.
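As an illustration, here's a minimal sketch of the LLM-as-a-judge pattern. It assumes a generic `complete(system=..., user=...)` chat-completion call; the prompt wording, labels, and synthetic entries are all illustrative, since the real prompts and data are covered by NDA.

```python
# Minimal LLM-as-a-judge sketch. `complete` stands in for whatever
# chat-completion API is available; it is an assumed interface, not a real one.
JUDGE_SYSTEM_PROMPT = """\
You are an accuracy judge for an advertising assistant.
Given a model response, list each factual claim it makes, then label
each claim ACCURATE or INACCURATE against the provided reference facts.
Output one line per claim in the form: <label> | <claim>.
"""

def judge_response(complete, reference_facts: str, response: str) -> list[tuple[str, str]]:
    """Ask the judge model to extract claims and label their accuracy."""
    raw = complete(
        system=JUDGE_SYSTEM_PROMPT,
        user=f"Reference facts:\n{reference_facts}\n\nModel response:\n{response}",
    )
    labeled = []
    for line in raw.splitlines():
        if "|" in line:
            label, claim = line.split("|", 1)
            labeled.append((label.strip(), claim.strip()))
    return labeled

# Synthetic entries mix known-accurate and known-inaccurate responses so the
# judge itself can be validated before it is trusted to grade the model.
synthetic_entries = [
    {"response": "Your daily budget is $50.", "expected_label": "ACCURATE"},
    {"response": "Your campaign is guaranteed 1M impressions.", "expected_label": "INACCURATE"},
]
```

Scoring the judge against entries with known labels is what establishes trust in it before it grades real model output.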


Alongside this, I created guidelines for the raters who would conduct human-feedback testing of the model. It was important that the guidelines be clear, unambiguous, and easy to use, since we would be relying on external raters for this part of the project.


— Issue: Hallucinations (Roleplay)

The model tended to roleplay by making promises it did not have the ability to fulfill. This was a major issue since it could lead to customer dissatisfaction, as well as financial and reputational harm to the company. 


Collaborating with a Product Manager, I identified what the model could and could not do, and any future capabilities that would need to be taken into account. 


I then worked with an engineer to create a system prompt that would evaluate responses for signs of roleplay. We iterated on the prompt multiple times to achieve a high success rate in identifying roleplay. While roleplay was often binary (present vs. absent), some cases fell in between, and these presented an interesting challenge to overcome.
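Here's a sketch of what such a roleplay judge might look like, with a third label to catch the in-between cases. The capability list, labels, and `complete` interface are assumptions for illustration, not the actual prompt.

```python
# Hypothetical roleplay judge. The prompt wording and capability list are
# illustrative; the real system prompt is covered by NDA.
ROLEPLAY_JUDGE_PROMPT = """\
You are reviewing responses from an advertising assistant.
The assistant CAN: explain metrics, suggest budget and targeting changes.
The assistant CANNOT: change settings itself, issue refunds, or guarantee results.
Label the response:
  ROLEPLAY   - it promises or claims to perform an action it cannot perform
  AMBIGUOUS  - it implies an action or outcome without clearly promising it
  CLEAN      - it stays within its actual capabilities
Respond with the label only.
"""

def judge_roleplay(complete, response: str) -> str:
    """Classify a response; default unparseable output to AMBIGUOUS for review."""
    label = complete(system=ROLEPLAY_JUDGE_PROMPT, user=response).strip()
    return label if label in {"ROLEPLAY", "AMBIGUOUS", "CLEAN"} else "AMBIGUOUS"
```

Routing AMBIGUOUS cases to human review is one way to handle responses that imply, but don't outright promise, an action the model can't perform.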


While doing the work described above, I was also creating a guide for the raters who would test the model later. As with the accuracy guide, it was important that the guidelines be clear, unambiguous, and easy to use. To test the guide's clarity, I ran guerrilla testing: a mixed group (2 content designers, 2 product designers, 1 engineer) did a small exercise spotting roleplay in responses using the guide. This let me identify common issues and gather input on anything that was unclear or misunderstood in the directions. The result was a robust guide that anyone could use with confidence, generating results that would help drive the product forward.
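One lightweight way to turn a pilot like this into actionable guide fixes is to compare each participant's labels against the intended answer and flag low-agreement items. The sketch below is hypothetical (invented data and helper), not the actual analysis we ran.

```python
from collections import Counter

# Hypothetical pilot results: each participant's label for each response,
# alongside the label the guide was intended to produce.
pilot_labels = {
    "response_1": {"intended": "ROLEPLAY",
                   "raters": ["ROLEPLAY", "ROLEPLAY", "AMBIGUOUS", "ROLEPLAY", "ROLEPLAY"]},
    "response_2": {"intended": "CLEAN",
                   "raters": ["CLEAN", "ROLEPLAY", "CLEAN", "AMBIGUOUS", "CLEAN"]},
}

def flag_confusing_items(pilot, threshold=0.8):
    """Return items where agreement with the intended label falls below the
    threshold, pointing to guideline sections that need clearer wording."""
    flagged = []
    for item, data in pilot.items():
        votes = Counter(data["raters"])
        agreement = votes[data["intended"]] / len(data["raters"])
        if agreement < threshold:
            flagged.append((item, agreement, votes.most_common(1)[0][0]))
    return flagged

print(flag_confusing_items(pilot_labels))
```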

More to come as things get rolled out…