aj_kotval

Meta: GenAI for Advertisers


Roles: Conversational AI design · Product design · Content design · Cross-functional collaboration · System prompt design · Personality engineering

This role involved working on a GenAI product aimed at large business advertisers.


I worked on several aspects of AI product design and development, including system prompt engineering, synthetic data creation, evals, and response voice and tone styling.


I'll be discussing three key challenges faced in improving the model, how these challenges were overcome, and my role.


Note: System prompts are not discussed because they are Meta's property and protected by NDAs.

The team

Conversation Designer · Content Designers · Product Designers · UX Researchers · Product Managers · Engineering · Legal

Personality

The AI assistant needed to be a professional assistant. It was not designed for engagement and would not be used for things like companionship, emotional discussions, etc. Its job was to serve as an advertising assistant, and it had to do that well.


In keeping with this, I conducted research and found these to be the most effective personality traits:

- Courteous

- Semi-friendly

- Competent

- Concise


Together, these traits would convey professional competence and intelligence, and show awareness of the user’s need for efficiency. Every golden path conversation demonstrated these traits so that the model would learn to respond with them.

Issues

— Issue:

Language

Language Issues

Problem: the model’s existing responses had several language issues. These included:

- Verbosity
- Clutter
- Overly formal language
- Non-conversational language


These language issues made the model difficult to use. Clutter and non-conversational phrasing made responses stilted and overly formal (e.g., "…consider changing the required amount to satisfy…"), hard to comprehend, and far too long for the value they returned on the user's attention.


Essentially, the original model was not easy to use; it tried to sound intelligent by using complex words and formal language without providing the information that users required. 


The key here was to first decide on the personality of the model. It needed to be professional, confident, and precise in its language since it was a professional tool and was not designed for frivolous chatter. It also did not need to be overly friendly since it wouldn't be used for idle chitchat. 


Once that was decided, I, along with the other two members of my content design team, created golden path conversations based on real responses from the model. These sample conversations were tested through UXR with real-world users to learn which style, grade level, length, and tone they preferred. This feedback was then used to train the model on what to communicate, how much to say, and how to say it.

Left: Example of aforementioned verbosity and clutter.  

Right: After being trained to be relevant and to avoid verbosity.

Left: An example of a response when user needs instructions. Note the cluttered nature of the response.

Right: Same response after the model was trained.

— Issue:

Inaccuracy

Inaccuracy Issues

The product needed to be accurate at all times. It was not a product designed to make people feel good, nor was it designed to keep users engaged. Its reason for existing was to provide accurate information and help users accomplish their advertising goals. Inaccurate information was a dealbreaker and would cause users to lose confidence in the product, which would affect usage.


The path to accuracy involved several false starts before our team pinned down what was required. It took a lot of reviewing and labeling of responses (lots and lots of labeling), and we used the LLM-as-a-judge method to build a judge that identifies the claims in a response and gauges their accuracy. Working with an engineer, I created system prompts that let the judge assess the accuracy of model responses, as well as synthetic data made up of both accurate and inaccurate entries.
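As a rough illustration of how a judge can be checked against labeled synthetic data, here is a minimal Python sketch. It is not the system used on the project: `LabeledResponse`, `judge_agreement`, and `stub_judge` are hypothetical names, the stub stands in for a real LLM call, and the sample entries are invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LabeledResponse:
    """One synthetic entry: a model response plus a human label."""
    text: str
    is_accurate: bool  # ground-truth label assigned by a human reviewer


def judge_agreement(
    judge: Callable[[str], bool],
    dataset: List[LabeledResponse],
) -> float:
    """Return the fraction of entries where the judge matches the human label."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    matches = sum(1 for item in dataset if judge(item.text) == item.is_accurate)
    return matches / len(dataset)


def stub_judge(response_text: str) -> bool:
    """Placeholder for a real LLM call; flags one made-up phrase as inaccurate."""
    return "guaranteed results" not in response_text.lower()


# Invented examples; real entries would come from labeled model responses.
synthetic_data = [
    LabeledResponse("You can edit the budget in the campaign settings.", True),
    LabeledResponse("This change gives you guaranteed results.", False),
    LabeledResponse("Your ad will reach every user worldwide.", False),
    LabeledResponse("Reports can be exported from the reporting page.", True),
]

agreement = judge_agreement(stub_judge, synthetic_data)
print(f"Judge agreed with human labels on {agreement:.0%} of entries")
```

Mixing accurate and inaccurate entries matters here: a judge that approves everything would still score well on a dataset containing only accurate responses.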


Alongside this, I also created rater guidelines for raters who would be conducting Human Feedback testing of the model. It was important to create guidelines that were clear, unambiguous, and easy to use since we would be using external raters for this part of the project.


— Issue:

Hallucinations

Hallucination Issues: Roleplaying

The model tended to roleplay by making promises it could not fulfill. This was a major issue since it could lead to customer dissatisfaction, as well as financial and reputational harm to the company. 


Collaborating with a Product Manager, I identified what the model could and could not do, and any future capabilities that would need to be taken into account. 


I then worked with an engineer to create a system prompt for a judge that would evaluate responses for signs of roleplay. Multiple iterations of the system prompt were created to achieve a high degree of success in identifying roleplay. While roleplay was quite often binary (present vs. absent), there were areas that were more in-between, and these presented an interesting challenge to overcome.
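One way to compare iterations of a judge prompt is to score each version against human labels, as in the Python sketch below. This is an assumption about how such a comparison could be set up, not the project's actual tooling: the label names (`roleplay`, `none`, `borderline`), the `precision_recall` helper, and the sample data are all hypothetical. In-between cases are set aside so they do not distort the binary scores.

```python
from typing import Dict, List, Tuple

# Each pair is (human_label, judge_label). Labels are hypothetical:
# "roleplay", "none", or "borderline" for the in-between cases.
Example = Tuple[str, str]


def precision_recall(pairs: List[Example]) -> Tuple[float, float]:
    """Precision and recall for the 'roleplay' class, skipping borderline items."""
    clear = [(h, j) for h, j in pairs if h != "borderline"]
    tp = sum(1 for h, j in clear if h == "roleplay" and j == "roleplay")
    fp = sum(1 for h, j in clear if h == "none" and j == "roleplay")
    fn = sum(1 for h, j in clear if h == "roleplay" and j != "roleplay")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented results for two prompt iterations run over the same five responses.
results_by_prompt: Dict[str, List[Example]] = {
    "prompt_v1": [
        ("roleplay", "roleplay"),
        ("roleplay", "none"),
        ("none", "roleplay"),
        ("none", "none"),
        ("borderline", "roleplay"),
    ],
    "prompt_v2": [
        ("roleplay", "roleplay"),
        ("roleplay", "roleplay"),
        ("none", "none"),
        ("none", "none"),
        ("borderline", "none"),
    ],
}

scores = {name: precision_recall(pairs) for name, pairs in results_by_prompt.items()}
for name, (precision, recall) in scores.items():
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

Borderline items are excluded from the scores but remain in the data, so they can be reviewed by hand.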

Left: Model roleplaying ability to refund customer. 

Right: Post-training with model not roleplaying.

While doing the above, I was also creating a guide for the raters who would test the model. As with the accuracy guide, it was important that the guidelines be clear, unambiguous, and easy to use.


To test the clarity of the guide, I ran a guerrilla test in which a mixed group of five people (2 CDs, 2 PDs, 1 Eng) did a small exercise, using the guide to spot roleplay in responses. This let me identify common issues with the rating guidelines and gather the group's input on which directions were unclear or misunderstood.
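A small test like this can also be summarized numerically. The Python sketch below flags responses on which testers disagreed, which may point to an unclear part of the guide. It is only an illustration: the `agreement_rate` and `unclear_items` helpers, the 0.8 threshold, and the ratings are all hypothetical, not data from the project.

```python
from collections import Counter
from typing import Dict, List


def agreement_rate(labels: List[str]) -> float:
    """Share of raters who chose the most common label for one response."""
    if not labels:
        raise ValueError("labels must not be empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def unclear_items(ratings: Dict[str, List[str]], threshold: float = 0.8) -> List[str]:
    """Return IDs of responses whose agreement falls below the threshold."""
    return [rid for rid, labels in ratings.items() if agreement_rate(labels) < threshold]


# Invented ratings from five testers; each list holds one label per tester.
ratings = {
    "response_1": ["roleplay", "roleplay", "roleplay", "roleplay", "roleplay"],
    "response_2": ["roleplay", "roleplay", "none", "none", "roleplay"],
    "response_3": ["none", "none", "none", "none", "roleplay"],
}

flagged = unclear_items(ratings)
print("Responses to revisit in the guide:", flagged)
```

Low agreement on a response does not show which instruction caused the confusion, so the flagged items are a starting point for a conversation with the testers.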


This testing helped create a robust guide that could be confidently used by anyone, and could generate results that would help drive the product forward.

Final Released Product