I worked on a GenAI product designed for large business advertisers, tackling everything from system prompt engineering and synthetic data creation to evaluations and shaping the model’s voice and tone.
I’ll discuss three major challenges we faced in improving the model, the solutions we developed, and the impact of my contributions.
· Conversational AI design
· System prompt design
· Personality engineering
· Product design
· Content design
Defining the AI's personality
The AI assistant lacked a unified communication style. Its responses fluctuated between excessively verbose and robotic, depending on the input. This inconsistency weakened user trust and made the assistant feel unreliable, especially in a professional context where precision, tone, and efficiency directly influence user perception.
The goal was to define a consistent personality framework that would guide every aspect of the model’s language and behavior, aligning it with its purpose as a professional advertising assistant.
Process
To build this framework, I conducted a comparative review of professional assistant personas across multiple AI products. The focus was on identifying the traits that projected intelligence and reliability without being cold or distant.
Through this research, I identified four core traits to anchor the assistant’s communication style: courteous, semi-friendly, competent, and concise.
These traits became the foundation for golden path conversations – sample dialogs representing ideal model behavior. I wrote and refined these with two other content designers, testing variations in tone, sentence length, and formality through UX research sessions with real users.
The testing focused on four key measures:
The data from these sessions directly informed prompt and model tuning decisions, ensuring that tone, length, and lexical choices consistently matched user expectations for a professional assistant.
The final output was a personality specification embedded in the system prompt and supported by content guidelines used across all response types. Every model response was required to reflect these traits, and golden path examples were used as benchmarks for QA and fine-tuning.
This created a shared language standard across design, product, and engineering, reducing subjective interpretation during iteration and evaluation.
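For illustration, below is a minimal sketch of how a personality specification like this might be encoded as a reusable system-prompt block. The trait wording, structure, and helper function are assumptions made for the sketch, not the production prompt.

```python
# Hypothetical sketch: the four personality traits expressed as a reusable
# system-prompt block. The wording and helper below are illustrative only.

PERSONALITY_SPEC = """\
You are a professional advertising assistant.
Voice and tone requirements:
- Courteous: acknowledge the request without filler or excessive apology.
- Semi-friendly: warm but businesslike; no slang, no exclamation points.
- Competent: state recommendations directly and reference the data behind them.
- Concise: lead with the answer; keep responses to a few short sentences.
Never promise actions you cannot perform.
"""

def build_system_prompt(task_instructions: str) -> str:
    """Combine the shared personality spec with task-specific instructions."""
    return f"{PERSONALITY_SPEC}\n{task_instructions}"
```

Keeping the specification in one shared block like this is one way to let design, product, and engineering iterate against the same definition of the assistant’s voice.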
Issue #1:
Response issues
The model’s responses were verbose, cluttered, and overly formal, often using language that sounded intelligent but failed to communicate efficiently. Phrases like “please consider adjusting the required parameters to achieve the desired outcome” made simple tasks feel dense and bureaucratic.
This over-formality created a poor user experience where users had to parse long, complex sentences to extract basic information. The communication style projected effort instead of expertise. It slowed workflows, caused cognitive fatigue, and made the assistant feel mechanical rather than capable.
The challenge was clear: How do we make the AI sound intelligent but not academic, professional but human-readable?
Process
Once the assistant’s personality traits (courteous, semi-friendly, competent, concise) were defined, I led a focused effort with two content designers to rebuild the assistant’s communication model.
We began by writing golden path conversations – ideal interactions that demonstrated how the assistant should communicate across a variety of real user scenarios. These examples helped us capture the right balance of tone, precision, and brevity before writing prompts.
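As a rough illustration of what a golden path conversation can look like when captured for reuse (the scenario, dialog text, and field names here are invented for the sketch, not taken from the product):

```python
# Illustrative only: one golden path conversation stored as structured turns,
# so the same artifact serves as a writing reference and a QA benchmark.
# The scenario, dialog text, and field names are invented for this sketch.

golden_path_example = {
    "scenario": "advertiser asks why weekly spend dropped",
    "turns": [
        {"role": "user", "content": "Why did my spend drop this week?"},
        {
            "role": "assistant",
            "content": (
                "Spend fell because two ad groups hit their daily budget caps "
                "by mid-afternoon. Raising those caps, or shifting budget from "
                "underspending ad groups, would restore delivery."
            ),
            "trait_notes": ["concise", "competent", "no hedging filler"],
        },
    ],
}
```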
To validate the language direction, we partnered with UX research to test responses with real representative users to determine:
We tested variations across tone, sentence length, and structural complexity, collecting both quantitative ratings and qualitative feedback.
Solution
Insights from testing directly informed the refinement of system prompts and training data.
The result was a communication framework that produced responses that were concise, confident, and context-aware, mirroring how a skilled human professional would respond under time pressure.
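To make the shift concrete, here is a hedged sketch of how verbose/concise pairs could be captured to steer prompt refinement and training data; the “before” line echoes the phrasing quoted earlier, and the “after” line is an invented example of the target style.

```python
# Sketch: paired examples used to steer the model away from bureaucratic
# phrasing. Both versions are illustrative, not actual training data.

REWRITE_EXAMPLES = [
    {
        "before": (
            "Please consider adjusting the required parameters to achieve "
            "the desired outcome."
        ),
        "after": "Increase the daily budget so your ads keep running all day.",
    },
]
```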
Impact
In essence, the assistant moved from sounding like a document to sounding like a colleague who knows exactly what you need.
Issue #2:
Inaccuracy
Accuracy was mission-critical. Unlike a casual assistant or entertainment chatbot, this product existed to help users make real advertising decisions – where incorrect data or misleading information could directly impact business outcomes, financial results, and user trust.
Early tests revealed significant issues: the model sometimes fabricated metrics, misstated campaign details, or overconfidently presented guesses as facts. These failures weren’t just usability concerns; they were credibility risks.
The design challenge was to create a system where accuracy was measurable, improvable, and sustainable, without slowing iteration or adding unnecessary manual overhead.
Process
Our approach combined automation, human feedback, and prompt design to systematically drive accuracy improvements.
Solution
By merging automated judgment (LLM-as-a-judge) with structured human evaluation, we built a feedback loop that continuously identified and corrected factual weaknesses in the model.
This ecosystem became our accuracy evaluation framework – a repeatable process that could scale across new product areas and model versions.
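As a sketch of what one pass through that loop could look like in code (the judge rubric, model client, and threshold below are assumptions, not the production framework):

```python
# Hypothetical sketch of the LLM-as-a-judge step in the accuracy loop.
# The rubric, client interface, and threshold are assumptions; low-confidence
# verdicts are routed to human raters rather than trusted automatically.

import json

JUDGE_RUBRIC = """\
You are grading an advertising assistant's answer for factual accuracy.
Compare the answer against the reference campaign data provided.
Return JSON: {"verdict": "accurate" or "inaccurate", "confidence": 0-1,
"errors": ["..."]}.
"""

def judge_response(client, question: str, answer: str, reference: str) -> dict:
    """Ask a judge model to grade one response against reference data."""
    prompt = (
        f"{JUDGE_RUBRIC}\n"
        f"Question: {question}\nAnswer: {answer}\nReference data: {reference}"
    )
    raw = client.complete(prompt)  # assumed generic LLM client, not a real API
    return json.loads(raw)

def needs_human_review(grade: dict, threshold: float = 0.7) -> bool:
    """Escalate failing or low-confidence grades to structured human evaluation."""
    return grade["verdict"] == "inaccurate" or grade["confidence"] < threshold
```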
Impact
Ultimately, this work ensured the assistant could be trusted as a professional-grade AI partner – one that users could rely on for decision-making, not just conversation.
Issue #3:
Hallucinations
The model hallucinated frequently, most often through roleplay, fabricating capabilities or making commitments it couldn’t fulfill. Examples included phrases like “I’ll schedule that for you” or “I’ll make those changes now,” despite the assistant having no such functionality.
This wasn’t a minor UX flaw; it represented a high-risk behavior. Potential consequences of hallucinations could include:
The challenge was to detect, prevent, and eliminate roleplay behaviors not only in clear-cut cases, but in subtle, borderline examples where the model’s intent or tone implied capabilities it didn’t have.
Process
Solution
The combination of judge model evaluation and human rater validation created a two-tier detection system that filtered out hallucinations early in the development process.
This layered approach created an adaptable system that could be applied to new datasets, product surfaces, and future model versions.
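A minimal sketch of the two-tier idea, assuming a generic judge-model client; the prompt wording, labels, and routing rules below are illustrative, not the production system:

```python
# Sketch of the two-tier roleplay check: a judge model flags responses that
# commit to actions the assistant cannot perform, and borderline cases are
# escalated to human raters. Prompt wording and routing are assumptions.

ROLEPLAY_JUDGE_PROMPT = """\
Does this assistant response promise or imply an action the assistant will
perform itself (e.g. "I'll schedule that", "I'll make those changes now")?
Answer with one word: YES, NO, or BORDERLINE.
"""

def detect_roleplay(client, response_text: str) -> str:
    """Tier 1: a judge model labels the response; BORDERLINE goes to tier 2."""
    verdict = client.complete(f"{ROLEPLAY_JUDGE_PROMPT}\nResponse: {response_text}")
    return verdict.strip().upper()

def route(verdict: str) -> str:
    """Filter clear violations and escalate ambiguous cases to human raters."""
    if verdict == "YES":
        return "block_and_log"
    if verdict == "BORDERLINE":
        return "human_review"
    return "pass"
```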
Impact
In short, we built not just a safer model but a truthful one: grounded in real capability, transparent about its limitations, and consistently aligned with user expectations.