Reinforcement Learning Monitoring Feedback (RLMF) system for Aurélia AI
Aurélia AI is the AI that writes our prospecting emails. The idea is straightforward: in sales, you want to send the best possible email to your target audience. The challenge is that you rarely know in advance what the best email is. Even if you spend hours searching the internet to craft the perfect personalized icebreaker for one specific person, the odds that your email ends up in the trash are still considerable (of course, Aurélia does that work for you; that's the point).
I – A/B Testing
This is where A/B testing comes into play. By testing combinations of hundreds of email templates, each personalized for the target, you can determine the most effective one. The system I built, Reinforcement Learning Monitoring Feedback (RLMF), uses weighting to estimate how well each email template performs for each persona.
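One common way to run this kind of weighted A/B test is a multi-armed bandit. The sketch below is not the actual RLMF implementation, just a minimal illustration of the idea using Thompson sampling: each (persona, template) pair keeps a Beta posterior over its reply rate, and templates that earn replies get sampled more often.

```python
import random
from collections import defaultdict

class TemplateBandit:
    """Illustrative Thompson-sampling bandit over email templates.

    Keeps one Beta(successes + 1, failures + 1) posterior per
    (persona, template) pair; a reply counts as a success.
    """

    def __init__(self, templates):
        self.templates = templates
        # stats[persona][template] = [successes, failures]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def pick(self, persona):
        # Sample a plausible reply rate for each template and keep the best:
        # uncertain templates still get explored, proven ones get exploited.
        def sampled_rate(template):
            s, f = self.stats[persona][template]
            return random.betavariate(s + 1, f + 1)
        return max(self.templates, key=sampled_rate)

    def feedback(self, persona, template, replied):
        counts = self.stats[persona][template]
        counts[0 if replied else 1] += 1
```

Usage would look like `template = bandit.pick("CTO")`, send the email, then `bandit.feedback("CTO", template, replied=True)` once the outcome is known.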
II – Continuous improvement
We select the best email for a given persona and continuously improve the model with the feedback we collect. Over time, the AI learns to send the most successful emails per persona.
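The continuous-improvement loop can be sketched as an online weight update. The exponential moving average below is an assumed simplification, not the production formula: each reply nudges a template's per-persona weight up, each ignored email nudges it down.

```python
def update_weight(old_weight, replied, lr=0.1):
    """Online update of a per-(persona, template) weight.

    Exponential moving average of the binary reply signal: a simple
    stand-in (assumed form) for the weighting used in RLMF.
    """
    signal = 1.0 if replied else 0.0
    return (1 - lr) * old_weight + lr * signal

# Start from an uninformed weight and fold in four observed outcomes.
weight = 0.5
for replied in [True, False, True, True]:
    weight = update_weight(weight, replied)
```

With mostly positive feedback the weight drifts upward, so the template is selected more often for that persona in the next round.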
III – Discovery of ORPO
Today, I discovered a paper called “ORPO” that describes a similar approach. The ORPO algorithm emphasizes the critical role of Supervised Fine-Tuning (SFT) in preference alignment, showing that a minor penalty for disfavored generation styles is sufficient to achieve preference-aligned SFT (see the paper linked below).
IV – ORPO Algorithm
ORPO introduces a novel, reference-model-free approach called odds ratio preference optimization (analogous to our weighting), which aligns models efficiently without a separate preference-alignment phase. The method penalizes undesired generation styles while providing strong adaptation signals for preferred responses. The paper’s empirical and theoretical results show ORPO surpassing state-of-the-art models across several evaluations, demonstrating the potential of combining Supervised Fine-Tuning with reinforcement-learning-style preference signals for optimal performance.
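Concretely, the paper's penalty term is built on odds: odds(y|x) = P(y|x) / (1 − P(y|x)). The full ORPO loss adds this term, scaled by a coefficient λ, to the usual SFT loss; the sketch below computes only the odds-ratio penalty from two (length-averaged) sequence log-probabilities, as a plain-Python illustration rather than a training-ready implementation.

```python
import math

def odds(logp):
    """odds(y|x) = P / (1 - P), with P the sequence probability."""
    p = math.exp(logp)
    return p / (1.0 - p)

def orpo_penalty(logp_chosen, logp_rejected):
    """L_OR = -log sigmoid(log(odds_chosen / odds_rejected)).

    Small when the model already favors the chosen response,
    large when it favors the rejected one.
    """
    log_odds_ratio = math.log(odds(logp_chosen)) - math.log(odds(logp_rejected))
    sigmoid = 1.0 / (1.0 + math.exp(-log_odds_ratio))
    return -math.log(sigmoid)
```

When both responses are equally likely the penalty is log 2; as the chosen response becomes more probable than the rejected one, the penalty shrinks toward zero, which is the "minor penalty for disfavored generation styles" the paper describes.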
References
More about SFT: https://lnkd.in/emF9JmHJ
The ORPO paper: https://lnkd.in/eaRhNSxN
Aurélia AI: https://lnkd.in/exh9Y6S7