This tutorial walks through how to automatically generate release notes for a repository using
the GitHub API and an LLM. Automatically generated release notes are tough to evaluate,
and you often don't have pre-existing benchmark data to measure them against.
To work around this, we'll use hill climbing to iterate on our prompt, comparing new results to previous experiments to see if we're making progress.
To see a list of dependencies, you can view the accompanying package.json file. Feel free to copy/paste snippets of this code to run in your environment, or use tslab to run the tutorial in a Jupyter notebook.
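First, we need commit data to work with. The listing below comes from an earlier step of the tutorial; as a rough sketch (not necessarily the exact code used), you could fetch a week of commits from the GitHub REST API along these lines. The `CommitInfo` shape, the `fetchCommits` helper, and the `firstWeek` variable are assumptions that match how the data is used later on.

```typescript
// Hedged sketch: pull one week of commits for braintrustdata/braintrust-sdk
// from the GitHub REST API. The CommitInfo interface and the date window are
// assumptions based on how the data is used below, not the tutorial's exact code.
interface CommitInfo {
  sha: string;
  html_url: string;
  commit: {
    message: string;
    author: { name: string; email: string; date: string };
  };
}

async function fetchCommits(
  since: string,
  until: string
): Promise<CommitInfo[]> {
  const url =
    "https://api.github.com/repos/braintrustdata/braintrust-sdk/commits" +
    `?since=${since}&until=${until}&per_page=100`;
  const response = await fetch(url, {
    // Unauthenticated requests work for public repos, but have a lower rate limit.
    headers: { Accept: "application/vnd.github+json" },
  });
  return (await response.json()) as CommitInfo[];
}

// "First week" of commits (window chosen to match the commit dates shown below).
const firstWeek = await fetchCommits(
  "2023-11-28T00:00:00Z",
  "2023-12-05T00:00:00Z"
);
```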
```
----- 86316b6622c23ef4f702289b8ada30ab50417f2d -----
https://github.com/braintrustdata/braintrust-sdk/commit/86316b6622c23ef4f702289b8ada30ab50417f2d
2023-11-28T06:57:57Z
Show --verbose warning at the end of the error list (#50)

Users were reporting that the `--verbose` flag is lost if it's at the
beginning of the list of errors. This change simply prints the
clarification at the end (and adds it to python)

----- 1ea8e1bb3de83cf0021af6488d06710aa6835d7b -----
https://github.com/braintrustdata/braintrust-sdk/commit/1ea8e1bb3de83cf0021af6488d06710aa6835d7b
2023-11-28T18:48:56Z
Bump autoevals and version

----- 322aba85bbf0b75948cc97ef750d405710a8c9f1 -----
https://github.com/braintrustdata/braintrust-sdk/commit/322aba85bbf0b75948cc97ef750d405710a8c9f1
2023-11-29T23:04:36Z
Small fixes (#51)

* Change built-in examples to use Eval framework
* Use `evaluator` instead of `_evals[evalName]` to access metadata. The
latter is not set if you're running Evals directly in a script.

----- ad0b18fd250e8e2b0e78f8405b4323a4abb3f7ce -----
https://github.com/braintrustdata/braintrust-sdk/commit/ad0b18fd250e8e2b0e78f8405b4323a4abb3f7ce
2023-11-30T17:32:02Z
Bump autoevals

----- 98de10b6e8b44e13f65010cbf170f2b448728c46 -----
https://github.com/braintrustdata/braintrust-sdk/commit/98de10b6e8b44e13f65010cbf170f2b448728c46
2023-12-01T17:51:31Z
Python eval framework: parallelize non-async components. (#53)

Fixes BRA-661

----- a1032508521f4967a5d1cdf9d1330afce97b7a4e -----
https://github.com/braintrustdata/braintrust-sdk/commit/a1032508521f4967a5d1cdf9d1330afce97b7a4e
2023-12-01T19:59:04Z
Bump version

----- 14599fe1d9c66e058095b318cb2c8361867eff76 -----
https://github.com/braintrustdata/braintrust-sdk/commit/14599fe1d9c66e058095b318cb2c8361867eff76
2023-12-01T21:01:39Z
Bump autoevals
```
Next, we'll try to generate release notes using gpt-3.5-turbo and a relatively simple prompt.
We'll start by initializing an OpenAI client and wrapping it with some Braintrust instrumentation. wrapOpenAI
is initially a no-op, but later on when we use Braintrust, it will help us capture helpful debugging information about the model's performance.
```typescript
import { wrapOpenAI } from "braintrust";
import { OpenAI } from "openai";

const client = wrapOpenAI(
  new OpenAI({
    apiKey: process.env.OPENAI_API_KEY || "Your OpenAI API Key",
  })
);

const MODEL: string = "gpt-3.5-turbo";
const SEED = 123;
```
```typescript
import { ChatCompletionMessageParam } from "openai/resources";
import { traced } from "braintrust";

function serializeCommit(info: CommitInfo): string {
  return `SHA: ${info.sha}
AUTHOR: ${info.commit.author.name} <${info.commit.author.email}>
DATE: ${info.commit.author.date}
MESSAGE: ${info.commit.message}`;
}

function generatePrompt(commits: CommitInfo[]): ChatCompletionMessageParam[] {
  return [
    {
      role: "system",
      content: `You are an expert technical writer who generates release notes for the Braintrust SDK.
You will be provided a list of commits, including their message, author, and date, and you will generate
a full list of release notes, in markdown list format, across the commits. You should include the important
details, but if a commit is not relevant to the release notes, you can skip it.`,
    },
    {
      role: "user",
      content:
        "Commits: \n" + commits.map((c) => serializeCommit(c)).join("\n\n"),
    },
  ];
}

async function generateReleaseNotes(input: CommitInfo[]) {
  return traced(
    async (span) => {
      const response = await client.chat.completions.create({
        model: MODEL,
        messages: generatePrompt(input),
        seed: SEED,
      });
      return response.choices[0].message.content;
    },
    {
      name: "generateReleaseNotes",
    }
  );
}

const releaseNotes = await generateReleaseNotes(firstWeek);
console.log(releaseNotes);
```
```
Release Notes:

- Show --verbose warning at the end of the error list (#50):
  - Users were reporting that the `--verbose` flag is lost if it's at the beginning of the list of errors. This change simply prints the clarification at the end (and adds it to python).
- Small fixes (#51):
  - Change built-in examples to use Eval framework
  - Use `evaluator` instead of `_evals[evalName]` to access metadata. The latter is not set if you're running Evals directly in a script.
- Python eval framework: parallelize non-async components. (#53):
  - Fixes BRA-661
```
Interesting! At a glance, it looks like the model is doing a decent job, but it's missing some key details, like the version updates. Before we go any further, let's benchmark its performance by writing an eval.
Let's start by implementing a scorer that can assess how well the new release notes capture the list of commits. To make the scoring function's job easy, we'll use a few tricks:

- Use gpt-4, instead of gpt-3.5-turbo, as the grading model.
- Present it only the commit summaries, without the SHAs or author info, to reduce noise.
```typescript
import { LLMClassifierFromTemplate, Scorer, Score } from "autoevals";

const GRADER: string = "gpt-4";

const promptTemplate = `You are a technical writer who helps assess how effectively a product team generates
release notes based on git commits. You will look at the commit messages and determine if the release
notes sufficiently cover the changes.

Messages:
{{input}}

Release Notes:
{{output}}

Assess the quality of the release notes by selecting one of the following options. As you think through
the changes, list out which messages are not included in the release notes or info that is made up.
a) The release notes are excellent and cover all the changes.
b) The release notes capture some, but not all, of the changes.
c) The release notes include changes that are not in the commit messages.
d) The release notes are not useful and do not cover any changes.
`;

const evaluator: Scorer<any, { input: string; output: string }> =
  LLMClassifierFromTemplate<{ input: string }>({
    name: "Comprehensiveness",
    promptTemplate,
    choiceScores: { a: 1, b: 0.5, c: 0.25, d: 0 },
    useCoT: true,
    model: GRADER,
  });

async function comprehensiveness({
  input,
  output,
}: {
  input: CommitInfo[];
  output: string;
}): Promise<Score> {
  return evaluator({
    input: input.map((c) => "-----\n" + c.commit.message).join("\n\n"),
    output,
  });
}

await comprehensiveness({ input: firstWeek, output: releaseNotes });
```
```
{
  name: 'Comprehensiveness',
  score: 0.5,
  metadata: {
    rationale: "The release notes cover the changes in commits 'Show --verbose warning at the end of the error list (#50)', 'Small fixes (#51)', and 'Python eval framework: parallelize non-async components. (#53)'.\n" +
      "The release notes do not mention the changes in the commits 'Bump autoevals and version', 'Bump autoevals', 'Bump version', and 'Bump autoevals'.\n" +
      'Therefore, the release notes capture some, but not all, of the changes.',
    choice: 'b'
  },
  error: undefined
}
```
Let's also score the output's writing quality. We want to make sure the release notes are well-written, concise, and do not contain repetitive content.
```typescript
const promptTemplate = `You are a technical writer who helps assess the writing quality of release notes.

Release Notes:
{{output}}

Assess the quality of the release notes by selecting one of the following options. As you think through
the changes, list out which messages are not included in the release notes or info that is made up.
a) The release notes are clear and concise.
b) The release notes are not formatted as markdown/html, but otherwise are well written.
c) The release notes contain superfluous wording, for example statements like "let me know if you have any questions".
d) The release notes contain repeated information.
e) The release notes are off-topic to Braintrust's software and do not contain relevant information.
`;

const evaluator: Scorer<any, { output: string }> = LLMClassifierFromTemplate({
  name: "WritingQuality",
  promptTemplate,
  choiceScores: { a: 1, b: 0.75, c: 0.5, d: 0.25, e: 0 },
  useCoT: true,
  model: GRADER,
});

async function writingQuality({ output }: { output: string }): Promise<Score> {
  return evaluator({
    output,
  });
}

await writingQuality({ output: releaseNotes });
```
```
{
  name: 'WritingQuality',
  score: 1,
  metadata: {
    rationale: 'The release notes are formatted correctly, using markdown for code and issue references.\n' +
      'There is no superfluous wording or repeated information in the release notes.\n' +
      'The content of the release notes is relevant to the software and describes changes made in the update.\n' +
      'Each change is explained clearly and concisely, making it easy for users to understand what has been updated or fixed.',
    choice: 'a'
  },
  error: undefined
}
```
Wow! We're doing a great job with writing quality, but scored lower on comprehensiveness.
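These one-off scores are a useful sanity check, but to track progress over time we can run the same scorers inside a Braintrust experiment. Here's a minimal sketch of what that first Eval might look like; the project name and experiment name are illustrative assumptions, and the real tutorial may evaluate more than one week of commits.

```typescript
import { Eval } from "braintrust";

// Sketch of the first experiment. "Release notes cookbook" and "simple-prompt"
// are assumed names for illustration only; the data could include several
// windows of commits instead of just the first week.
Eval("Release notes cookbook", {
  experimentName: "simple-prompt",
  data: () => [{ input: firstWeek }],
  task: generateReleaseNotes,
  scores: [comprehensiveness, writingQuality],
});
```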
Braintrust makes it easy to see concrete examples of the failure cases. For example, this grader mentions that the new lazy login behavior is missing from the release notes:
and if we click into the model's output, we can see that it's indeed missing:
Let's see if we can improve the model's performance by tweaking the prompt. Perhaps we were too eager about excluding irrelevant details in the original prompt. Let's tweak the wording to make sure it's comprehensive.
```typescript
function generatePrompt(commits: CommitInfo[]): ChatCompletionMessageParam[] {
  return [
    {
      role: "system",
      content: `You are an expert technical writer who generates release notes for the Braintrust SDK.
You will be provided a list of commits, including their message, author, and date, and you will generate
a full list of release notes, in markdown list format, across the commits. You should make sure to include
some information about each commit, without the commit sha, url, or author info.`,
    },
    {
      role: "user",
      content:
        "Commits: \n" + commits.map((c) => serializeCommit(c)).join("\n\n"),
    },
  ];
}

async function generateReleaseNotes(input: CommitInfo[]) {
  return traced(
    async (span) => {
      const response = await client.chat.completions.create({
        model: MODEL,
        messages: generatePrompt(input),
        seed: SEED,
      });
      return response.choices[0].message.content;
    },
    {
      name: "generateReleaseNotes",
    }
  );
}

await generateReleaseNotes(firstWeek);
```
```
Release Notes:

- Show --verbose warning at the end of the error list
  Users were reporting that the `--verbose` flag is lost if it's at the beginning of the list of errors. This change simply prints the clarification at the end (and adds it to python)
- Bump autoevals and version
- Small fixes
  - Change built-in examples to use Eval framework
  - Use `evaluator` instead of `_evals[evalName]` to access metadata. The latter is not set if you're running Evals directly in a script.
- Bump autoevals
- Python eval framework: parallelize non-async components
  Fixes BRA-661
- Bump version
- Bump autoevals
```
We'll use hill climbing to automatically compare this experiment against data from the previous one. Hill climbing is inspired by, but not exactly the same as, the term used in numerical optimization. In the context of Braintrust, hill climbing is a way to iteratively improve a model's performance by comparing new experiments to previous ones. This is especially useful when you don't have a pre-existing benchmark to evaluate against.
Both the Comprehensiveness and WritingQuality scores evaluate the output against the input, without considering a comparison point. To take advantage of hill climbing, we'll add another scorer, Summary, which will compare the output against the data from the previous experiment. To learn more about the Summary scorer, check out its prompt.
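For intuition, here's a rough sketch of calling the Summary scorer directly. `previousReleaseNotes` is a hypothetical placeholder for the prior experiment's output; when hill climbing with Braintrust, that value is supplied to the scorer as `expected` automatically.

```typescript
import { Summary } from "autoevals";

// Rough sketch only: `previousReleaseNotes` is a hypothetical placeholder for
// the output produced by the previous experiment.
await Summary({
  input: firstWeek.map((c) => c.commit.message).join("\n\n"),
  output: releaseNotes,
  expected: previousReleaseNotes,
});
```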
To enable hill climbing, we just need to use BaseExperiment() as the data argument to Eval(). The name argument is optional, but since we know the exact experiment to compare to, we'll specify it. If you don't specify a name, Braintrust will automatically use the most recent ancestor on your main branch or the last experiment by timestamp as the comparison point.
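Putting it together, a hill-climbing eval might look something like the sketch below. The project name and experiment names are assumptions for illustration:

```typescript
import { BaseExperiment, Eval } from "braintrust";
import { Summary } from "autoevals";

// Sketch of a hill-climbing experiment. "simple-prompt" is the assumed name of
// the previous experiment to compare against; omit `name` to let Braintrust
// pick the comparison experiment automatically.
Eval("Release notes cookbook", {
  experimentName: "comprehensive-prompt",
  data: BaseExperiment({ name: "simple-prompt" }),
  task: generateReleaseNotes,
  scores: [comprehensiveness, writingQuality, Summary],
});
```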
Let's try to address the repeated version-bump mentions explicitly by tweaking the prompt. We'll continue to hill climb.
```typescript
import { ChatCompletionMessageParam } from "openai/resources";
import { traced } from "braintrust";

function generatePrompt(commits: CommitInfo[]): ChatCompletionMessageParam[] {
  return [
    {
      role: "system",
      content: `You are an expert technical writer who generates release notes for the Braintrust SDK.
You will be provided a list of commits, including their message, author, and date, and you will generate
a full list of release notes, in markdown list format, across the commits. You should make sure to include
some information about each commit, without the commit sha, url, or author info. However, do not mention
version bumps multiple times. If there are multiple version bumps, only mention the latest one.`,
    },
    {
      role: "user",
      content:
        "Commits: \n" + commits.map((c) => serializeCommit(c)).join("\n\n"),
    },
  ];
}

async function generateReleaseNotes(input: CommitInfo[]) {
  return traced(
    async (span) => {
      const response = await client.chat.completions.create({
        model: MODEL,
        messages: generatePrompt(input),
        seed: SEED,
      });
      return response.choices[0].message.content;
    },
    {
      name: "generateReleaseNotes",
    }
  );
}

const releaseNotes = await generateReleaseNotes(firstWeek);
console.log(releaseNotes);
```
```
Release Notes:

- Show --verbose warning at the end of the error list (#50): Users were reporting that the `--verbose` flag is lost if it's at the beginning of the list of errors. This change simply prints the clarification at the end (and adds it to python).
- Small fixes (#51):
  - Change built-in examples to use Eval framework.
  - Use `evaluator` instead of `_evals[evalName]` to access metadata. The latter is not set if you're running Evals directly in a script.
- Python eval framework: parallelize non-async components. (#53): Fixes BRA-661.

Please note that there were multiple version bumps and autoevals bumps.
```
Sometimes hill climbing is not a linear process. It looks like while we've improved the writing quality, we've now dropped both the comprehensiveness score and the overall summary quality.