Auto-Tuning with Cognify: The Secret to Boosting Your Gen-AI Workflow Quality by 2.8 Times with $5 in 24 Minutes - Pt. 2

The Autotuning Challenge: How to Use $5 to Cover a Search Space Worth $168K?

In Pt. 1 of this blog series, we discussed the challenge of using $5 and 24 minutes to autotune a search space worth $168K. To recap, with four tunable steps and three cogs per step, each with four options, the search space is (4³)⁴ = 4¹², and searching the entire space by brute force would cost around $168K and take several weeks. While Pt. 1 directly presented Cognify's optimized workflow result, in this blog post we dive deeper into the technique behind Cognify's efficient and effective autotuning: the AdaSeek algorithm.
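As a quick back-of-the-envelope check of the numbers above (the per-evaluation cost below is a hypothetical figure chosen to illustrate the scale, not a number from Cognify's paper):

```python
# Size of the search space described above: four tunable steps,
# three cogs per step, four options per cog.
options_per_cog = 4
cogs_per_step = 3
tunable_steps = 4

configs_per_step = options_per_cog ** cogs_per_step   # 4^3 = 64
total_configs = configs_per_step ** tunable_steps     # (4^3)^4 = 4^12

print(total_configs)           # 16777216 configurations
print(total_configs == 4**12)  # True

# At a hypothetical ~$0.01 per full-workflow evaluation, brute force
# over the whole space lands right around the $168K mark.
print(f"${total_configs * 0.01:,.0f}")  # $167,772
```

Exhaustively evaluating ~16.8 million configurations is clearly off the table, which is why a budget-aware search strategy is needed.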

Cognify’s Secret Sauce: the AdaSeek Search Algorithm

The secret behind Cognify’s outstanding results is a novel adaptive hierarchical Bayesian Optimization (BO) search algorithm called AdaSeek that works as follows:

  1. [Result-Driven] Evaluation-Based Iterative Search: AdaSeek performs workflow autotuning in iterations. Each iteration, it samples a set of cog configurations to apply to the original workflow and evaluates the updated workflow using a user-provided input dataset and evaluators.
  2. [Efficient] Bayesian-Optimization-Based Sampling: AdaSeek samples new configurations using Tree-structured Parzen Estimator (TPE), a Bayesian optimization method that picks new configurations based on past evaluation results (instead of randomly).
  3. [Coverage] Hierarchical Search: AdaSeek organizes cogs into different hierarchies based on their type (by default, the top layer being architecture cogs, the middle layer being step cogs, and the bottom layer being weight cogs). AdaSeek chooses cogs in the topmost layer first, then under each chosen configuration, it chooses the next layer’s configurations, continuing until the bottom layer. This forces each layer to sample some values, giving better coverage of the entire search space even when the total search budget is small.
  4. [Budget-Aware] Adaptive Search Budget Allocation: AdaSeek assigns search budgets based on how promising a configuration looks, which means bad configurations are quickly eliminated, and the search focuses only on the most effective setups.
    • If a certain workflow configuration performs well, AdaSeek focuses more resources on fine-tuning it (by exploring more cogs in the lower layers).
    • If a workflow configuration performs poorly, it abandons that path and explores better alternatives.
    • If exploration under a configuration converges (not showing much improvement over the past few rounds), it also stops exploring this path.
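To make the hierarchical, budget-aware idea concrete, here is a deliberately simplified toy sketch. It is not Cognify's implementation: the option names and scoring function are made up for illustration, the sampling here is uniform rather than TPE-based, and the convergence-based early stopping is omitted. It only shows the budget-funneling structure: probe the top layer cheaply, then spend the remaining budget under the most promising branch.

```python
import random

random.seed(0)

# Hypothetical search space: a top (architecture) layer and two lower layers.
ARCH_OPTIONS = ["arch_A", "arch_B"]
STEP_OPTIONS = ["model_1", "model_2", "model_3", "model_4"]
WEIGHT_OPTIONS = ["prompt_1", "prompt_2", "prompt_3", "prompt_4"]

def evaluate(arch, step, weight):
    """Stand-in for running the tuned workflow on the user's dataset.
    Real AdaSeek scores configurations with user-provided evaluators."""
    base = {"arch_A": 0.6, "arch_B": 0.3}[arch]
    bonus = (0.1 * STEP_OPTIONS.index(step) / 3
             + 0.1 * WEIGHT_OPTIONS.index(weight) / 3)
    return base + bonus + random.uniform(-0.05, 0.05)

def hierarchical_search(total_budget=32, keep_top=1):
    """Simplified AdaSeek-style search: spend a small, equal budget probing
    each top-layer option, then funnel the rest of the budget into the
    best-scoring branch(es). (AdaSeek itself samples with TPE instead of
    uniformly, and also abandons branches whose scores have converged.)"""
    best_cfg, best_score = None, float("-inf")

    # Phase 1: probe each architecture with an equal slice of the budget.
    probe = total_budget // (2 * len(ARCH_OPTIONS))
    arch_scores = {}
    for arch in ARCH_OPTIONS:
        scores = []
        for _ in range(probe):
            cfg = (arch, random.choice(STEP_OPTIONS), random.choice(WEIGHT_OPTIONS))
            s = evaluate(*cfg)
            scores.append(s)
            if s > best_score:
                best_cfg, best_score = cfg, s
        arch_scores[arch] = max(scores)

    # Phase 2: give the remaining budget only to the promising branch(es).
    remaining = total_budget - probe * len(ARCH_OPTIONS)
    winners = sorted(arch_scores, key=arch_scores.get, reverse=True)[:keep_top]
    for arch in winners:
        for _ in range(remaining // keep_top):
            cfg = (arch, random.choice(STEP_OPTIONS), random.choice(WEIGHT_OPTIONS))
            s = evaluate(*cfg)
            if s > best_score:
                best_cfg, best_score = cfg, s
    return best_cfg, best_score

cfg, score = hierarchical_search(total_budget=32)
print(cfg, round(score, 3))
```

With a budget of 32, roughly half the iterations go to probing both architectures and the other half refine the stronger one, mirroring how Figure 1 below directs budget toward the more fruitful step-cog configuration.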

Now, let’s see AdaSeek in action:

Figure 1: AdaSeek Illustration. This example illustrates how AdaSeek spends a search budget of 32 iterations to efficiently navigate through the search of a simple workflow with two architecture variations. Each color represents a step cog variation (e.g., a different model or a rewritten code piece); each symbol represents a different weight (e.g., a system prompt or a set of few-shot examples). Search budgets are gradually directed to the configurations that are more fruitful (the first step cog configuration), and best results continue to be updated. The second architecture’s search is omitted.

Evaluation Results

Now, let’s look at the evaluation of Cognify with six representative gen-AI workflows (HotpotQA, Text-to-SQL, Data Visualization, Financial Analysis, Code Generation, and BIG-bench). As shown below, Cognify improves generation quality by up to 2.8x, cuts cost by up to 10x, and reduces latency by up to 2.7x compared to the original expert-written workflows. Cognify also outperforms DSPy and Microsoft Trace with up to 2.6x higher generation quality, up to 10x lower cost, and up to 3x lower latency.

Figure 2: Cognify’s Optimized Generation Quality and Cost/Latency. Dashed lines show the Pareto frontier (upper left is better). Cost shown as model API dollar cost for every 1000 requests. Cognify selects models from GPT-4o-mini and Llama-8B. DSPy and Trace do not support model selection and are given GPT-4o-mini for all steps. Trace results for Text-2-SQL and FinRobot have 0 quality and are not included.
Figure 3: Accuracy Comparison of Additional Workloads. Code Generation from HumanEval; Word Sorting and Object Counting from BIG-bench.

The Takeaway

Powered by the innovative AdaSeek algorithm, Cognify delivers fully automated, efficient workflow tuning within users’ budgets. Our results outperform state-of-the-art tuning algorithms on representative workflows. We invite you to try out Cognify on your own workflow and share your experience and feedback with us. Let’s make this 24-minute process one of your best decisions.

Curious to dive deeper? Check out the Cognify research paper; get Cognify from GitHub; read Cognify’s Documentation.
