Getting More from Your Test-Time Compute Budget with Portfolio Beam Search
Authors: Dan Elbaz, Oren Pereg, Ronen Laperdon, Daniel Korat, Oren Salzman
TLDR: Portfolio Beam Search (PBS) is a test-time method that integrates financial portfolio theory into LLM inference. Instead of greedily picking the highest-scoring candidate solutions, it hedges its "bets" by treating candidate solutions like financial assets in a portfolio, evaluating them based on their risk-adjusted potential. By diversifying the reasoning process, PBS avoids "reasoning ruts" and optimizes the compute budget to achieve superior accuracy and more robust reasoning.
Introduction
The AI scaling paradigm is undergoing a fundamental shift. As massive-scale pretraining encounters diminishing returns and astronomical costs, the industry is pivoting toward a more resource-efficient frontier where "thinking longer" becomes a viable alternative to "growing larger." This evolution marks a departure from a pure reliance on train-time compute (the traditional brute-force scaling of parameters) toward test-time compute scaling (TTC), which strategically scales a model's processing time during the inference phase to solve complex tasks [1].
This transition is discussed in depth in the research Scaling test-time compute with open models[2], which demonstrates how dynamic inference strategies can close the performance gap. By leveraging methods like Best-of-N sampling and beam search guided by Process Reward Models (PRMs), smaller, optimized models can rival and even outperform their much larger counterparts. These strategies move beyond simple majority voting by using step-level verification to navigate the reasoning space more effectively.
However, simply increasing the compute budget does not guarantee better results. Even advanced search methods can struggle with "tree collapse," where a model fixates on paths that appear promising based on early cumulative rewards, prematurely abandoning the exploration of better alternatives. To fully realize the potential of TTC, efficient decoding must go beyond simple ranking based on PRM scores. In this post, we introduce an adaptation of Portfolio Beam Search (PBS) [3] for LLMs. By framing decoding as an optimization problem, PBS balances expected output quality against model uncertainty and semantic diversity. This ensures that "thinking longer" remains productive rather than redundant, even when the model encounters distributional shifts or unfamiliar reasoning paths.
Searching for Better Answers: Investing in Inference
This post advocates for search-based decoding as a practical framework for scaling test-time compute in LLMs. Rather than relying on a single inference pass, these methods systematically explore the solution space by generating multiple candidates and utilizing a verifier to select the most promising paths. We focus on Process Reward Models (PRMs) [4] as the primary engine for this exploration. By providing fine-grained, step-by-step feedback—rather than a sparse outcome-based reward—PRMs serve as a natural fit for steering complex reasoning via guided search.
Standard decoding methods typically operate as a greedy process, prioritizing the "best" immediate step by identifying the path with the highest cumulative reward. However, this approach often suffers from mode collapse, where candidate solutions become mere syntactic variations of a single underlying logic. Furthermore, by blindly maximizing reward model outputs, greedy strategies ignore the inherent uncertainty in a verifier's predictions. To address these limitations, we took a page from finance's playbook and applied Modern Portfolio Theory [5] to the decoding process, introducing our decoding algorithm: Portfolio Beam Search (PBS). In this framework, candidate trajectories represent assets, while computational effort serves as the finite capital to be allocated across the search space.
While PBS maintains a set of partial solutions, similar to other sampling-based variants, it fundamentally redefines the objective of candidate selection. Rather than exhausting the computational budget on the highest-scoring paths, PBS treats decoding as a portfolio-optimization problem. At each step, the algorithm strategically "invests" its compute by balancing estimated cumulative reward against the risk due to verifier uncertainty and the structural diversity required to maintain unique solution paths. By optimizing across these competing dimensions, PBS achieves a superior trade-off between exploration and exploitation, yielding results that are both high-quality and robustly diverse. This concept is perfectly captured by a quote associated with Warren Buffett [6]:
There is nothing wrong with a ‘know nothing’ investor who realizes it. The problem is when you are a ‘know nothing’ investor, but you think you know something.
The Basics of Portfolio-Optimization: Maximizing Reward, Minimizing Risk
Every investor has two goals:
- Maximize the money they make (Reward).
- Minimize the chance of losing money (Risk).
In finance, a portfolio is simply a collection of investments (like stocks, bonds, or real estate). Portfolio optimization is the process of finding the sweet spot: the best possible mix of investments that achieves the highest return for the lowest risk. The goal is to find the optimal set of weights (w) for the assets that balances these two factors based on the investor's risk tolerance. This approach was formalized by Harry Markowitz (1952), who introduced the risk-expected-return relationship in the mean-variance model for portfolio selection, for which he was awarded a Nobel Prize in Economics.
Formally, to construct a portfolio over N different assets with a given amount of wealth, we need to decide the allocation of wealth to each asset. This involves determining a weight vector for the N assets, denoted as w = (w₁, …, w_N), where all the elements of w are non-negative and their sum equals one. Specifically, given some risk-aversion parameter δ, the assets' mean vector μ, and covariance matrix Σ, solving the portfolio-optimization problem is equivalent to computing a weight vector w dictating the relative amount to invest in each asset:

max_w (wᵀμ − δ·wᵀΣw)  subject to  sum(w) = 1, w ≥ 0
Here:
- (1) Expected Return: The first term, wᵀμ, represents the expected portfolio return.
- (2) Risk Penalty: The second term, δ·wᵀΣw, is the risk penalty (portfolio variance), controlled by the risk-aversion parameter δ.
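To make the objective concrete, here is a minimal numeric sketch of the mean-variance problem. It maximizes wᵀμ − δ·wᵀΣw over the probability simplex using projected gradient ascent; the function names are illustrative, and in practice an off-the-shelf quadratic-programming solver would typically do this job:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]                       # sort descending
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def solve_portfolio(mu, Sigma, delta, lr=0.01, n_iter=5000):
    """Maximize w'mu - delta * w'Sigma w over the simplex via projected
    gradient ascent (gradient of the objective is mu - 2*delta*Sigma@w)."""
    w = np.full(len(mu), 1.0 / len(mu))        # start from uniform allocation
    for _ in range(n_iter):
        w = project_to_simplex(w + lr * (mu - 2.0 * delta * Sigma @ w))
    return w
```

With two uncorrelated assets, μ = (1, 0), Σ = I, and δ = 1, the optimum puts 75% of the wealth in the higher-return asset rather than all of it: the risk penalty buys diversification.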
Method: Portfolio Beam Search with Process Reward Models
Portfolio Beam Search (PBS) is a structured decoding framework that applies the principles of Modern Portfolio Theory to the search process. By systematically balancing exploration and exploitation, PBS provides a robust mechanism for enhancing model outputs at test-time. While it shares the fundamental mechanics of sampling-based Beam Search—aiming to extract a set of optimal partial solutions—it diverges in its selection criteria by identifying and retaining candidates based on optimized resource investment rather than raw PRM score alone.
The core of the PBS framework treats decoding as a dynamic resource-allocation problem, solving a sequence of portfolio-optimization problems at every intermediate step. This approach recasts the traditional investment landscape: the total available computational capacity, or beam width, serves as the budget, and the candidate sequences currently being explored by the model represent the assets. To bridge the gap between financial theory and LLM inference, financial variables are mapped directly onto the mechanics of sequence generation. This translation relies on a precise definition of the expected return vector μ, which represents the predicted quality of the candidate solutions, and the risk matrix Σ, which captures the uncertainty and redundancy among them. Both components are specifically tailored for probabilistic text generation, as detailed in the following section.
Constructing the Portfolio: From LLM Outputs to Financial Variables
To apply portfolio optimization to decoding, we must first translate the raw outputs of our language model and PRM into the language of finance. Specifically, we need to construct a Mean Vector μ and a Risk Matrix Σ.
The Mean Vector μ: Measuring Expected Return
In financial terms, we want to invest in assets with the highest expected return. In PBS, the "return" is the predicted quality of a reasoning path. We use a PRM to provide a sequence of scores for the reasoning steps of a candidate solution. For any given candidate solution i, the cumulative score is calculated as the discounted sum of rewards:

μ_i = Σ_t γ^t · r_{i,t}

where r_{i,t} is the PRM reward for step t of candidate i and γ is a discount factor. This follows the methodology established in the original TTC Hugging Face research, where the reward signal is treated as the ground truth for a candidate's quality.
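A one-line sketch of this aggregation; the discount factor `gamma` is a knob we surface for illustration, with `gamma = 1.0` recovering a plain sum of the PRM's step scores:

```python
def cumulative_score(step_rewards, gamma=1.0):
    """Discounted sum of per-step PRM rewards for one candidate:
    mu_i = sum_t gamma^t * r_{i,t}."""
    return sum((gamma ** t) * r for t, r in enumerate(step_rewards))
```

Applying it to every candidate in the beam yields the mean vector, e.g. `mu = [cumulative_score(r) for r in all_step_rewards]`.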
The Risk Matrix Σ: Modeling Uncertainty and Redundancy

While the score tells us how "good" a sequence is, the Risk Matrix Σ tells us how "reliable" and "unique" it is. This is where PBS truly moves the needle, shifting the objective from merely identifying the highest-scoring sequence to discovering the most robust and diverse set of candidates. To achieve this, we apply a standard statistical decomposition to the risk matrix, separating individual uncertainty from inter-candidate similarity:

Σ = U^(1/2) · C · U^(1/2)

The Uncertainty Matrix, U, acts as a scaling factor representing the uncertainty of each candidate solution, while the Similarity Matrix, C, encodes the correlations between them. By isolating these components, we can explicitly manage the trade-off between individual confidence and collective redundancy. The following sections detail the construction and role of each of these matrices within our specific domain.

The Uncertainty Matrix U: Hedging Against Distributional Shift

The Uncertainty Matrix, U, is a diagonal matrix where each entry u_i represents the individual uncertainty of a candidate solution. Much like the standard sum of rewards, we calculate this value by aggregating the uncertainty encountered at each step of the reasoning path:

u_i = Σ_t u_{i,t}

Here, u_{i,t} represents the uncertainty of candidate i at step t. This formulation allows the framework to account for the temporal accumulation of risk. The per-step term u_{i,t} serves as a proxy for an Out-of-Distribution (OOD) detector, identifying when the PRM encounters reasoning steps that deviate from its training data and are consequently more prone to error.
In our implementation, we define u_{i,t} as the variance of the PRM's softmax output, capitalizing on the observation that well-calibrated Transformers exhibit higher variance and lower confidence when they are unsure.
Note: While we use variance for its computational efficiency, we view this as a flexible design choice; more sophisticated epistemic uncertainty estimators (such as Monte Carlo Dropout or Deep Ensembles) can be integrated without altering the underlying framework.
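As a sketch under the assumption that the PRM exposes a softmax distribution over discrete score values (here a binary bad/good head, purely for illustration), the per-step variance and its aggregation along a path look like:

```python
def step_uncertainty(probs, score_values=(0.0, 1.0)):
    """Variance of the step score under the PRM's softmax distribution:
    Var[s] = E[s^2] - E[s]^2. A flat (uncertain) distribution yields high
    variance; a peaked (confident) one yields variance near zero."""
    mean = sum(p * s for p, s in zip(probs, score_values))
    return sum(p * s * s for p, s in zip(probs, score_values)) - mean ** 2

def path_uncertainty(stepwise_probs, score_values=(0.0, 1.0)):
    """Aggregate uncertainty along a reasoning path: u_i = sum_t u_{i,t}."""
    return sum(step_uncertainty(p, score_values) for p in stepwise_probs)
```

For a binary head, a maximally unsure step (probabilities 0.5/0.5) contributes variance 0.25, while a fully confident step contributes 0.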
By treating OOD candidates as a financial risk, the PBS framework can mathematically down-weight paths where the model is essentially "guessing." This ensures that the computational budget is not wasted on high-scoring sequences that are actually the result of the model struggling with an unfamiliar path. In essence, U acts as a vital safety mechanism, shifting the "investment" away from brittle solutions toward solutions where the reward signal is high and the uncertainty is low.
The Similarity Matrix C: Diversifying Candidate Solutions

The second component of our risk decomposition is the Similarity Matrix, C. This is where the "diversity" of the PBS framework is realized. While the U matrix handles the uncertainty of each individual candidate solution, C captures the semantic relationships between every pair of candidates in the beam. This approach transforms the decoding process from a greedy search for the highest individual scores into a strategic allocation of the "computational budget."
To populate this matrix, we utilize the Sentence Transformers framework[7]. Specifically, we employ Cross-Encoders (also known as rerankers) to calculate a semantic similarity score for each pair of text sequences. This allows the model to mathematically recognize when two different reasoning paths are conceptually identical.
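A sketch of the matrix construction. `pair_scorer` stands in for the Cross-Encoder's `predict` call on text pairs; the word-overlap scorer below is a toy substitute so the example runs without a model download:

```python
import itertools
import numpy as np

def similarity_matrix(candidates, pair_scorer):
    """Build the symmetric similarity matrix C from pairwise semantic scores.
    `pair_scorer` maps a list of (text_a, text_b) pairs to scores in [0, 1];
    in the PBS setup this role is played by a Sentence Transformers
    Cross-Encoder."""
    n = len(candidates)
    C = np.eye(n)  # each candidate is maximally similar to itself
    pairs = list(itertools.combinations(range(n), 2))
    scores = pair_scorer([(candidates[i], candidates[j]) for i, j in pairs])
    for (i, j), s in zip(pairs, scores):
        C[i, j] = C[j, i] = s
    return C

def jaccard_scorer(pairs):
    """Toy stand-in for a cross-encoder: word-overlap (Jaccard) similarity."""
    out = []
    for a, b in pairs:
        wa, wb = set(a.split()), set(b.split())
        out.append(len(wa & wb) / len(wa | wb))
    return out
```

Swapping `jaccard_scorer` for a real reranker changes only the `pair_scorer` argument; the matrix assembly stays the same.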
By including the Similarity Matrix in our risk calculation, PBS prevents the beam from collapsing into a single, repetitive line of thought in which the search space is dominated by nearly identical sequences. Instead, it forces the model to allocate its computational budget across a diverse set of "assets," ensuring the exploration of independent reasoning paths and significantly increasing the probability of discovering the correct solution.
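Assuming the scaling-times-correlation decomposition described earlier (a modeling choice, mirroring how a covariance matrix factors into std · corr · std), the two components combine into the risk matrix as:

```python
import numpy as np

def risk_matrix(uncertainties, C):
    """Assemble Sigma = U^(1/2) C U^(1/2): the diagonal per-candidate
    uncertainties scale the pairwise similarity structure C."""
    s = np.sqrt(np.asarray(uncertainties, dtype=float))
    return C * np.outer(s, s)
```

A candidate that is both highly uncertain and highly similar to its neighbors ends up contributing the largest entries to Σ, and is therefore penalized most by the risk term.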
Putting it All Together: The PBS Decoding Framework
With the Mean Vector μ and Risk Matrix Σ defined for our LLM-generated candidates, the heavy lifting is done. We can now apply the PBS framework during inference.
At each step of the search, we generate new candidate solutions. PBS then determines the optimal weights w, representing the computational "investment" in each candidate solution, by maximizing the risk-adjusted return:

max_w (wᵀμ − λ·wᵀΣw)  subject to  sum(w) = 1, w ≥ 0

In this equation, the parameter λ acts as our "risk-aversion" dial. A higher λ forces the model to prioritize diversity and stability, even if it means sacrificing a bit of raw PRM score.
By solving this optimization problem dynamically, PBS supersedes simple search heuristics. The mathematical tension between the Mean Vector μ, the Uncertainty Matrix U, and the Similarity Matrix C ensures that we aren't just betting on the loudest voices in the room (paths that may have high raw scores), but on the most reliable and diverse set of perspectives the model can offer. The overall algorithm is given in the pseudocode provided below:
Algorithm: Portfolio Beam Search (PBS)
Input: Number of initial candidates N, expansion factor M, risk-aversion parameter λ.
Output: Optimized sequence(s).
# Initial Phase: Sample N independent candidates
candidates = initialize_beam(size=N)
until EOS_TOKEN_REACHED or MAX_DEPTH_EXCEEDED:
# 1. Map LLM outputs to financial variables
# Returns the expected quality vector and the risk matrix
mu, Sigma = compute_mean_and_risk(candidates)
# 2. Solve for optimal resource allocation
# Objective: max (wᵀμ - λwᵀΣw) subject to sum(w) = 1
weights = solve_portfolio(mu, Sigma, risk_aversion=lambda)
# 3. Prune candidates based on 'investment' weights
# Select the top K candidates where K = N/M
current_beam = select_top_k(candidates, weights, k=N/M)
# 4. Expand selected candidates
# Sample M next-step continuations for each candidate in the current_beam
candidates = expand_beam(current_beam, expansion_factor=M)
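The pruning step (3) can be sketched as follows; `select_top_k` mirrors the name used in the pseudocode:

```python
def select_top_k(candidates, weights, k):
    """Prune the beam: keep the k candidates that received the largest
    portfolio weight, i.e. the largest share of the compute 'investment'."""
    ranked = sorted(range(len(candidates)), key=lambda i: weights[i], reverse=True)
    return [candidates[i] for i in ranked[:k]]
```

Unlike a standard beam-search prune, the ranking key here is the optimized portfolio weight rather than the raw cumulative PRM score.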
Results on MATH-500: Benchmarking PBS Performance
To evaluate the impact of PBS on scaling test-time compute, we replicated the experimental framework established in the original Hugging Face study. We used Llama-3.2-1B-Instruct as our primary LLM for generating the solutions; its 1B-parameter scale provides an "unsaturated" performance profile on mathematical benchmarks, making it an ideal candidate for demonstrating gains through compute scaling. Search guidance was provided by Llama3.1-8B-PRM-Deepseek-Data, an 8B PRM that offers granular, step-by-step feedback rather than relying solely on final-answer verification.
We conducted our evaluation on the MATH-500 benchmark, a subset of the MATH benchmark [9] popularized by OpenAI's research on process supervision. This collection consists of 500 problems spanning seven core subjects that are notably challenging for both human experts and standard Large Language Models. To maintain consistency with established research, our experiments followed the hyperparameter configurations defined by DeepMind [8]. Each strategy was tested across a range of compute budgets (generations per prompt) with a fixed beam width. We sampled with a nonzero temperature to diversify the initial candidates and allowed for up to 40 iterations, resulting in a reasoning tree with a maximum depth of 40 steps.
To account for stochasticity and evaluate result stability, we executed the pipeline across three random seeds. In the accompanying figures, solid lines denote the mean performance, while shaded areas represent the min-max range across all experimental trials.
Our evaluation of PBS demonstrates a substantial improvement in sample efficiency compared to existing search baselines. Most notably, at a given budget PBS achieves a mean accuracy that matches the performance of both Diverse Verifier Tree Search (DVTS), a Hugging Face-developed extension of beam search that uses a PRM to guide independent subtrees, and standard Beam Search at twice that budget, effectively doubling the compute efficiency. The scaling advantage becomes even more pronounced when compared against standard baselines: at a budget of only 32 generations, PBS surpasses the performance levels that typically require 256 generations in competitive methods. By bridging this 8x gap in search budget, PBS enables a 1B-parameter model to achieve nearly 55% accuracy, rivaling significantly larger architectures. While our current evaluation was capped by the high computational demand of these experiments, the scaling trends remain incredibly promising. We are currently expanding our research to include larger beam widths and a broader range of datasets to see exactly how far this process-guided approach can scale. 💪 Stay tuned for the next update!
Conclusion
Recent research into scaling test-time compute highlights search as the cornerstone of efficient LLM reasoning. In this post, we introduced an adaptation of PBS for Large Language Models. While traditional search methods often suffer from "tree collapse," where a single high-reward path prematurely dominates and limits exploration, PBS treats compute allocation as an optimization problem that accounts for both generation history and model uncertainty. Inspired by Modern Portfolio Theory, PBS views every candidate reasoning path as a financial asset. Rather than simply betting on the highest-scoring candidates from a PRM, PBS solves a portfolio-optimization problem at each step. It strategically invests its compute budget across a diverse set of candidates by mathematically balancing expected return, risk (reward model uncertainty), and semantic similarity. This approach allows PBS to outperform established baselines like Best-of-N, vanilla Beam Search, and DVTS, enabling a 1B-parameter model to achieve accuracy levels typically reserved for much larger architectures.
While these early results demonstrate significant efficiency gains, the behavior of PBS at higher compute scales and across more diverse domains remains an open frontier. We are currently conducting extensive experiments with expanded search budgets and broader benchmarks to determine the true scaling limits of this approach. We look forward to sharing these findings and their implications 🚀.
References
[1] OpenAI (2024). Learning to Reason with LLMs.
[2] Beeching et al. (2024). Scaling test-time compute with open models.
[3] Elbaz et al. (2024). Portfolio Beam Search: Diverse Transformer Decoding for Offline Reinforcement Learning Using Financial Algorithmic Approaches.
[4] Uesato et al. (2022). Solving math word problems with process- and outcome-based feedback.
[5] Markowitz, H. (1952). Portfolio selection.
[6] Goodman, J. (2018). The Best Quotes Book: 555 Daily Inspirational and Motivational Quotations by Famous People. Lulu.
[7] Reimers, N. and Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks.
[8] Snell et al. (2024). Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters.
[9] Hendrycks et al. (2021). Measuring Mathematical Problem Solving With the MATH Dataset.

