Oncology trials fail far too often, and frequently for similar reasons. We have observed oncology studies fail due to five prevalent issues: assumptions that undermine a trial’s design; inadequate patient selection criteria; problematic phase 2 trials; over-interpretation of subgroup analyses; and optimistic selection of a non-inferiority margin. This article reviews these common issues in the context of specific failed trials and proposes practical countermeasures that do not compromise study integrity.

Assumptions that Undermine Study Design

Let’s consider the failure of Agile 1051, a 288-patient phase 3 trial intended to demonstrate axitinib’s utility as a first-line option for metastatic renal cell carcinoma. Axitinib did not reach the developer’s ambitious goal of a 78 percent improvement in median progression-free survival (PFS), and further development of the inhibitor as a first-line therapy was abandoned.

Several assumptions in the study’s design may partially explain the trial’s failure. First, the study may have underestimated the performance of sorafenib, the control arm, as a first-line therapy. The study assumed that the comparator’s median PFS would range between 5.5 and 5.7 months. However, in more recent studies, median PFS on sorafenib ranged from 7.5 months to 9.1 months. Second, the sample size may have been too small to demonstrate the magnitude of the drug’s benefit. Had a more conservative estimate been established and more patients enrolled, the Agile trial might have been positive.
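The sensitivity of a time-to-event trial to its effect-size assumption can be quantified with Schoenfeld’s approximation for the required number of events. The sketch below assumes proportional hazards and exponential survival, under which a 78 percent improvement in median PFS corresponds to a hazard ratio of roughly 1/1.78; the more conservative hazard ratio of 0.75 is purely illustrative:

```python
from math import ceil, log
from statistics import NormalDist

def required_events(hazard_ratio, alpha=0.05, power=0.80):
    """Schoenfeld approximation: events needed for a two-sided
    log-rank test with 1:1 randomization."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)
    z_b = nd.inv_cdf(power)
    return ceil(4 * (z_a + z_b) ** 2 / log(hazard_ratio) ** 2)

# Optimistic assumption: 78% improvement in median PFS
# (HR ~ 1/1.78 under exponential survival) -- few events needed.
print(required_events(1 / 1.78))
# Conservative assumption (illustrative HR of 0.75) -- the
# required number of events grows several-fold.
print(required_events(0.75))
```

Because the event count scales with the inverse square of the log hazard ratio, even a modest overstatement of the treatment effect can leave a trial badly underpowered.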

To minimize the risk of optimistic assumptions undermining a study’s design, an adaptive design strategy can be deployed; for instance, a sample size re-estimation (SSR). With an SSR promising zone approach, a study’s initially optimistic design assumptions can be adjusted as data accumulate. During the study, an interim analysis checks conditional power against pre-specified thresholds that define zones, and crossing a threshold triggers one of the following adaptations:

  • Stop early if there is overwhelming evidence of efficacy.
  • Stop early for futility if there is low conditional power.
  • Increase the sample size if results are promising.
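The zone logic above can be sketched in a few lines. Conditional power is computed under the current trend using the standard Brownian-motion approximation; the zone cut-offs of 0.10 and 0.80 are illustrative choices, not prescribed values, and an early-efficacy stop would in practice use a separate group-sequential boundary:

```python
from math import sqrt
from statistics import NormalDist

def conditional_power(z_interim, info_frac, alpha=0.05):
    """Conditional power under the current trend, Brownian-motion
    approximation, two-sided level-alpha final test."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    b_t = z_interim * sqrt(info_frac)      # Brownian value at time t
    drift = z_interim / sqrt(info_frac)    # drift estimated from the data
    remaining = 1 - info_frac
    return nd.cdf((b_t + drift * remaining - z_crit) / sqrt(remaining))

def promising_zone_decision(z_interim, info_frac,
                            futility_cp=0.10, promising_cp=0.80):
    """Classify an interim result; zone thresholds are illustrative."""
    cp = conditional_power(z_interim, info_frac)
    if cp < futility_cp:
        return "stop for futility"
    if cp < promising_cp:
        return "promising: increase sample size"
    return "favorable: continue as planned"

print(promising_zone_decision(0.3, 0.5))   # weak interim signal
print(promising_zone_decision(1.5, 0.5))   # promising signal
print(promising_zone_decision(2.4, 0.5))   # strong signal
```

A key practical advantage of the promising zone approach is that the sample size is increased only when the interim data justify it, so the penalty for the initial optimism is paid mid-study rather than at the final analysis.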

Inadequate Patient Selection Criteria

Genomic profiling has revealed an extreme heterogeneity of disease drivers between patients, within a patient’s disease, and as the disease progresses. Trials of targeted oncologics that enroll “all comers” without testing for relevant genetic or molecular markers risk including irrelevant subpopulations that dilute and mask the efficacy of the drug. For example, the MARQUEE trial, which evaluated a selective MET inhibitor, tivantinib, in non-squamous, non-small cell lung cancer patients, enrolled patients regardless of MET expression, despite early-phase data showing that MET inhibitors may be affected by MET protein expression, MET gene copy number, and KRAS mutations. The trial was discontinued early due to futility. Given the preexisting evidence, the developers arguably should have determined which biomarker to use before MARQUEE was initiated.

Early development must address biomarkers and how to characterize relevant patient populations. If questions remain about which populations to enroll, an adaptive enrichment design can identify relevant predictive markers during a phase 2 study. Specifically, according to an FDA draft guidance, developers can adapt entry criteria or sample size if factors can be identified that increase event rate or treatment response. The same population(s) should be studied in phase 3.

Problematic Phase 2 Trials

Three issues in phase 2 commonly lead to pivotal trial failure.

First, newer targeted compounds (e.g., molecular targeted agents, therapeutic vaccines, and immunotherapies) often poorly conform to the classical dose-toxicity and monotonic dose-efficacy models. Studies need to be designed to evaluate efficacy and toxicity outcomes correctly and simultaneously. One solution is a seamless phase 1/2 design, which jointly assesses toxicity and efficacy at various doses using methods such as the bivariate continual reassessment method. This method uses unconfirmed early responses as surrogates for the confirmed efficacy outcome and can reduce irrelevant dose selection, bias accumulation, and potentially, study duration.
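The bivariate continual reassessment method itself fits a parametric dose-toxicity/dose-efficacy model, but the core idea of selecting a dose on both outcomes simultaneously can be illustrated with a much simpler beta-binomial sketch (uniform priors; the toxicity cap and cohort data below are hypothetical):

```python
def select_dose(doses, max_tox=0.30):
    """Pick the dose with the highest posterior mean response rate among
    doses whose posterior mean toxicity rate stays below max_tox.
    Each entry: (label, n_treated, n_responses, n_toxicities).
    Posterior means use a uniform Beta(1, 1) prior: (x + 1) / (n + 2)."""
    best = None
    for label, n, resp, tox in doses:
        p_resp = (resp + 1) / (n + 2)
        p_tox = (tox + 1) / (n + 2)
        if p_tox < max_tox and (best is None or p_resp > best[1]):
            best = (label, p_resp)
    return best[0] if best else "no admissible dose"

# Hypothetical interim data: (dose, patients, responses, toxicities).
cohorts = [("100 mg", 12, 2, 1),
           ("200 mg", 12, 5, 2),
           ("400 mg", 12, 7, 6)]
print(select_dose(cohorts))   # highest response among tolerable doses
```

Note how the highest dose, despite having the most responses, is excluded on toxicity grounds; a toxicity-only dose-escalation design would miss this trade-off entirely.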

Second, most phase 3 studies report lower response rates than the preceding phase 2 studies, and phase 2 response rates do not predict positive phase 3 studies. One likely cause is phase 2 studies’ reliance on investigator analysis of radiologic data, whereas phase 3 trials mandate central review; this discrepancy yields response rates much higher than an independent assessment board would assign. As with central laboratory assessments, independent central imaging is recommended in phase 2 studies, particularly if primary endpoints include PFS or time to progression (TTP).

Finally, an unrealistic “proof of concept” design may falsely suggest that a test compound is efficacious in the targeted indication. Many phase 2 studies attempt to provide “proof of concept” in highly controlled situations with select investigators and homogeneous populations, even if the patient population in a pivotal trial is unlikely to match the targeted demographics, prognoses and concomitant treatments. Consequently, phase 2 studies overestimate the therapeutic effect in both the broader and targeted patient populations, which may then lead the subsequent pivotal study to be underpowered. Thus, when making a “go/no-go” decision to move to phase 3, it is prudent to assume the true therapeutic effect in the targeted patient population is smaller than the observed phase 2 result.
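The advice to assume a smaller true effect can be made concrete with a standard two-proportion sample-size calculation, discounting the observed phase 2 treatment difference by a retention factor. The response rates and the 70 percent retention factor below are hypothetical:

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for comparing two
    proportions with a two-sided level-alpha test."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2) + nd.inv_cdf(power)
    var = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil(z ** 2 * var / (p_treatment - p_control) ** 2)

p_ctrl = 0.20          # assumed control response rate
observed_lift = 0.20   # response-rate improvement seen in phase 2
retention = 0.70       # assume only 70% of the phase 2 effect is real

print(n_per_arm(p_ctrl, p_ctrl + observed_lift))              # optimistic
print(n_per_arm(p_ctrl, p_ctrl + retention * observed_lift))  # discounted
```

The discounted plan roughly doubles the enrollment in this example; that is the price of insurance against the routine shrinkage of phase 2 effects in phase 3.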

Over-interpreted Subgroup Analyses

While regulatory guidelines discuss the planning and presentation of subgroup analyses, developers often still over-interpret results.

Over-interpreted subgroup analysis undermined D9902, a phase 3 trial of sipuleucel-T in patients with metastatic, asymptomatic, hormone-refractory prostate cancer. D9902 was amended after a preceding trial, D9901, failed to achieve its primary endpoint of delaying TTP for all patients, but did indicate that a subgroup of patients with a Gleason score of seven or less were responsive to sipuleucel-T. Thus, a protocol amendment to D9902 restricted the study to this subgroup and enrollment resumed. Two years later, however, an analysis of overall survival of earlier studies indicated the positive treatment effect was actually independent of the Gleason score, and D9902’s enrollment was re-opened to patients with a Gleason score greater than seven. Ultimately, D9902 failed to meet statistical significance in TTP.

To keep the false discovery rate low, e.g., below five percent, one strategy is to conduct each subgroup comparison at a stringent significance level, e.g., 0.001. Any findings based on subgroups should also be supported by a clinical rationale.
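The rationale for the stringent per-comparison level is easy to quantify: with k independent subgroup tests each run at level alpha, and no true effect in any subgroup, the chance of at least one false-positive finding is 1 − (1 − alpha)^k. The ten subgroups below are an arbitrary example:

```python
def prob_any_false_positive(alpha, k):
    """Chance of >= 1 false positive across k independent tests,
    assuming no true effect in any subgroup."""
    return 1 - (1 - alpha) ** k

print(prob_any_false_positive(0.05, 10))    # roughly 40% -- far too high
print(prob_any_false_positive(0.001, 10))   # about 1%
```

At the conventional 0.05 level, a spurious "responsive subgroup" like the Gleason-score finding is close to a coin flip once a handful of subgroups are examined.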

An Optimistic Non-Inferiority Margin

It is often difficult to clearly define the success criteria of non-inferiority studies.

Let’s consider a randomized phase 3 trial of irinotecan combined with cisplatin as a first-line chemotherapy for advanced non-small cell lung cancer. The trial tested the dual hypotheses that irinotecan+cisplatin is superior to cisplatin+vindesine, and that irinotecan monotherapy is not inferior to cisplatin+vindesine. In this study, the threshold for non-inferiority of irinotecan was set where the upper limit of the 95 percent confidence interval for the hazard ratio for overall survival against cisplatin+vindesine was 1.33. The data showed that the upper limit was 1.09, and investigators concluded that irinotecan alone was not inferior to cisplatin+vindesine.

This conclusion is questionable, however, because the defined margin was larger than the standard treatment effect of the control arm. Cisplatin+vindesine was among the platinum-doublet regimens used in most first-line chemotherapy trials, and a meta-analysis estimated the hazard ratio of best supportive care (BSC) compared with cisplatin+vindesine at 1.3. In this trial, the non-inferiority margin was set at 1.33, even larger than the hazard ratio of BSC. With such an unreasonable margin, it is possible that irinotecan is no better than BSC.

Selection of a non-inferiority margin should be grounded in clinical judgment about acceptable loss of efficacy and in the treatment difference between the standard therapy and placebo. For example, an FDA draft guidance suggests that margins can be selected to show that a fraction of the standard therapy’s benefit, often 50 percent of the active control effect, is preserved.
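Under this fixed-margin logic, a margin preserving a given fraction of the active-control effect is computed on the log-hazard-ratio scale. Using the meta-analytic BSC-versus-control hazard ratio of 1.3 and a 50 percent retention fraction, the sketch below shows what the margin would have been; it is simplified in that the guidance also requires using a conservative confidence limit of the control effect rather than its point estimate:

```python
from math import exp, log

def ni_margin(control_effect_hr, retention=0.50):
    """Non-inferiority margin (as a hazard ratio) that allows the test
    drug to lose at most (1 - retention) of the active control's
    effect, computed on the log-hazard-ratio scale."""
    return exp((1 - retention) * log(control_effect_hr))

margin = ni_margin(1.3)       # control effect: HR of BSC vs. control
print(round(margin, 3))       # ~1.140, well below the 1.33 actually used
```

By this calculation, a margin near 1.14 would have been defensible; the 1.33 actually chosen exceeds even the full BSC-versus-control hazard ratio of 1.3, which is what makes the trial’s non-inferiority conclusion suspect.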

Understanding the risks in oncology trial designs and analyses, and how to mitigate them, will equip developers to reduce the failure rate in late-stage oncology trials and move effective drugs to market faster. Practical countermeasures to these risks include adopting an adaptive design and deploying a more stringent statistical significance level to minimize the false discovery rate. Adaptive designs discussed include: adopting an SSR strategy when planning a phase 3 trial to validate study design assumptions, deploying biomarkers with an adaptive enrichment strategy to ensure recruitment of the most relevant patients, and using seamless phase 1/2 designs that jointly assess toxicity and efficacy at various dose levels.