Department of Medicine
School of Medicine Queen's University

News, Innovations and Discoveries Blog

The flaw in ourselves: a Shakespearean guide to improve preclinical research

The fault, dear Brutus, is not in our stars,
 but in ourselves, that we are underlings.

Julius Caesar, Act 1, scene 2, 135–141

You encounter a common headline, “Doctors have discovered a new cancer cure”. You read on and discover the work was performed in cells and in mice. Your heart sinks. Cancer has been cured in mice for decades; however, these successes in preclinical (animal) models have often failed to translate into effective therapies for patients.  Indeed, reviews of the value of preclinical research in many human diseases, including sepsis, have concluded that the results of animal studies had largely been unhelpful, or at least not very predictive of success when the therapy was moved from the bench (cells/rodents) to the bedside (patients).

What is being lost in translation? Is there a fundamental flaw in the preclinical models of human disease? Species differences contribute to the challenges of translational research. For example, mice are remarkably resistant to the toxic effects of bacterial lipopolysaccharide (LPS), which mediates sepsis, whereas humans are exquisitely sensitive. However, rodents and humans have much in common at both the molecular and anatomical levels. Moreover, drugs that treat human disease also tend to work in preclinical animal models. So why do positive findings in rodents so often fail to lead to effective human therapy? I will make the case that the models themselves are not flawed (or at least not fatally flawed); rather, it is poor trial design that leads to false positive findings. To explore this idea, I turned to the Bard, Shakespeare, whose words offer guidance for designing rigorous preclinical trials that are much more likely to predict results in humans.

When it comes to preclinical research, the flaw is in ourselves, not in our stars! This is a paraphrase of a famous quote by Cassius, who is counselling Brutus about the need to depose Julius Caesar, who in Cassius' view was a tyrant striding unopposed across the land. Cassius was making the point that the people should not blame fate (their stars) for Caesar's hegemony; rather, the flaw was in the people themselves, for they behaved without courage or vision (i.e. as underlings). Tempting as it is to spin this blog Trump-ward, let me instead apply this prose to the basis for false positive animal studies.

Let’s begin with a multiple-choice question:

What is the greatest flaw in pre-clinical research?

a) Animal models do not recapitulate human disease

b) Preclinical trial design is not sufficiently robust

The best answer is b. When it comes to preclinical science that does not translate to the human arena, the flaw is (largely) in ourselves: we scientists who fail to design and analyze preclinical trials with the rigor required to yield reliable and reproducible findings.

I use preclinical studies of pulmonary arterial hypertension (PAH) to illustrate common and avoidable flaws in our experimental approach. Pulmonary hypertension (PH) is defined simply as a resting mean pulmonary artery pressure (mPAP) ≥25 mmHg, and there are 5 groups of PH. This blog focuses on PAH (Group 1 PH), a type of PH in which the disease primarily affects the small arteries and veins of the lung circulation. PAH affects women 4x as often as men and often leads to death from failure of the pressure-overloaded right ventricle (RV). PAH is defined as a resting mPAP ≥25 mmHg, a pulmonary capillary wedge pressure (PCWP) <15 mmHg and a pulmonary vascular resistance (PVR) >3 Wood units (18). To diagnose PAH, one has to exclude more prevalent causes of PH, notably left heart disease, COPD and venous thromboembolism.

In PAH, the pulmonary vasculature is dynamically obstructed by vasoconstriction, structurally obstructed by adverse vascular remodeling, and pathologically noncompliant due to vascular fibrosis. PAH-targeted therapies (prostaglandins, phosphodiesterase-5 inhibitors, endothelin receptor antagonists and soluble guanylate cyclase stimulators), used alone or in combination, improve functional capacity and hemodynamics and reduce hospitalization. However, these vasodilators have not been shown to reduce mortality, which remains ~50% at 5 years.

Three conclusions emerge from this introduction. First, new therapies are needed (hence the rationale for preclinical studies). Second, these preclinical studies must be performed in relevant disease models that reproduce key aspects of human PAH (vascular obstruction, RV dysfunction). Third, the experimental design should include measurements of the parameters that define PAH (such as mPAP and PVR) and that determine prognosis (RV function). The table below summarizes some of the common, avoidable faults in preclinical research.

Common areas of weakness in preclinical studies of pulmonary arterial hypertension (PAH):

So, let’s turn to the Bard for advice on building a better preclinical rodent study.

Elements of good design: Trial design must be rigorous, with attention to the risks that come from inadequate sample size, unconscious bias and (rarely) malfeasance. In science, it is key to be true to one's self by acknowledging in our trial design that we are all susceptible to unconscious bias and to the influence of our belief in the beauty of our own ideas. To be true to ourselves (and thus avoid being false to any person), we must design studies with a sample size adequate to ensure that the effects observed are unlikely to be due to chance. Oddly, in many high-impact journals one sees t-tests and ANOVAs applied to sample sizes of <5/group. These parametric analyses assume a normal distribution (an assumption that can be tested), but with <5 data points/group one cannot meaningfully check normality, so such analyses should not be applied. Because PAH therapies are used chronically, candidate therapies must also be given long term in preclinical models; study duration needs to be long enough to assess the intervention in a way that is relevant to human therapy.
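A rough sense of what "adequate sample size" means can come from a standard power calculation. The sketch below uses the normal approximation for a two-sided, two-group comparison, with the effect size expressed as Cohen's d (difference in means divided by the common standard deviation). The function names are hypothetical, and the approximation is optimistic at very small n, where t-based methods (or a statistician) should be used instead.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(effect_size: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate animals per group for a two-sided, two-group comparison.

    effect_size is Cohen's d. Normal approximation: slightly underestimates n
    compared with t-based calculations.
    """
    z = NormalDist().inv_cdf
    n = 2 * (z(1 - alpha / 2) + z(power)) ** 2 / effect_size ** 2
    return ceil(n)


def approx_power(effect_size: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power with n animals per group (ignores the negligible far tail)."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(effect_size * sqrt(n / 2) - z_crit)


print(n_per_group(1.0))                 # 16 per group for 80% power at d = 1
print(round(approx_power(1.0, 4), 2))   # 0.29 with only 4 per group
```

Even under this optimistic approximation, a study with 4 rats per group has only about a 29% chance of detecting a large (d = 1) effect, whereas 80% power needs roughly 16 per group.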

We also have to ensure blinding (i.e. that the scientist administering treatment and measuring outcomes is unaware of whether the rodent is receiving drug or placebo). Scientists are rewarded for publishing, and positive results are more likely to be published, so we must acknowledge the subconscious incentive to find a positive result. Thus, a rigorous placebo-controlled design with blinded acquisition of results is crucial to ensure the integrity of the scientific literature.
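One simple way to implement allocation concealment and blinding is for a colleague who performs no measurements to generate a coded, block-randomized allocation and keep the key sealed until the data are locked. The sketch below shows the idea with Python's standard library only; the function name, the block size of 4 and the letter codes are hypothetical choices, not a prescribed protocol.

```python
import random
import string


def blinded_allocation(animal_ids, arms=("drug", "placebo"), block_size=4, seed=None):
    """Block-randomize animals to arms and hide the arm names behind letter codes.

    Returns (coded, key): coded maps animal ID -> neutral code seen by the
    experimenter; key maps code -> true arm and is held by a third party.
    """
    if block_size % len(arms):
        raise ValueError("block_size must be a multiple of the number of arms")
    rng = random.Random(seed)
    # Assign neutral labels ("A", "B", ...) to arms in random order.
    labels = list(string.ascii_uppercase[: len(arms)])
    rng.shuffle(labels)
    key = dict(zip(labels, arms))
    label_for_arm = {arm: label for label, arm in key.items()}

    coded = {}
    ids = list(animal_ids)
    for start in range(0, len(ids), block_size):
        # Each block holds equal numbers of each arm, shuffled, so group sizes
        # stay balanced even if the study stops early.
        block = [arm for arm in arms for _ in range(block_size // len(arms))]
        rng.shuffle(block)
        for animal, arm in zip(ids[start:start + block_size], block):
            coded[animal] = label_for_arm[arm]
    return coded, key


rats = [f"rat{i:02d}" for i in range(8)]
coded, key = blinded_allocation(rats, seed=2024)
print(coded)  # the experimenter sees only "A"/"B"; the key stays sealed
```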

Although the 3Rs of animal research (replacement, reduction and refinement) mandate that we use as few rodents as possible to address a scientific question, the ultimate waste of an animal is to perform science in a manner that yields unreliable results. While ensuring an adequate sample size (neither too small nor too large), we can also increase the validity of our findings by replicating the study in more than one model of PAH and by correlating findings in rodents with results in human tissues.

Multicenter trial design is an exciting, emerging way to avoid investigator bias and enhance the external validity of a preclinical discovery. The multicenter trial is standard in human research. Just as in human studies, multicenter preclinical trials require a team of investigators at several sites to perform identical studies using a standardized animal model and a common protocol. This allows the team to answer the question while revealing inter-laboratory variance, which reflects the robustness of the findings: a robust finding should be reproducible across the entire group of dispersed research labs. Dr Manoj Lalu (University of Ottawa) recently visited Queen's University and presented an early example of such a randomized, multicenter, preclinical trial that employed a single common protocol. The team evaluated the reproducibility of findings in a model of sepsis created by injecting mice with endotoxin (lipopolysaccharide, LPS). This proof-of-design trial focused on the reproducibility of measurements of cell counts and cytokine levels in bronchoalveolar lavage (BAL) fluid (Figure below).

They used the protocol shown on the left (below). One can see that despite careful standardization they found moderate variability in the results obtained in their network of high-quality labs (note red box).

Wider use of this practice would enhance our confidence in the reliability and reproducibility of preclinical studies and would ease the financial burden on any single lab of generating studies with a sample size large enough to be definitive.
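Inter-laboratory variability of the kind described above can be quantified with a one-way random-effects analysis: split the total variance into a between-lab component and a within-lab component, and report the intraclass correlation (ICC), the share of variance attributable to which lab did the work. The sketch assumes a balanced design (the same number of animals per lab); the function name and the toy numbers are hypothetical and are not data from the trial described above.

```python
from statistics import mean


def variance_components(labs):
    """One-way random-effects ANOVA for a balanced multicenter design.

    labs maps lab name -> list of measurements (same n per lab).
    Returns (between_lab_variance, within_lab_variance, icc).
    """
    groups = list(labs.values())
    k, n = len(groups), len(groups[0])
    if k < 2 or n < 2 or any(len(g) != n for g in groups):
        raise ValueError("need >= 2 labs, each with the same number (>= 2) of values")
    grand = mean(x for g in groups for x in g)
    # Mean squares between and within labs.
    ms_between = n * sum((mean(g) - grand) ** 2 for g in groups) / (k - 1)
    ms_within = sum((x - mean(g)) ** 2 for g in groups for x in g) / (k * (n - 1))
    # Between-lab variance estimate, truncated at zero.
    var_between = max(0.0, (ms_between - ms_within) / n)
    total = var_between + ms_within
    icc = var_between / total if total else 0.0
    return var_between, ms_within, icc


# Toy BAL cell counts (arbitrary units), invented for illustration only.
example = {
    "lab1": [1.2, 1.5, 1.1, 1.4],
    "lab2": [1.9, 2.2, 1.7, 2.0],
    "lab3": [1.3, 1.6, 1.4, 1.2],
}
print(variance_components(example))
```

A high ICC would flag that results depend noticeably on the site, which is exactly the robustness signal a multicenter design is meant to expose.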

What next from the Bard of Avon?

What glitters in science is publication, and it is much easier to publish positive results. This publication bias, exacerbated by the higher cost of a rigorous, well-designed trial and the greater skill required to measure meaningful endpoints (such as by right heart catheterization), creates a bias toward accepting short and superficial studies. Studies that are too brief, or that use surrogate endpoints for death (for example, in rats, weight loss of >10% and lack of grooming), are biased toward finding a survival benefit from a minimally efficacious intervention. If the animal care committee requires that an animal be euthanized when it loses 10% of its weight and stops grooming, this may be ethically commendable, BUT it cannot be counted as "death" in an endpoint analysis. Even in human PAH trials we see this obfuscation, when large studies call a composite endpoint "mortality" and claim a benefit in reducing "mortality" (even though all benefits of the study drug relate to a reduction in softer adverse outcomes, like hospitalization). To quibble with Shakespeare's Juliet: a rose by any other name would NOT smell so sweet! For example, macitentan, a drug approved to treat PAH, reduced a composite endpoint called "mortality" by 45%. However, this benefit was driven exclusively by a reduction in PAH worsening, without any reduction in all-cause or PAH-related mortality. We should be true to ourselves in both preclinical and clinical research.
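The composite-endpoint problem is, at heart, arithmetic: a composite can fall substantially while deaths do not change at all. The sketch below uses made-up event counts (not data from any trial) and hypothetical names, and assumes each subject contributes at most one first event, either death or non-fatal worsening.

```python
from dataclasses import dataclass


@dataclass
class Arm:
    """First-event counts in one trial arm (hypothetical, illustrative only)."""
    n: int          # subjects randomized
    deaths: int     # first event was death (all-cause)
    worsening: int  # first event was non-fatal worsening


def relative_reduction(control_events, control_n, treated_events, treated_n):
    """1 - risk ratio: the fractional reduction in event rate with treatment."""
    return 1.0 - (treated_events / treated_n) / (control_events / control_n)


def report(control: Arm, treated: Arm) -> dict:
    """Relative reduction for the composite endpoint and for death alone."""
    return {
        "composite": relative_reduction(control.deaths + control.worsening, control.n,
                                        treated.deaths + treated.worsening, treated.n),
        "mortality": relative_reduction(control.deaths, control.n,
                                        treated.deaths, treated.n),
    }


placebo = Arm(n=100, deaths=10, worsening=30)
drug = Arm(n=100, deaths=10, worsening=15)
print(report(placebo, drug))  # composite falls by ~37.5%; mortality unchanged
```

Reporting the components separately, as in the second dictionary entry, is what keeps a composite from being mistaken for a mortality benefit.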

Consider the studies that examined statins as a therapy for PAH, an interesting example of both the importance of study duration and the bias toward higher citation of positive studies. Statins are cholesterol-lowering drugs that are beneficial in the primary and secondary prevention of cardiovascular disease, and they were proposed to have the potential to treat PAH. In an initial study of short duration (4 weeks of therapy) using a valid preclinical model (the rat monocrotaline-pneumonectomy model), simvastatin completely eliminated mortality. This was a very impressive result, since monocrotaline usually causes lethal PAH (animals die from RV failure). The paper was cited 311 times, which is a lot of attention (Figure below).

In a study from my lab using a longer duration of therapy, we found no benefit of statin therapy: after an initial slowing of mortality, all rats still died by 8 weeks. We interpreted this as evidence that statins were not beneficial. This negative study was not highly cited.
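Why duration matters can be seen with a Kaplan-Meier estimator. In the hypothetical data below (invented death days, not data from either statin study), treatment delays every death but prevents none. If observation stops at day 28, the treated animals are censored alive and the curve suggests complete protection; extend follow-up to day 56 and survival falls to zero. The function names are hypothetical.

```python
def kaplan_meier(times, died):
    """Kaplan-Meier survival curve as a list of (day, survival) steps.

    times: day each animal died or stopped being observed.
    died: True if the animal died that day, False if censored (alive when last seen).
    """
    data = sorted(zip(times, died))
    at_risk, survival, curve, i = len(data), 1.0, [(0, 1.0)], 0
    while i < len(data):
        day, deaths, leaving = data[i][0], 0, 0
        while i < len(data) and data[i][0] == day:  # group ties on the same day
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((day, survival))
        at_risk -= leaving
    return curve


def survival_at(curve, day):
    """Survival probability on a given day (last step at or before that day)."""
    return [s for d, s in curve if d <= day][-1]


def end_study_at(times, died, last_day):
    """Mimic a shorter study: animals alive at last_day are censored on that day."""
    cut_times = [min(t, last_day) for t in times]
    cut_died = [d and t <= last_day for t, d in zip(times, died)]
    return cut_times, cut_died


# Hypothetical death days (invented for illustration, not study data).
control_days = [20, 22, 24, 25, 26, 27]
treated_days = [35, 38, 40, 45, 50, 55]  # every death delayed, none prevented
all_died = [True] * 6

for label, days in (("control", control_days), ("treated", treated_days)):
    short = kaplan_meier(*end_study_at(days, all_died, last_day=28))
    full = kaplan_meier(days, all_died)
    print(label, "4-week survival:", survival_at(short, 28),
          "8-week survival:", survival_at(full, 56))
```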

So, what happened when statins were tested in humans?

Statins had no benefit. Interestingly, this human study is not as highly cited as the positive study in a preclinical PAH model. We need to avoid the glitter, both in experimental design and in citation, by ensuring that preclinical therapies are given for long enough to be translatable to the human condition, where use is protracted. This is not a criticism of the first rodent paper but a reminder that multicenter trials, larger sample sizes and longer study duration with more robust hemodynamics might have yielded the correct, negative answer earlier (and avoided a human study altogether).

Does Master Shakespeare have further advice? Yes indeed, and for it we turn to The Merchant of Venice, Act 2, Scene 6.

The pretty folly we most often commit in preclinical studies of PAH is failing to properly measure hemodynamics (mean pulmonary artery pressure, left atrial pressure, systemic blood pressure and cardiac output). This is a serious flaw in a disease that has a hemodynamic definition (as noted above: mPAP ≥25 mmHg, PVR >3 Wood units and wedge pressure <15 mmHg). It is especially relevant in genetically modified mice, since many (possibly unrecognized) conditions of the lung and left heart can cause pulmonary hypertension (elevated mPAP) that does not reflect PAH!
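To make the hemodynamic definition concrete, here is a minimal Python sketch that computes PVR as (mPAP minus wedge pressure) divided by cardiac output and checks the three thresholds quoted above. The class and function names are hypothetical, the thresholds are the clinical values given in this post, and applying Wood-unit cut-offs to rodents (whose cardiac output is measured in mL/min, with left atrial pressure standing in for wedge pressure) is an assumption that would need species-appropriate values. None of these quantities can be computed from RV systolic pressure alone.

```python
from dataclasses import dataclass


@dataclass
class Hemodynamics:
    """Closed-chest catheterization readings (hypothetical structure)."""
    mpap_mmhg: float   # mean pulmonary artery pressure
    wedge_mmhg: float  # PCWP; left atrial pressure assumed as a proxy in rodents
    co_l_min: float    # cardiac output in L/min


def pvr_wood_units(h: Hemodynamics) -> float:
    """PVR = transpulmonary gradient / cardiac output, in mmHg*min/L (Wood units)."""
    return (h.mpap_mmhg - h.wedge_mmhg) / h.co_l_min


def meets_pah_definition(h: Hemodynamics, mpap_min: float = 25.0,
                         wedge_max: float = 15.0, pvr_min: float = 3.0) -> bool:
    """Apply the thresholds quoted in the text: mPAP >=25, wedge <15, PVR >3."""
    return (h.mpap_mmhg >= mpap_min
            and h.wedge_mmhg < wedge_max
            and pvr_wood_units(h) > pvr_min)


# Illustrative numbers only.
precapillary = Hemodynamics(mpap_mmhg=45, wedge_mmhg=10, co_l_min=4.0)
left_heart = Hemodynamics(mpap_mmhg=40, wedge_mmhg=22, co_l_min=5.0)
print(pvr_wood_units(precapillary), meets_pah_definition(precapillary))  # 8.75 True
print(pvr_wood_units(left_heart), meets_pah_definition(left_heart))      # 3.6 False
```

The second case has an elevated mPAP and PVR but a high wedge pressure, pointing to left heart disease rather than PAH, which is exactly what an RV systolic pressure reading cannot reveal.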

Too often studies are published that provide only a measure of RV systolic pressure. This does not tell us mPAP and does not exclude the possibility that the model or the intervention alters left heart function. The gold standard for preclinical PAH research is (or in my view should be) to measure hemodynamics in the anesthetized, closed-chest rodent and thereby determine mPAP, cardiac output (CO) and PVR. This allows us to quantify PAH while excluding left heart disease. If the research has a translational goal (i.e. is intended to explore whether a preclinical therapy is suitable for human use), it should also explore potential toxicities of the intervention, as is done in all clinical trials in humans. A drug that treats PAH but is toxic to the blood, kidney, liver or other organs will not go far along the translational pipeline. Despite this truism, we often see the folly of published articles that measure the benefit of an intervention but lack even a rudimentary assessment of its potential toxicity, such as an EKG, CBC, creatinine and liver function tests.

As an example of how important technique is for reproducible, accurate results, consider a commonly used shortcut for measuring PA pressure: opening the chest to allow visualization of the vessels. Although this simplifies catheter insertion, it causes a large (and variable) fall in pulmonary and aortic pressure (see Figure below). This confounder should be avoided, but doing so requires better training in cardiac catheterization of rodents.

What should a lab do if it cannot perform invasive hemodynamics? Several advanced imaging techniques, notably cardiac ultrasound, are reasonable surrogates for catheterization. The image below shows how one can detect pulmonary hypertension by observing notching of the PA Doppler signal. In PAH, the PA acceleration time shortens (note the shorter time from the beginning of flow to peak flow velocity in the PA Doppler signal), and there is subsequent notching of the Doppler envelope (white arrow). This notch reflects a retrograde cancellation wave as the flow bounces off the noncompliant, downstream PA circulation. Moreover, the 2-dimensional echo image shows flattening of the LV by the hugely enlarged RV in this monocrotaline-induced model of PAH.
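The acceleration-time measurement described above can be sketched in a few lines. This assumes the velocity trace for a single beat has already been exported as evenly spaced samples; the 5%-of-peak rule for the onset and end of flow and the function name are hypothetical choices, not a validated echo protocol. The ratio of acceleration time to ejection time is often reported alongside acceleration time, partly to reduce its dependence on heart rate.

```python
def pa_doppler_intervals(velocity, dt_ms, threshold_fraction=0.05):
    """Return (acceleration_time_ms, ejection_time_ms) for one PA Doppler beat.

    velocity: samples (any units) spanning one systolic ejection, dt_ms apart.
    Sign is ignored because PA flow is usually displayed below the baseline.
    Flow onset/end = first/last sample reaching threshold_fraction of the peak.
    """
    mags = [abs(v) for v in velocity]
    peak_idx = max(range(len(mags)), key=mags.__getitem__)  # first peak sample
    threshold = threshold_fraction * mags[peak_idx]
    flowing = [i for i, m in enumerate(mags) if m >= threshold]
    onset_idx, end_idx = flowing[0], flowing[-1]
    acceleration_time = (peak_idx - onset_idx) * dt_ms
    ejection_time = (end_idx - onset_idx) * dt_ms
    return acceleration_time, ejection_time


# Synthetic beat sampled every 2 ms (illustrative values, not real data).
beat = [0, 0, -10, -50, -100, -60, -30, -10, 0]
at, et = pa_doppler_intervals(beat, dt_ms=2.0)
print(at, et, at / et)  # 4.0 10.0 0.4
```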

More from Master Shakespeare?

We do what we love and we love what we do. Too often we fail to use preclinical models to test the theories advanced by our colleagues, even though this is one of the very strengths of a preclinical model over studying humans: an abundance of tissue and cells with which to study disease mechanisms!

There are 3 theories for how PAH cells come to grow too rapidly and avoid programmed cell death (apoptosis). First, there is a heritable mutation in a gene called BMPR2; second, there are changes in epigenetic mechanisms that alter expression of networks of genes; and third, there is a metabolic shift in PAH cells. This Warburg metabolic phenotype is a state in which glucose oxidation is suppressed and the cell's energy is provided by increased uncoupled glycolysis. These 3 mechanisms had not been considered to be related to one another. Two recent studies get full marks for examining "these competing theories" and, in so doing, showing that they share a common intersection. Caruso et al and Zhang et al showed that both genetic mutations (in the BMPR2 gene) and epigenetic changes (decreased expression of a microRNA, miR-124) promote the Warburg metabolic phenotype in PAH endothelial cells and fibroblasts, respectively. It is Warburg metabolism that allows PAH cells to grow fast, and by turning mitochondria off, these poorly behaved cells avoid programmed cell death. These genetic and epigenetic stimuli are new routes to Warburg metabolism, consistent with prior studies from my lab ("Epigenetic attenuation of mitochondrial superoxide dismutase 2 in pulmonary arterial hypertension: a basis for excessive cell proliferation and new therapeutic target" and "Mitochondrial metabolism, redox signaling, and fusion: a mitochondria-ROS-HIF-1alpha-Kv1.5 O2-sensing pathway at the intersection of pulmonary hypertension and cancer.")

Recognition of this metabolic intersection brings theories together and creates harmony where cacophony once ruled. By overcoming doubt and having the courage to examine the theories of others, Caruso and Zhang gained the good they might have lost had they feared to attempt these studies. We now know that Warburgian metabolic remodeling underlies increased proliferation rates in PAH vascular cells and that this metabolic phenotype can be created by many upstream abnormalities, which highlights the importance of Warburg metabolism as a therapeutic target in PAH (see references 1 and 2).

Any final wisdom from Will Shakespeare?

Measure for Measure

The sin I refer to here is complacency with curing the disease only in a cell, which is all too easy. If a paper is focused on fundamental discovery, no translational arm is required. However, if the discovery purports to have found a therapy relevant to treating humans, the onus should be on the investigator to test this in vivo and, ideally, to corroborate the results of animal studies with data from human tissues and cells. This requires addressing all the design quality issues discussed in this blog. As an example, I offer a recent study in which we discovered an epigenetic mechanism in PAH that downregulates expression of the mitochondrial calcium uniporter (MCU), leading to calcium overload in the cytosol (causing vasoconstriction) and calcium depletion in the mitochondria (causing Warburg metabolism). In PAH cells, panel A shows that MCU expression (green protein) is decreased, leading (panel B) to low intramitochondrial calcium, and panel C provides evidence that this is due to increased expression of two microRNAs (miR-25 and miR-138), which decrease translation of the MCU mRNA. We show in C that giving antagomirs fixes the problem in cells, restoring MCU expression… but we did not stop there.

Instead, we aerosolized these antagomirs to animals with PAH in vivo. These small anti-miRs made their way from the distal airways to the small pulmonary arteries, where they worked their magic and regressed PAH (Figure below). This was by far the most expensive and time-consuming part of the study. Nonetheless, from a translational perspective, this in vivo study and the associated assessment of potential off-target toxicity were crucial steps toward recommending that this pathway be tested as a therapy for PAH patients.

While this doesn't prove that these anti-miRs will work in humans with PAH, we showed that they worked in cells (and how they worked), confirmed this in human and animal tissues, and then showed that correcting the epigenetic abnormality in vivo was beneficial and lacked toxicity. This level of evidence is a first step in the journey of a thousand miles to a new clinical treatment. If nothing else, embracing this standard for preclinical research would reduce the number of superficial or premature publications and leave more e-parchment free for the next William Shakespeare.

So, William and I conclude: The Fault is in Ourselves… Dare to Design "virtuous" studies!

The interested reader is referred to the NIH's 2016 guideline, "Principles and Guidelines for Reporting Preclinical Research".

3 Responses to The flaw in ourselves: a Shakespearean guide to improve preclinical research

  1. Jeff Mewburn says:

    “Well said, old mole! Canst work i’ th’ earth so fast?
    A worthy pioneer! ….” Hamlet Act 1 scene 5

  2. SHELBY KUTTY says:

    Enjoyed the Shakespearean guide with all the pearls of wisdom!

  3. Michael Beyak says:

    Great blog. There are a few things that I'd like to add. Along the lines of the statistical issues you've nicely outlined, there is another issue that may importantly affect reproducibility and ultimately translatability. I'm referring to our convention of using "p<0.05" to indicate that a result is important and "likely true." While it is true that p<0.05 means there is less than a 5% chance of finding this result if the groups are the same (i.e. the null hypothesis is true), it does not mean that there is a 5% chance of "being wrong". In fact, this "false discovery rate" (i.e. the chance of declaring that there is a real difference when in fact the groups are the same) is much higher (in many cases at least 30%, especially if n values are small). It depends on the statistical power of your study and the number of observations (assuming all other parts of your design are perfect). There is an excellent article by David Colquhoun (a renowned ion channel researcher) that requires a bit of reading but outlines the issue nicely. However, in the scientific world, p<0.05 is what journals and reviewers expect. (Another good article on p values is "the value of a p valueless paper".) Really, results should be expressed with confidence intervals that the reader can use to assess whether important clinical or scientific differences exist. Maybe there will be a day when we publish the predicted false discovery rate based on our experiments and results. Colquhoun argues that p<0.05 means not much more than "worth another look" (bringing me to the points below).

    I'd also expand upon the issue of publication bias in preclinical research. As you've said above, results are much more likely to be published if they are "positive". This is a well-known fact in the clinical medical literature, but I'd argue that the problem is even more pervasive in the preclinical/basic science realm. Not only do negative results go unpublished; if preliminary experiments turn up an initial negative result, these lines of inquiry are often abandoned. So if only "positive results" are published, and false discovery rates are high (for the reasons outlined above), no wonder we have a problem. This is compounded not only by the preference for positive over negative results, but also by the way we are rewarded with papers and grants when our hypothesis is "right", i.e. when we get the positive results that we predicted (how often has a disappointed student or PI said "the experiment didn't work"?). Finally, with respect to reproducibility, experiments that confirm or refute the results of a previous study on the same topic are often not published. Such studies are deemed "not novel", "incremental" or even "wrong" because a previous study was published that said the opposite. One of the things we learn in science is that an idea is much more likely to be true if "independently confirmed"; however, there are few venues to disseminate these confirmatory results to reassure us about the "truth" of an idea, or to raise questions about it if contrary results are observed. Ultimately, I think that over the next few decades there will be a shift away from the journal model of disseminating results, toward peer review based more on the quality of the experimental design (free of bias, appropriate model) and meaningful statistical analysis. Of course, universities and granting agencies will then have to look at new metrics of research contributions and productivity.

    Sorry for the long reply, and I'm not meaning to be too cynical about the state of preclinical, basic biomedical research (all of the pitfalls above also apply to clinical studies). We've learned a lot, and most of our major breakthroughs in medicine started in the lab. However we need to start to be cognizant of these issues and move our collective fields forward, to reach more robust and meaningful experimental conclusions.
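The false discovery rate point raised in the last comment can be made concrete with the kind of simple calculation Colquhoun describes: of all "significant" results, what fraction are false positives? The answer depends on statistical power and on the assumed prior probability that a tested effect is real; the prior of 0.1 below is an assumption, and the function name is hypothetical.

```python
def false_discovery_rate(prior: float, power: float, alpha: float = 0.05) -> float:
    """Fraction of p < alpha results that are false positives.

    prior: assumed fraction of tested hypotheses whose effect is real.
    power: probability of getting p < alpha when the effect is real.
    """
    false_positives = alpha * (1 - prior)  # null true, yet "significant"
    true_positives = power * prior         # effect real and detected
    return false_positives / (false_positives + true_positives)


print(round(false_discovery_rate(prior=0.1, power=0.8), 2))  # 0.36
print(round(false_discovery_rate(prior=0.1, power=0.2), 2))  # 0.69
```

With a prior of 0.1 and 80% power, about 36% of "significant" findings would be false; with the low power typical of very small groups, the figure rises to roughly 69%, in line with the commenter's "at least 30%".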


Dr. Archer, Dept. Head