Blinding and Expectancy Confounds in Psychedelic Randomised Controlled Trials
This systematic review (2021) argues that de-blinding (breaking of the blind) in randomised controlled trials (RCTs) of psychedelic therapies is leading to an over-estimation, of a size that is not yet defined or measurable, of the outcomes that can be expected outside clinical trials. The authors suggest measures to tackle this and urge caution in interpreting the existing RCTs.
Authors
- Suresh Muthukumaraswamy
Published
Abstract
There is increasing interest in the potential for psychedelic drugs such as psilocybin, LSD and ketamine to treat a number of mental health disorders. To gain evidence for the therapeutic effectiveness of psychedelics, a number of randomised controlled trials (RCTs) have been conducted using the traditional RCT framework and these trials have generally shown promising results, with large effect sizes reported. However, in this paper we argue that estimation of treatment effect sizes in psychedelic clinical trials are likely over-estimated due to de-blinding of participants and high levels of response expectancy generated by RCT trial contingencies. The degree of over-estimation is at present difficult to estimate. We conduct systematic reviews of psychedelic RCTs and show that currently reported RCTs have failed to measure and report expectancy and malicious de-blinding. In order to overcome these confounds we argue that RCTs should routinely measure de-blinding and expectancy and that careful attention should be paid to the clinical trial design used and the instructions given to participants to allow these confounds to be estimated and removed from effect size estimates. We urge caution in interpreting effect size estimates from extant psychedelic RCTs.
Research Summary of 'Blinding and Expectancy Confounds in Psychedelic Randomised Controlled Trials'
Introduction
Over the past two decades there has been renewed clinical interest in the therapeutic potential of psychedelic drugs (for example psilocybin, LSD and ketamine) for a range of psychiatric conditions. Muthukumaraswamy and colleagues argue that standard randomised controlled trial (RCT) methods, particularly the double-masked parallel-group design, face special challenges when applied to psychedelics because the conspicuous psychoactive effects of these drugs tend to unmask participants and raters, and the trial setting and publicity surrounding psychedelics can produce strong response expectancies that bias subjective outcome measures. This paper sets out to examine how these expectancy and masking problems affect causal inference in psychedelic RCTs. Using the Rubin causal model as a conceptual frame, the researchers review the history and logic of RCTs, survey the literature on placebo responses, expectancy and therapeutic alliance, and then report systematic reviews of existing psychedelic trials (ketamine RCTs for depression and serotonergic psychedelic trials) to document current practice. Finally, they assess trial designs and statistical approaches that could reduce bias, and make recommendations for measurement and reporting to improve the reliability of treatment effect estimates.
Methods
The paper combines conceptual analysis with systematic reviews of published clinical trials. Conceptually the authors adopt the Rubin causal model to frame how average treatment effects (ATE) are estimated and how violations of masking and expectancy assumptions can bias those estimates. They also review methodological literature on placebo responses, expectancy instruments, therapeutic alliance measures, blinding indices and related statistical tools. Empirically, several systematic reviews were conducted: one focused on placebo-controlled RCTs of ketamine for major depressive disorder, and others targeted trials of classical serotonergic psychedelics (both RCTs and open-label/other designs). The extracted text indicates search strategies and PRISMA flowcharts are provided in Supplementary Materials, but the main text does not report database names or date ranges. Included-study counts and key coding items (for example, whether pre-treatment expectancy, masking assessments, and assessment of patient–therapist alliance were reported) are summarised in the results. In addition to the systematic reviews, the paper illustrates issues using data from the authors' own crossover ketamine trial and describes practical tools and indices for assessing masking (for example the James and Bang blinding indices, and visual-analogue scales). The authors also outline several clinical trial designs (placebo lead-in, SPCD, crossover, parallel with active comparator, dose-response, pre-treatment antagonist, balanced placebo and enrichment factorial designs) and discuss their appropriateness for psychedelic research. Finally, they present a set of linear modelling approaches that incorporate participants' beliefs about allocation and pre-randomisation expectancy measures as covariates to reduce bias when post-randomisation conditioning cannot be avoided.
Results
The systematic reviews found widespread failure to measure or report participant expectancy and masking success in psychedelic trials. Across 43 included studies (ketamine and other serotonergic psychedelic trials), none reported pre-treatment expectancy measures. For ketamine RCTs, five trials reported some masking assessment, three of these provided quantitative data, but none used established masking indices and only one reported that masking had been maintained. Among serotonergic psychedelic trials, three of six RCTs reported participant masking assessments, and none of these demonstrated successful masking; two additional trials reported rater masking and only one of those showed some degree of rater masking success when assessed by whether raters could guess the trial design. None of the 43 trials reported measurement of patient–therapist alliance. Fourteen trials included more than one baseline assessment, which could permit assessment of disease trajectory, but baseline stability was rarely characterised in reports. The authors present empirical evidence from their recent crossover ketamine trial to illustrate de-masking: of 27 participants asked to guess which study day corresponded to ketamine versus active placebo (remifentanil), 24/27 (89%) guessed correctly; mean confidence for correct guesses was 7.67/10 (SD = 2.12) and for incorrect guesses was 9/10 (SD = 1). The masked outcome assessor guessed correctly 88.5% of the time with mean confidence 6.42/10 (SD = 2.54). Participants attributed correct identification mainly to psychoactive symptoms and overall symptom strength. These data are used as an exemplar of widespread masking failure.
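As a rough illustration of how guess data of this kind can be summarised, the sketch below computes the proportion of correct guesses, a Bang-style blinding index and an exact binomial tail probability from the 24 correct and 3 incorrect participant guesses reported above. Treating the crossover guesses as a single arm with no "don't know" responses, and using 0.5 as the chance level, are simplifying assumptions made here rather than the authors' analysis; the function names are hypothetical.

```python
from math import comb


def bang_style_index(correct: int, incorrect: int, dont_know: int = 0) -> float:
    """Bang-style index for one arm: (correct - incorrect) / all responses.

    Values near 0 are consistent with random guessing, values near 1 with
    complete unmasking, and negative values with systematic wrong guessing.
    """
    total = correct + incorrect + dont_know
    return (correct - incorrect) / total


def binom_upper_tail(k: int, n: int, p: float = 0.5) -> float:
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Participant guesses reported in the summary (no "don't know" option assumed).
correct, incorrect = 24, 3
n_guesses = correct + incorrect

proportion_correct = correct / n_guesses
index = bang_style_index(correct, incorrect)
p_chance = binom_upper_tail(correct, n_guesses)  # assumed chance level of 0.5

print(f"proportion correct: {proportion_correct:.3f}")
print(f"Bang-style index:   {index:.3f}")
print(f"P(at least {correct} of {n_guesses} correct by chance): {p_chance:.2e}")
```

An index well above zero together with a very small tail probability would point to guessing that is much better than chance, which is the pattern the authors describe as masking failure.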
Published methodological literature cited by the authors indicates that loss of masking can inflate effect estimates: one review found unmasked trials may exaggerate effect sizes by about 0.56 standard deviations for subjectively assessed outcomes, and other meta-epidemiological analyses reported roughly a 23% increase in effect sizes for subjective outcomes when masking was absent or unclear. The authors’ systematic reviews also noted that, where masking was assessed in psychedelic trials, almost all studies reported masking failure.
Discussion
Muthukumaraswamy and colleagues interpret their findings to mean that average treatment effects reported in contemporary psychedelic RCTs are likely biased upwards because of two interacting problems: de-masking (participants and raters deducing allocation due to overt psychoactive effects) and high response expectancy created by trial contingencies and public discourse. They emphasise that subjective outcome measures common in psychiatry are especially vulnerable to these influences, and that rigorous randomisation alone does not protect against bias when masking fails. The authors position these concerns within broader methodological debates: they note historical arguments that unmasking due to true efficacy (benign unmasking) may be acceptable, but they stress the distinction between benign and malicious unmasking and argue that psychedelic trials are particularly prone to malicious unmasking via side effects and obvious subjective experiences. They draw on experimental work showing that knowledge of being in an RCT and pre-randomisation expectancy both alter placebo responses, and they highlight the role of therapeutic alliance, therapeutic setting and ritual as non-specific factors that can amplify placebo-like responses. To address these issues the paper recommends routine measurement and reporting of pre-treatment expectancy (for example using CEQ or SETS or a purpose-built scale), systematic assessment of masking (using standard questions and blinding indices such as James or Bang), and longitudinal assessment of patient–therapist alliance. The researchers review nine candidate trial designs and conclude that several common designs (placebo lead-in, sequential parallel comparison, crossover and delayed-treatment designs) are inadvisable for psychedelic efficacy trials because they exacerbate unequal belief distributions or carryover/belief changes. 
Designs that have promise, especially when combined with active comparators, controlled pre-treatment antagonists, dose manipulations, or deceptive/ambiguous instruction strategies (subject to ethical review), are discussed as ways to equalise belief states across arms. Analytically, they propose explicitly modelling participants' beliefs about allocation and, when possible, conditioning on pre-randomisation expectancy measures to mitigate confounding induced by post-randomisation belief (B). Linear models with main effects and interaction terms for treatment and belief are suggested, and the authors note that causal interpretation requires careful consideration of which quantities are conditioned on and how weights are applied when averaging ATE across belief states. They acknowledge limits: the extracted text states that the degree of ATE over-estimation is presently difficult to quantify, search details and full trial-level data are in supplementary materials not reproduced here, and many practical proposals (for example authorised deception) would require careful ethical consultation. Finally, the authors advocate improved reporting practices—publishing participant information sheets and advertising materials, pre-registration of full protocols and statistical analysis plans, and explicit description of the cultural context of trials—to improve reproducibility and allow assessment of expectancy and masking influences.
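The modelling idea described above can be sketched with a saturated linear model in treatment (T) and belief about allocation (B), Y = b0 + b1*T + b2*B + b3*T*B. With two binary predictors the least-squares fit reproduces the four cell means, so the coefficients can be read directly from them. The records below are invented toy numbers used only to show the mechanics, the variable names are hypothetical, and weighting by the overall share of participants who believe they were treated is one possible choice rather than the authors' prescription.

```python
from statistics import mean

# Invented toy records: (treated, believes_treated, symptom_improvement).
records = [
    (1, 1, 12.0), (1, 1, 14.0), (1, 1, 13.0), (1, 0, 6.0),
    (0, 1, 9.0), (0, 0, 4.0), (0, 0, 5.0), (0, 0, 3.0),
]


def cell_mean(t: int, b: int) -> float:
    """Mean outcome among units with treatment t and belief b."""
    return mean(y for ti, bi, y in records if ti == t and bi == b)


cells = {(t, b): cell_mean(t, b) for t in (0, 1) for b in (0, 1)}

# Coefficients of the saturated model Y = b0 + b1*T + b2*B + b3*T*B.
b0 = cells[0, 0]
b1 = cells[1, 0] - cells[0, 0]                                # treatment effect when B = 0
b2 = cells[0, 1] - cells[0, 0]                                # belief effect when T = 0
b3 = cells[1, 1] - cells[1, 0] - cells[0, 1] + cells[0, 0]    # interaction

# Naive estimate that ignores belief: difference in arm means.
naive = mean(y for t, _, y in records if t == 1) - mean(
    y for t, _, y in records if t == 0
)

# Treatment effect within each belief state, averaged with belief weights.
effect_if_b0 = b1
effect_if_b1 = b1 + b3
p_belief = mean(b for _, b, _ in records)
belief_weighted = (1 - p_belief) * effect_if_b0 + p_belief * effect_if_b1

print(f"naive difference in means: {naive:.2f}")
print(f"belief-weighted effect:    {belief_weighted:.2f}")
```

With these toy numbers the naive difference is 6.0 while the belief-weighted effect is 3.0, because belief is unevenly distributed across arms. Belief is measured after randomisation, so conditioning on it can itself introduce bias; this is why the authors stress care over which quantities are conditioned on.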
Conclusion
The authors conclude that standard RCT methods, when applied without modification, are insufficient to ensure unbiased estimation of treatment effects in psychedelic trials because of frequent de-masking and strong expectancies. They recommend that future psychedelic RCTs routinely measure pre-treatment expectancy, assess masking success with standard indices, and monitor therapeutic alliance. Trial design should be chosen to reduce unequal belief distributions (several candidate designs are described as more or less appropriate), and reporting standards should be augmented to include trial culture, participant information materials and full pre-publication of protocols. Finally, the authors note that integrating mechanistic evidence (for example biomarkers or neuroimaging) with improved trial methodology may help strengthen causal claims about psychedelic therapies in future research.
INTRODUCTION
The last twenty years have seen a surge of interest in the therapeutic use of psychedelic drugs for a range of psychiatric/mental health conditions. Psychedelic drugs used in this context represent a fundamental challenge to the modern evidence-based medicine system, which is heavily focussed on gathering evidence of treatment efficacy through "gold-standard" randomised controlled trials (RCTs). A key principle of the RCT approach is that patients are allocated to groups under double-masked conditions, which help to prevent biases that may affect estimates of pre-defined outcome measures. Indeed, many of the outcome measures used in psychiatry are non-objective and therefore particularly vulnerable to various biases, which will be described. While studies of psychedelic drugs can be masked in terms of their design, unless further steps are taken participants can usually discern to which condition they have been allocated, even when active placebos are used. This "unmasking" reduces the impact of masking on controlling bias and lessens the accuracy of treatment effect size estimation. As such, the evidence for efficacy obtained from therapeutic RCTs of psychedelic drugs where masking has clearly failed falls short of the evidence obtained in other areas of medicine. In this paper it is argued that how standard clinical trial designs are applied to gather evidence for the efficacy of psychedelic drugs should be re-considered. This paper begins with a description of the randomised controlled trial and how it is used to establish causal relationships between treatments and outcomes by the formalism of the Rubin causal model. The historical evolution of the RCT and its design features are described along with the emergence of the first wave of psychedelic drug clinical trials. Following this is a discussion on the nature of expectancy effects in RCTs and how expectancy can be modified by experimental contingencies in an RCT.
Next, the nature and importance of masking in RCTs is considered along with methods that can be used to measure the success of masking in RCTs. The state of the psychedelic trial literature since the 1990s is then subjected to several systematic reviews to examine the reporting of expectancy and masking in two sub-groups of studies: RCTs of ketamine in depression and serotonergic psychedelic trials of any design in mood disorders. A clear pattern emerges where the field has failed to measure, or at least report, masking success or pre-treatment participant expectancy. We conclude the paper by considering a number of clinical trial designs that are, and are not, appropriate for use in psychedelic RCTs as well as several statistical models that could be used to estimate treatment effects under different assumptions.
CAUSAL INFERENCE AND THE RANDOMISED CONTROLLED TRIAL
At their heart, clinical trials consist of experimental units of observation (participants), treatments and the evaluation of outcomes. The aim of the randomised controlled clinical trial is to establish the existence of a causal relationship between treatments and outcomes. In this regard it is worth considering the logic behind making causal inferences from clinical trials within the formalism of Rubin's causal model. Let $U$ be a population of units with $u$ denoting an individual unit, and let $Y$ be a response variable we seek to explain, with $Y(u)$ being the response of an individual unit. Take a simple case in which there are two treatment options ($t$ for treatment and $c$ for control) and let $S$ be a variable that indicates to which treatment a unit is exposed, hence $S(u) = t$ or $S(u) = c$. Notably, $Y$ is an attribute of the units in $U$, whereas $S$ indicates exposure to a cause, and hence $Y(u)$ must be measured at some time after exposure of $u$ to $S(u)$. Prior to exposure there are two potential states that $Y$ could take in the future, that is $Y_t$ or $Y_c$. Any unit therefore has two potential values, $Y_t(u)$ and $Y_c(u)$. The causal effect of $t$ on $u$ is called the individual treatment effect (ITE) and can be defined by:
$$\mathrm{ITE}(u) = Y_t(u) - Y_c(u)$$
Importantly, of the two potential outcomes for any individual at any point in time ($Y_t(u)$ and $Y_c(u)$), one is counterfactual and can never be observed; this is termed the fundamental problem of causal inference. That is, at any time-point it is impossible to observe both $Y_t(u)$ and $Y_c(u)$ without making additional assumptions. Two approaches described by Holland to overcome the fundamental problem of causal inference are to make assumptions of homogeneity or invariance. The assumption of homogeneity is that all units in a population would show the same treatment effect, while the invariance assumption is that the treatment effect would be the same at all times for a unit, regardless of previous events/exposures.
These are unrealistic assumptions for many medical interventions, so the approach taken in clinical trials is a statistical one, which is to attempt to quantify the average treatment effect (ATE) over a population of interest $U$. The ATE can be described as:
$$\mathrm{ATE} = E(Y_t - Y_c) = E(Y_t) - E(Y_c)$$
where $E$ represents the expected value. Notably, this approach does not assume that ITE is constant across $U$ (no homogeneity assumption). In terms of observed data in an experiment, a single response variable ($Y_S$) is measured alongside our intervention to give the variable pair $(S, Y_S)$. Observed data then provide:
$$E(Y_S \mid S = t) = E(Y_t \mid S = t) \quad \text{and} \quad E(Y_S \mid S = c) = E(Y_c \mid S = c)$$
But these conditional expected values need not be the same as $E(Y_t)$ and $E(Y_c)$. This is because $E(Y_t)$ is the average over all $u$ in $U$ whereas $E(Y_t \mid S = t)$ is the average over only those $u$ exposed to $t$ (and similarly for $c$). However, randomisation of a sufficient number of samples of $u$ to either $t$ or $c$ can make $S$ statistically independent of the potential outcomes $Y_t$ and $Y_c$ such that it is reasonable that:
$$E(Y_t) = E(Y_t \mid S = t) \quad \text{and} \quad E(Y_c) = E(Y_c \mid S = c)$$
A further assumption of the Rubin causal model is the stable unit treatment value assumption (SUTVA), in which the response of one individual is unaffected by the assignment of other individuals. In clinical trials the SUTVA assumption generally holds as individual cases are typically dealt with in isolation; however, in group intervention situations, such as group therapy, this would become more complex and the SUTVA assumption might not hold. While the aim of the clinical trial is to evaluate the causal effect of treatments by the estimation of ATE, Deaton and Cartwright point out a few key ways that ATE can be easily misinterpreted. Firstly, the ATE estimated for the trial sample does not apply to every member of the trial sample. Indeed, even if ATE = 0 it is incorrect to conclude that the treatment has no effect in any unit of the trial sample, as it is not possible to estimate the counterfactual for each individual unit.
Secondly, when heterogeneous populations are studied, the ATE in the trial sample is more likely to differ from that of the population it is selected from. This is particularly true for small samples, and it is also true that as sample size decreases the probability that the independence assumption is achieved through random sampling decreases, especially if there are many potential covariates of interest. Finally, to the extent that ATE is "true", that is, an accurate, unbiased estimator, it is only true for the trial sample. Even after meeting the internal validity criteria described above it is an open question as to how the effect size may apply to the general population. One should be particularly wary of this for the populations studied in psychedelic drug trials. Out of an appropriate abundance of caution given the powerful effects of the substances involved, and to reduce potential confounders, there are typically rigorous screening criteria applied to potential participants. These criteria, and self-selection / self-exclusion biases, may underlie our informal observation that the attrition rate in the steps leading up to participant randomisation in psychedelic trials is high. One could argue that the higher this attrition rate, the less generalisable ATE becomes to the general population, as units in the trial will be more similar to one another than they are in real-life random samples of that population. The problem of external validity was not lost on the pioneering epidemiologist Sir Austin Bradford-Hill, who stated "At its best a trial shows what can be accomplished with a medicine under careful observation and certain restricted conditions. The same results will not invariably or necessarily be observed when the medicine passes into general use."
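The identification argument in this section can be illustrated with a small simulation in which both potential outcomes are known for every unit, something that is never possible in a real trial. The sketch below compares the observed difference in means under random assignment with the same quantity when exposure depends on a baseline covariate. The population, the effect sizes and the assignment rules are all invented for illustration.

```python
import random
from statistics import mean

random.seed(0)

# Invented population: every unit carries both potential outcomes.
units = []
for _ in range(20000):
    baseline = random.gauss(0, 1)
    y_c = baseline + random.gauss(0, 1)   # outcome under control
    y_t = y_c + 1.0 + 0.5 * baseline      # heterogeneous individual effect
    units.append((baseline, y_t, y_c))

# True ATE = E(Y_t - Y_c); only computable because this is a simulation.
true_ate = mean(y_t - y_c for _, y_t, y_c in units)


def observed_difference(assignments):
    """E(Y_S | S = t) - E(Y_S | S = c) using only the observed outcome per unit."""
    treated = [y_t for (_, y_t, _), s in zip(units, assignments) if s]
    control = [y_c for (_, _, y_c), s in zip(units, assignments) if not s]
    return mean(treated) - mean(control)


# Random assignment makes S independent of the potential outcomes.
randomised = [random.random() < 0.5 for _ in units]
# Non-random exposure: units with a high baseline value end up treated.
self_selected = [baseline > 0 for baseline, _, _ in units]

ate_randomised = observed_difference(randomised)
ate_self_selected = observed_difference(self_selected)

print(f"true ATE:                {true_ate:.2f}")
print(f"randomised estimate:     {ate_randomised:.2f}")
print(f"self-selected estimate:  {ate_self_selected:.2f}")
```

Under random assignment the observed difference should sit close to the true ATE, whereas the covariate-dependent assignment should give a clearly larger value, because the conditional means then no longer equal $E(Y_t)$ and $E(Y_c)$.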
BRIEF HISTORY OF THE RCT AND PSYCHEDELIC RCTS
For causal inference to be made between treatments and outcomes, and for estimates of ATE to be unbiased, the experimentalist must control a number of potential biases. The gold-standard RCT developed through cumulative progress over the course of the 20th century to reduce potential biases and strengthen causal inference in clinical research. In the 1920s, randomisation of participant groups to minimise selection bias became more common. Subsequently, Bradford-Hill introduced the notion of masking in the 1940s in the UK in a series of clinical trials investigating the antibiotic streptomycin for the treatment of tuberculosis. The innovations of Bradford-Hill were tremendously influential, and these standards were successively adopted into legislative frameworks. In the USA, the Kefauver-Harris Amendments to the Food, Drug, and Cosmetic Act were passed in 1962 and the RCT became the tool by which pharmaceutical manufacturers could demonstrate efficacy and safety to regulators. By 1970 it was a requirement of the FDA that new drug applications include RCT results as part of their submissions. Although not specifically defined, it was clear from Congressional directives that well-controlled studies should include "as a minimum, the use of control groups, random allocation of patients to control and therapeutic groups, and techniques to minimize bias". In a parallel history, investigations into the medical use of psychedelic drugs exploded in the 1950s following the discovery of the psychedelic properties of LSD by Albert Hofmann in 1943. Between 1950 and the mid-1960s there were more than a thousand clinical papers including results from 40,000 patients administered psychedelics. Successive tightening of legislation, restricted access to the supply of pharmaceutical-grade psychedelic drugs and societal pressures meant the last NIMH project using LSD on human participants ended in 1968.
Because psychedelic research in this era occurred as modern clinical trial methodology was developing, the trials rarely meet modern standards for RCT design and/or reporting. For example, in a modern systematic review of studies conducted in this era for the use of LSD to treat alcoholism, only 6 of 33 studies met modern criteria of randomisation / masking / outcome assessment. Notably all of these studies happened late in the first psychedelic era.
PLACEBO RESPONSES, EXPECTANCY AND ALLIANCE
According to the Oxford Dictionary, the word placebo has been in use since 1811 and was defined as "medicine given more to please than benefit a patient" (Gaddum). Placebos have been well known to, and used by, physicians for centuries, but it was with the development of the RCT that the placebo response began to be considered in its modern context within the framework of biomedical science. The placebo response refers to the observed changes in symptoms in those participants who have been randomised to a placebo control group in an RCT. The placebo effect is the therapeutic effect of receiving a treatment that is not caused by any inherent properties of the treatment or by the natural progress of the disease. In a landmark paper in 1955, Beecher first attempted to quantify the placebo response. He observed across 15 trials in various conditions that ~35% of patients were satisfactorily treated with a placebo. In a controversial meta-analysis conducted in 2000, Hróbjartsson and Gøtzsche assessed trials which included a placebo arm and a no-treatment arm and found that, in trials with objective and/or binary outcomes, placebo groups showed no significant effects over and above those seen in the no-treatment arm. However, significant effects did emerge for participants in placebo arms over those in no-treatment arms where continuous, subjective outcome measures were used, such as those used in the assessment of pain. It has been argued that the approach of Hróbjartsson and Gøtzsche likely underestimates the placebo effect in clinical practice; in RCTs there is doubt about whether a participant will receive a treatment or a placebo, whereas in clinical practice active treatment is always given. Underestimation may also be driven by the inability to distinguish between those participants in the placebo groups who believed they had received the treatment and those who did not.
As will be shown in this section, belief and expectancy are important mediators of the placebo effect. A number of commentaries more fundamentally criticise the attempt of Hróbjartsson and Gøtzsche to ascertain the size of "the" placebo effect by using meta-analytical techniques. There is no singular placebo effect to be quantified; rather, the size of the placebo effect in any particular clinical trial will be modified by the specific experimental contingencies employed in each trial. To understand the nature of the placebo effect, Miller and Rosenstein usefully discuss four types of healing: active-treatment (AT) induced healing, placebo-induced healing, healing induced by clinician-patient interaction and spontaneous natural healing (NH). The placebo effect (PE) is often defined as a combination of placebo-induced healing and healing induced by clinician-patient interaction, as these are difficult to disentangle. Turner et al. defined the placebo effect as being composed of the non-specific effects of treatment including factors such as "physician attention, interest, and concern in a healing setting; patient and physician expectations of treatment effects; the reputation, expense, and impressiveness of the treatment; and characteristics of the setting that influence patients to report improvement". Natural fluctuations in disease course potentially confound both treatment and control groups of psychedelic trials. This can include spontaneous healing, but also potentially regression to the mean, where participants might be most likely to volunteer for extreme interventions at their time of greatest clinical need. The confidence with which one can distinguish signal from noise in the treatment of an individual, over and above natural history, is enhanced with increased baseline stability of the disease in the pre-treatment period relative to the duration of the treatment period.
In our analysis of recent psychedelic trials (see Section Reporting of Masking and Expectancy in Psychedelic RCTs) baseline stability of disease course was rarely characterised. It is important to note that the control group in the parallel groups design, as described with the Rubin Causal Model, measures the placebo response but not the causal placebo effect, as it is confounded by spontaneous healing effects. In order to distinguish the placebo effect from spontaneous healing a no-treatment control group is required. If such a group (let $n$ = no treatment) is included in an RCT, then one can extend the counterfactual framework of the Rubin Causal Model to estimate an Average (causal) Placebo Effect (APE) as:
$$\mathrm{APE} = E(Y_c - Y_n) = E(Y_c) - E(Y_n)$$
While inclusion of an additional no-treatment group might be useful in order to measure an APE, the estimation would only be unbiased to the extent that adequate masking is maintained (see Section Assessment of Masking Success in Clinical Trials). It has been shown that the knowledge that one is participating in a clinical trial can influence the size of an estimated APE and also ATE. If the RCT is the "gold standard" trial then consider what Kaptchuk calls the "platinum standard" trial, where both clinicians and patients are actually unaware that they are taking part in an RCT. Although ethical standards of informed consent now forbid this type of deception, there are some historical examples that can be drawn upon. In one study of insomnia patients, all patients were given dummy sleeping pills, with half the participants unaware they were even in a trial. The other half of the participants, who knew they were in a trial and thought they could be getting either placebo or active treatment, showed less improvement than those patients who were unaware they were in a trial at all and received a placebo by deception. That is, in this scenario, knowledge of being in an RCT decreased the placebo response.
In a complementary example, patients received either naproxen or placebo in a crossover trial for cancer pain. One group of patients was unaware of the experiment while the other group was aware of the design after giving informed consent to participate. In this case both the placebo and active treatments in the group who gave informed consent were more effective than the same treatments in the group of patients who had no knowledge of the RCT. Further, the placebo response in the informed group was actually greater than the active response in the uninformed group. Taken together, these experiments show not only that the process of informed consent can modify the placebo response but that the direction and magnitude of the placebo response depend on experimental contingencies. Consider the healing components as per Miller and Rosenstein that contribute to a three-arm experiment under the Rubin Causal Model, where N = no-treatment group, C = control (placebo) group and T = treatment group. Then, under a model of linear additivity of therapeutic effects:
$$\begin{aligned} N &= NH_n \\ C &= NH_c + PE_c \\ T &= NH_t + PE_t + AT_t \end{aligned}$$
From this, APE is defined by row subtraction (C-N) and ATE by row subtraction (T-C). However, these subtractions only hold under the assumption that $NH_n = NH_c = NH_t$ and $PE_c = PE_t$. The first assumption is reasonable under randomisation conditions; however, as the platinum standard trials discussed previously demonstrate, the latter equivalence need not hold, and with the potential for unmasking in psychedelic trials it is even more uncertain (see Section Reporting of Masking and Expectancy in Psychedelic RCTs). Previously, interactive models for combining the effects of AT and PE have been proposed. We note that if, for $PE_c \neq PE_t$, $T = NH_t + PE_t + AT_t$, and for an interactive model $T = NH_t + PE + AT_t + d \cdot PE \cdot AT_t$ (where $d$ is an interaction term), then model comparison gives $PE_t = PE + d \cdot PE \cdot AT_t$, so the two models are functionally equivalent for the experimentalist.
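A worked numerical example may help show why the row subtractions depend on equal placebo effects across arms. The sketch below builds the three arm means from the additive components NH, PE and AT and compares T-C when the placebo effect is equal in both arms with T-C when unmasking raises it in the treatment arm. The component sizes are arbitrary illustrative values, not estimates from any trial.

```python
# Arbitrary illustrative component sizes (symptom-score units).
NH = 2.0    # natural healing, assumed equal across arms under randomisation
PE_C = 3.0  # placebo effect in the control arm
AT = 4.0    # active-treatment effect


def arm_means(pe_t: float):
    """Return (N, C, T) under the additive model for a given treatment-arm PE."""
    n = NH
    c = NH + PE_C
    t = NH + pe_t + AT
    return n, c, t


# Case 1: masking holds, so the placebo effect is equal in both arms.
n, c, t = arm_means(pe_t=PE_C)
ape_masked = c - n   # recovers PE_C
ate_masked = t - c   # recovers AT

# Case 2: unmasking raises the placebo effect in the treatment arm only.
PE_T_UNMASKED = 5.0
n, c, t = arm_means(pe_t=PE_T_UNMASKED)
ate_unmasked = t - c                 # AT plus (PE_t - PE_c)
bias = ate_unmasked - AT

print(f"masked:   APE = {ape_masked}, T - C = {ate_masked}")
print(f"unmasked: T - C = {ate_unmasked}, bias = {bias}")
```

In the second case T-C equals 6.0 rather than 4.0; the excess of 2.0 is exactly the difference between the two placebo effects, which the subtraction cannot separate from the active-treatment effect.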
Whether the additive model for psychedelic RCTs holds is untested, but proponents of psychedelic medicine have long held that both "set and setting" are essential components in the healing process, suggesting an interactive model. Strassman has gone further, arguing that the effects of psychedelics are entirely due to their amplification of the placebo response. That is, $C = PE_t$ and $T = e \cdot PE_t$ ($e$ could be either a constant or a function). In Section Trial Designs for Psychedelic RCTs some alternative designs are considered that might help to separate specific and non-specific (or set and setting) factors of psychedelic therapy. However, even within Strassman's framework, to demonstrate efficacy it is still necessary to show that $\hat{e}$ for a psychedelic intervention is significantly different to $\hat{e}$ for a (successfully masked) placebo. What other factors, besides the knowledge that one is in an RCT, might influence the placebo effect in an RCT? In a study examining the placebo effect in acupuncture, Kaptchuk et al. separated the placebo effect into three components: a patient's response to observation and assessment, response to therapeutic ritual, and response to patient-clinician interaction. They found that all three elements contribute to the placebo effect, with patient-clinician interaction being the largest contributor. Drawing on the common factors theory of psychotherapy, Gukasyan and Nayak identify four factors relevant to psychedelic medicine: 1) the therapeutic relationship between patient and therapist, 2) the healing setting, 3) the rationale, conceptual scheme, or myth underlying the treatment and 4) the ritual of the treatment itself. Many of these contextual factors could be manipulated in psychedelic designs to test their influence (see Section Trial Designs for Psychedelic RCTs). Physical features of the treatment such as its appearance, route of administration, price, brand and expected dose can all modify the placebo effect in mood disorders.
A key contributor to the placebo response in RCTs is expectancy, primarily response expectancy. Response expectancy refers to patients' predictions or beliefs about their own nonvolitional response to treatment, whereas stimulus expectancies (less relevant in this context) refer to anticipation of external events. Zilcha-Mano et al. argue that expectancy, similar to alliance, can be composed of trait-like and state-like components. The trait-like component reflects patients' individual differences in levels of expectancy whereas the state-like component refers to changes that occur over the course of treatment. Expectancy is known to influence the estimation of ATE in RCTs of standard antidepressants. A meta-analysis of RCTs of antidepressants compared placebo-controlled trials to active-comparator trials, where another approved antidepressant is used as a comparator condition. It was found that the overall odds of an antidepressant response being observed are significantly higher in active-comparator trials (odds ratio = 1.79). A reasonable explanation of this is that in active-comparator trials patients will have higher expectancy of receiving an active treatment. Further, a meta-regression approach found that the greater the probability of receiving a placebo, the smaller the placebo response. Similarly, a meta-analysis of 1504 patients undergoing psychotherapy in 19 studies found that patient pre-treatment expectancy of positive outcome predicted treatment efficacy, albeit relatively weakly. In the psychedelic research field, a recent microdosing study demonstrated that positive expectancy scores at baseline predicted improvements in well-being during the trial. One existing tool that can be used to measure expectancy is the six-item Credibility/Expectancy Questionnaire (CEQ), which has been shown to have high internal consistency and test/re-test reliability. In an antidepressant trial, Rutherford et al. administered the CEQ as a simple two-item scale suitable for antidepressant trials.
Item 1 uses the question, "At this point, how successful do you think this treatment will be in reducing your depressive symptoms?" and is rated on a 9-point Likert scale from not at all successful (1) to very successful (9). Item 2 uses the question, "By the end of the treatment period, how much improvement in your depressive symptoms do you think will occur?" and is rated on an 11-point Likert scale with anchors at 0% and 100%. Rutherford et al. used the sum of these two questions to provide an overall CEQ score and found that higher baseline expectancy of improvement correlated with improved treatment outcomes. In a prospective RCT designed to manipulate expectancy, Rutherford et al. randomised patients with a 1:1 allocation ratio to either an open-label antidepressant intervention or a parallel-groups trial. Patients were informed of which group they were in: open-label patients were told they would receive the active intervention, while those in the parallel-groups trial were told they could be in either an active or a placebo group. Expectancy was measured before and after randomisation, and it was found that expectancy of a positive outcome significantly improved post-randomisation in the open-label group but not in the parallel-groups trial, indicating successful experimental manipulation of the state-like component of response expectancy. Following an 8-week course of antidepressants, ATE was found to be larger in the open-label group compared to both arms of the parallel-groups trial, with post-randomisation expectancy acting as a mediator of ATE. Not only is this trial an elegant experimental demonstration of how expectancy can be manipulated to alter ATE, it also demonstrates how novel RCT designs can be used to decouple expectancy from ATE, and how expectancy can be measured over time. Another standardised scale that could be used is the Stanford Expectation of Treatment Scale (SETS), which captures some different domains to the CEQ.
Ultimately, however, psychedelic medicine may benefit from the development of a purpose-built expectancy scale given the unique features of these studies.
Although most studies examining the potential use of ketamine in depression to date (see Section Reporting of Masking and Expectancy in Psychedelic RCTs) have not incorporated psychotherapy (with some exceptions), the model of psychedelic medicine being adopted for classical psychedelics such as LSD and psilocybin incorporates a strong element of psychotherapy. Generally in psychotherapy, the development of a strong working relationship (therapeutic alliance) between patient and therapist has a crucial effect on the success of treatment, with a meta-analysis including 30,000 patients from 295 studies demonstrating an effect size of d=0.579. Indeed, it has been shown that placebo responses (to acupuncture) are enhanced when practitioners have a warm, friendly manner compared to a neutral manner. Consistent with the notion of therapeutic alliance, meta-analyses of RCTs of standard antidepressants have shown that an increased number of follow-up visits accounts for ~40% of the placebo response in those allocated to placebo arms. Indeed, increased visit numbers are thought to partly account for the increase in placebo response size that has been found in RCTs of antidepressants over the last 30 years. Measurement of patient-practitioner therapeutic alliance can be performed using a number of well-established psychometric tools such as the CALPAS, HAQ, HAQII and WAI, which could easily be incorporated in a longitudinal manner in psychedelic studies that incorporate elements of psychotherapy. Several studies now suggest that alliance can be conceptually separated into trait-like and state-like components, of which the modifiable state-like components relate to therapeutic outcome.
MASKING IN TRIALS OF PSYCHEDELIC DRUGS
The International Council for Harmonisation of Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) guidelines summarise the role of masking in clinical trials: "Blinding or masking is intended to limit the occurrence of conscious and unconscious bias in the conduct and interpretation of a clinical trial arising from the influence which the knowledge of treatment may have on the recruitment and allocation of subjects, their subsequent care, the attitudes of subjects to the treatments, the assessment of end-points, the handling of withdrawals, the exclusion of data from analysis, and so on. The essential aim is to prevent identification of the treatments until all such opportunities for bias have passed." (pg. 11). One important issue related to psychedelic drug trials is that while trials can be double-masked "by design", retrospective examination of the maintenance of masking is rarely reported, and when it is reported it often reveals that masking was not maintained. For example, take our recent crossover RCT assessing antidepressant responses to ketamine (0.43 mg/kg), which included the active placebo remifentanil (1.7 ng/ml). For reference, a target-controlled infusion of 1.7 ng/ml is a fairly sedative dose, with much higher levels causing increasing frequency of apnoea in participants. At the end of the trial, prior to being unmasked, 27 participants were asked which study day they thought was which. Participants guessed correctly 89% of the time (24/27) and scored their confidence with an average of 7.67/10 (SD=2.12). The other 11% (3/27) guessed incorrectly with an average confidence of 9/10 (SD=1). The most common reason given by participants for a correct guess was psychoactive symptoms, followed by stronger symptoms overall, and then the magnitude of the antidepressant response to ketamine. Incorrect guesses appeared to be driven by mistaken expectations of how the drugs would feel.
Similarly, the masked outcome assessor guessed correctly 88.5% of the time and scored their confidence with an average of 6.42/10 (SD=2.54). These data suggest that for our trial, breaking of masking was widespread. We contend that ATE is overestimated in all psychedelic clinical trials due to unmasking effects. Given the obvious psychoactive effects of psychedelic drugs, those in an active intervention group likely know they have received the treatment and may show a greater treatment response due to expectancy effects. Those participants who receive a placebo intervention may well know they have received the placebo, and disappointment may decrease their placebo response. Added together, these tendencies would increase the overall estimate of ATE. Consistent with this, a previous systematic review of medical RCTs examined studies which had both masked and non-masked trial arms along with subjectively assessed treatment outcomes and demonstrated that effect sizes can be exaggerated by 0.56 standard deviations in unmasked trials. Similarly, meta-epidemiological analyses have shown that studies that use subjective outcomes to measure ATE and that lack, or have unclear, double-masking procedures demonstrate a 23% increase in effect size. In a meta-epidemiological analysis of osteoarthritis RCTs, it was found that effect sizes were significantly reduced in masked versus non-masked trials. Application of the Cochrane Risk of Bias Tool (RoB-2) to psychedelic trials is likely to find at least moderate risk of bias in the "deviations from the intended interventions" and "effect of adhering to intervention" categories. A previous Cochrane review on the use of ketamine in treatment-resistant depression raised concerns around the adequacy of masking in these studies.
In antidepressant trials it has been shown that when active placebos are used to mimic the side effects of active treatment, effect sizes tend to be more modest than generally reported, suggesting that unmasking may inflate ATE estimation in antidepressant trials.
One potential counterargument to the importance of maintaining masking in trials is that the clinical benefits of psychedelics are so large in terms of effect sizes that they outweigh the need for masked trials, or indeed RCTs at all. This is not a new idea in medical practice. For example, Glasziou et al. provide a number of examples where RCTs have not been needed for treatments to be accepted into medical practice because the effect sizes were dramatically large. Some examples include defibrillation for ventricular fibrillation, insulin for diabetes, and the effects of anaesthesia. Notably, all sixteen of the examples provided by Glasziou et al. had clear objective outcome measures. Moreover, the effects of these interventions are so large that it would be neither ethical to conduct RCTs nor plausible to mask them. Philip's paradox states that the more potent a therapeutic treatment, the less likely its efficacy can be shown in a double-masked trial. That is, the more powerful the treatment, the less likely the patient (and observers) can be masked and the more likely biases will affect the study. The large-effect argument rests on the premise that an observed effect size will be comfortably larger than the purported effects of all potential confounders. But are the effect sizes in psychedelic trials so large? In a trial of ketamine for depression with midazolam as a placebo control, 28% of participants assigned to the placebo condition were classified as responders, with a greater than 50% reduction in depression scores.
Similarly, in our study which used remifentanil as an active control for ketamine, 10% of participants were classified as responders after the placebo intervention. These relatively large placebo effects seen in psychedelic trials, combined with the subjective nature of the outcome measures, make it difficult to rely on this 'large-enough' effect size argument. By comparison, the outcomes of the interventions mentioned by Glasziou et al. involve clear objective indicators such as mortality. To solve Philip's paradox for clinical trials, Howick considers the nature of what a confound is. Generally, in order for a variable to be considered a confounder in a clinical trial, the variable must affect the outcome, be unequally distributed across groups and not be on the causal pathway between treatment and response. Howick distinguishes two types of unmasking: benign unmasking, which is caused by the effects of the intervention on the target disorder, and malicious unmasking, which is caused by side effects of the intervention. Psychedelic drugs create a rich array of psychological and physiological experiences and it would be difficult to argue which were incidental to any effects required for clinical efficacy. As we have seen in the qualitative reporting from our ketamine trial, the primary reason given by participants for drug identification was the psychoactive effects of the drugs, not their beliefs about antidepressant efficacy. Howick argues that in order to serve its function, double-masking must have successfully controlled malicious unmasking. In subsequent sections we will consider various trial designs that might aid in reducing de-masking effects in psychedelic trials, but first methods for evaluating masking success in clinical trials will be considered. We end this section by noting that if one does not believe that psychedelic trials could ever be adequately masked, then it becomes unclear what the point of an RCT for psychedelics would be.
Such a study would always be confounded, and the comparator control group would have no utility. If masking cannot be successful, then running single-group open-label trials would be more useful for accurately characterising treatment response given resource constraints, while still allowing examination of the effects of baseline covariates such as expectancy. However, this would preclude causal inference regarding ATE.
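As a rough check on the participant guess data reported in this section (24 of 27 correct), the sketch below computes an exact one-sided binomial probability of observing at least that many correct guesses if participants were guessing at random. The chance level of 0.5 and the function name `binomial_upper_tail` are assumptions made for illustration; this is not an analysis reported by the authors.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, p_chance: float = 0.5) -> float:
    """Return P(X >= successes) for X ~ Binomial(trials, p_chance)."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    return sum(
        comb(trials, k) * p_chance**k * (1 - p_chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Counts taken from the text: 24 of 27 participants guessed correctly.
correct, total = 24, 27
proportion_correct = correct / total

# Assumption: a participant guessing at random is right half of the time.
p_value = binomial_upper_tail(correct, total, p_chance=0.5)

print(f"proportion correct = {proportion_correct:.3f}")
print(f"one-sided exact binomial p = {p_value:.2e}")
```

A very small probability here only indicates that guesses were better than chance; it does not distinguish benign from malicious unmasking, which is why the reasons participants gave for their guesses matter.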
ASSESSMENT OF MASKING SUCCESS IN CLINICAL TRIALS
The CONSORT 2010 statement does not recommend that investigators report on the efficacy of masking of participants in trials, only that any explicit breaks in masking are reported. This was a clear difference from the CONSORT 2001 statement, which recommended that the success of masking be evaluated. Even with the earlier CONSORT recommendation, masking success was infrequently evaluated and reported in clinical trials. In a meta-analysis of medical and psychiatric RCTs, Fergusson et al. evaluated the number of trials that assessed masking. For psychiatric trials, only 8/94 assessed masking success, with four reporting sub-optimal masking. Only two of these measured masking in outcome assessors. Similarly, Hróbjartsson et al. inspected 1599 medical RCTs from the Cochrane Register of Clinical Trials and found only 31 reported tests for the success of masking. On being contacted, a subset of authors (12% of 200) reported they had conducted but not reported measurements of masking effectiveness. In response to these papers, Sackett pointed out that there is uncertainty around whether responses to masking assessment questions reflect potential failure of masking or accurate assumptions about the intervention's clinical effects. Senn further argued that from one perspective the point of a clinical trial is for patients to become unmasked due to efficacy, as long as the unmasking is not incidental to efficacy. Such arguments were followed by the removal of masking assessment from the CONSORT 2010 statement. However, the arguments of Sackett and Senn refer to benign rather than malicious unmasking. While these arguments are relevant to medical drug interventions like the RCTs of aspirin alluded to by Sackett, they are less relevant to psychedelic drugs with their clear psychoactive effects. Notably, Sackett does not suggest that de-masking effects be ignored; rather he suggests that specific confounders that might cause de-masking are measured and evaluated in a controlled fashion.
However, Sackett's approach relies on there being no unknown confounders and on accurate estimation of measured confounders, which is not realistic, especially for psychiatry where underlying mechanisms are not completely understood. As Howick notes, there is no reason that a trial cannot combine both approaches by measuring potential confounders and the success of masking.
There is no standard approach for assessing masking efficacy. One method is to allow participants to guess which arm of the trial they are in as a categorical response variable: "drug", "placebo" or "don't know". Another variation is to use a five-point Likert scale including a "don't know" option, which adds a degree of uncertainty. For categorical responses, answers can be summarised in a contingency table (Table). Two proposed masking indexes to estimate the overall effectiveness of masking from these metrics, with extensions for Likert-scale data, are the James Blinding Index (BI James) and the Bang Blinding Index (BI Bang). The James Blinding Index is a weighted function of the cell proportions in this table. Weights ($w_{ij}$) are assigned such that 0 is assigned for correct guesses, 0.75 for incorrect guesses and 1 for "don't know" responses. BI James can range between 0 and 1. When all responses are correct, and masking has failed, BI James = 0, and when all responses are "don't know", BI James = 1, indicating completely successful masking. BI James = 0.5 would indicate random answers. In practice, if the confidence interval of the estimate contains values < 0.5 then masking is said to have failed for the trial. Bang et al. introduced an alternative Blinding Index (BI Bang) which allows masking to be assessed separately for each arm of a trial. It is defined in terms of $r_{i|i} = P_{i|i} / (P_{1|i} + P_{2|i})$, the proportion of correct guesses among participants in arm $i$ who ventured a guess. Each arm of the trial yields an estimate of BI Bang with a range of $-1 \le$ BI Bang $\le 1$.
A BI Bang value of 0 indicates chance responding (masking), while a positive value indicates unmasking and a negative value incorrect guessing (a variance estimator is also available). For psychedelic trials it seems intuitive that masking may not be symmetrical between the trial arms, making BI Bang a more appropriate index to use. However, it may be desirable to better capture the degree of uncertainty in each participant rather than relying on a single three-level categorical variable summarised across the trial as a whole. A simple visual analogue scale (VAS) with end anchors at placebo and active and a middle anchor at "don't know" might be useful as a covariate in the analysis of individual outcomes.
When assessing de-masking, an important consideration is when after the intervention to perform this assessment. While it is perhaps intuitive to assess masking at the end of the trial, if some unmasking reflects speculations about efficacy then masking assessment performed earlier may be more appropriate. We would suggest that masking assessment could be performed relatively soon after a dose of psychedelic but before assessment of outcomes and before participants begin to self-perceive potential treatment effects. In that way the assessment would focus on malicious confounding effects rather than estimation of efficacy. One could potentially assess masking at multiple time-points to assess the trajectory of de-masking. For example, in an RCT examining water treatment units in houses, Rees et al. measured participants' guesses as to group allocation sequentially through the trial. They found that 31% of participants changed their belief about allocation over the duration of the trial, an effect that would be missed by gathering a single time-point measurement of belief allocation. One concern with repeated measurements is that drawing attention to the issue may create a response bias. Another important issue is how de-masking is operationalised practically.
Although responses can be gathered verbally, social desirability biases and the risk of experimenters covertly cueing participants' beliefs provide justification for computerised assessment.
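To make the per-arm index concrete, the sketch below computes BI Bang for each arm from guess counts. The per-arm index is taken here as $(2 r_{i|i} - 1)(P_{1|i} + P_{2|i})$, which reduces to the proportion of correct guesses minus the proportion of incorrect guesses. The counts are hypothetical, the function name `bang_bi` is illustrative, and the standard error uses the ordinary multinomial variance of that difference, which is an assumption rather than the estimator the authors cite.

```python
from math import sqrt


def bang_bi(n_correct: int, n_incorrect: int, n_dont_know: int) -> tuple[float, float]:
    """Return (BI, standard error) for one trial arm from guess counts."""
    n = n_correct + n_incorrect + n_dont_know
    if n == 0:
        raise ValueError("arm has no responses")
    p_correct = n_correct / n      # P_{i|i}: guessed own arm
    p_incorrect = n_incorrect / n  # guessed the other arm
    guessed = p_correct + p_incorrect
    # r_{i|i}: correct guesses among those who ventured a guess.
    r = p_correct / guessed if guessed > 0 else 0.5
    bi = (2 * r - 1) * guessed
    # Assumption: multinomial variance of (p_correct - p_incorrect).
    variance = (
        p_correct * (1 - p_correct)
        + p_incorrect * (1 - p_incorrect)
        + 2 * p_correct * p_incorrect
    ) / n
    return bi, sqrt(variance)


# Hypothetical counts (correct, incorrect, don't know) for illustration only.
arms = {"active": (18, 1, 1), "placebo": (9, 7, 4)}
results = {name: bang_bi(*counts) for name, counts in arms.items()}

for name, (bi, se) in results.items():
    print(f"{name}: BI = {bi:.2f}, SE = {se:.2f}")
```

With these made-up counts the active arm sits near 1 (largely unmasked) while the placebo arm sits near 0 (close to chance), illustrating the asymmetry between arms that a single trial-wide index would hide.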
REPORTING OF MASKING AND EXPECTANCY IN PSYCHEDELIC RCTS
In order to understand the state of the literature regarding the measurement of expectancy, de-masking and alliance in psychedelic trials, we conducted several systematic reviews. Firstly, we examined placebo-controlled RCTs of ketamine for the treatment of major depressive disorder, since a relatively large number of trials exist in this area and there are recent systematic reviews of this topic. Secondly, we searched for trials of classical serotonergic psychedelics. Given there were relatively few trials, we conducted one search for RCTs and one for open-label/other designs. Details of the search strategies used and PRISMA flowcharts are given in Supplementary Materials. Tables 2-4 reveal that of the 43 included studies none had measured pre-treatment expectancy of participants. For ketamine RCTs of depression (Table), five trials measured masking, with three providing quantitative data, although none used existing masking indexes as described in Section Assessment of Masking Success in Clinical Trials. Of these, only one reported maintenance of masking. For other psychedelic RCTs (Table), three out of six trials reported on participant masking, with none demonstrating successful masking. Two of the remaining trials reported rater masking, with only one reporting a degree of success, assessed by whether raters could guess trial design. None of the 43 trials considered described any assessment of patient-therapist alliance. Fourteen trials reported more than one baseline assessment (typically obtained during participant screening), which would allow an analysis of disease trajectory. Interestingly, the majority of trials did discuss that their analyses were limited by the potential for de-masking. Some of these more qualitative concerns are detailed in Supplementary Materials. In summary, these systematic reviews demonstrate that existing psychedelic RCTs have failed to measure the effects of participant expectancy and therapeutic alliance.
Furthermore, only a minority of studies reported a masking assessment, with no attempts at a formal statistical analysis, and all but one reporting masking failure.
THE PATIENT TRAJECTORY THROUGH AN RCT
Before considering potential solutions to masking, expectancy and alliance issues in psychedelic RCTs, we first review the trajectory of a patient through an RCT and where biases can occur during this process. Identifying these sources of bias is critical before proposing potential remedies. Firstly, it is important to reflect on the fact that the experimental "units" in RCTs do not live in an information vacuum but instead are exposed to a significant amount of knowledge which could potentially shape their response expectancy, their therapeutic alliance, and affect the success of their masking. Specifically, in the context of psychedelics, the popular media has extensively, and generally positively, covered the potential of these drugs to treat mental health conditions. While this positive media coverage is good for the field of psychedelic research after decades of de-facto prohibition, it is also likely to amplify response expectancy in patients who enrol in a psychedelic RCT. The pathway of participants through a clinical trial is illustrated in Figure. The first step, trial advertisement, causes an initial sampling bias in terms of which participants from the general eligible population are reached. For trials which operate through clinician referrals, the referring clinician (who may not be involved in the study) may create biases in their selection of patients to refer. The decision of the participant to enquire further about participating in the trial will involve a self-selection bias. The shaping of response expectancy potentially begins at even this earliest stage of an RCT. What information is provided in the RCT advertisement? Do participants know the trial will involve a psychedelic intervention and for what purpose? Our ethics committees have generally encouraged us to use neutral terms in advertising materials such as "new potential antidepressant" to reduce the risk of participants seeking access to controlled substances.
Participants who are unwell enough that they seek to take part in an RCT with a relatively extreme, novel intervention may be at their most unwell in their disease time-course. This creates the potential for regression to the mean to occur over the course of a participant's trial enrolment. Usually, an RCT will involve a triage step involving some pre-screening questions by email, phone or questionnaire to assess eligibility. This may be handled by a relatively junior member of a research team and may create another potential source of experimenter bias. Provision of potential participants with Participant Information Sheets (PIS) and Consent Forms (CF) provides patients with extensive information regarding the interventions and trial contingencies. In order to meet the requirements of informed consent, Ethics Committees generally require that PIS/CFs provide information regarding the benefits and side effects of medicines. In our experience Ethics Committees generally request a description of some of the phenomenological experiences that participants may encounter during the administration of a psychedelic drug. Provision of such information is hard to escape if consent is to be obtained ethically, but it is important to acknowledge that it may reduce masking even in an active-placebo-controlled trial. Further, given that participants will know from the PIS what drugs they could experience, they may easily get more information about the effects of the interventions from the internet to supplement that provided by the research team. In our experience, participants doing their own research regarding drug interventions is common. As we have seen, participant knowledge of experimental design, such as the number of arms in the trial and the probability (allocation ratio) that they will receive placebo, can shape participant expectancy and resultant treatment outcomes.
It is important that what participants are told about the trial is clearly reported in the methods sections of RCTs. However, the environment and cultural context of pharmaceutical trials are rarely described in RCT reports, and these are likely to have a large impact on both treatment and placebo responses. In a noteworthy example, in the trial protocol of Ballou et al., who investigated open-label placebo responses for irritable bowel syndrome, a section was dedicated to "The Culture of the Trial" which contextualised the trial in terms of setting and how participants were interacted with by the research team. For replicability purposes we would suggest this practice should be adopted in psychedelic clinical trial reporting, as well as detailing how experimental contingencies are communicated to participants. Providing advertising materials and PIS/CFs as supplementary materials would be useful in this regard. The next step in the research process would typically be in-person screening, and at this stage patient-therapist alliances can begin to form. From here and during the treatment phase, rater biases as well as participant biases including Hawthorne effects and social desirability can emerge. Post-randomisation attrition bias (dealt with using Intention to Treat analyses) and carryover effects in crossover trials can also be present. Finally, we note that throughout the entire research process researcher biases can easily emerge. The scientists engaged in psychedelic research are self-selected and have their own personal motivations to conduct research in this area. Most psychedelic RCTs reported to date (Section Reporting of Masking and Expectancy in Psychedelic RCTs) are relatively small Phase 2 trials conducted in academic settings, whereas large-scale Phase 3 pharmaceutical trials are generally conducted at Contract Research Organisations removed in operational terms from the trial sponsors.
Both types of trials have their own potential sources of bias based on the motivations (academic vs commercial) to conduct the trial. One excellent remedy for these biases is to publish registered reports, with peer review including statistical analysis plans, prior to the commencement of the psychedelic RCT, not simply clinical trial registrations which often provide minimal information. Despite more than 150 journals [125] now offering [...]
[...] ATE in trials of ketamine for MDD is overestimated. It could be argued that midazolam is not a perfect active comparator from a theoretical perspective as it is a known anxiolytic, and anxiety is one of the sub-scales in commonly used outcome measures such as MADRS and HAM-D. That said, midazolam's anxiolytic effect may be relatively short compared to the potential antidepressant effects of ketamine. For serotonergic psychedelics several active placebos have been used, although none appear to have been successful in maintaining masking. Table shows that one choice of active placebo that has been used is niacin. The use of niacin has historical roots in Pahnke's "Marsh-Chapel" experiment from 1962, where it was used as an active comparator for psilocybin. However, although niacin does create some warm flushing due to vasodilation, its use as an active placebo did not maintain masking in any of Pahnke's participants. Amphetamines have also previously been used as active comparators for LSD in the first wave of psychedelic research. Another potential approach is to use low doses of the drug being studied. By themselves, active placebos may be insufficient to successfully maintain masking. We argue that active placebos may need to be used in combination with alternative trial designs alongside some mild deception or vagueness in the information about trial contingencies provided to participants. This would require careful discussion with Ethics Committees but may be permissible given careful scientific justification and debriefing strategies.
In some scenarios, in order to meet equipoise principles [129], patients who do not receive active treatment can be offered entry into an open-label extension arm. We next consider some clinical trial designs that have been, or could be, used in psychedelic RCTs and comment on their appropriateness in this context. Designs are displayed graphically in Figure.
PLACEBO LEAD-IN PERIODS - NOT RECOMMENDED
A trial design that has been used in trials of standard antidepressants is the placebo lead-in design (Figure), which can be either single- or double-masked. In this design, all participants are initially treated with placebo. Placebo responders are removed from the trial (or moved into an open-label extension), while non-responders are randomised into the main phase of the RCT. In studies of standard antidepressants the use of participant-masked lead-in periods does not appear to modify the number of placebo responders or the size of the placebo response [131], although double-masked lead-ins with variable lead-in periods may be more effective. The placebo lead-in trial is built around the theoretical position that there are individuals who have a trait characteristic of being placebo responders, although the evidence for this is relatively weak. In the context of psychedelic RCTs and the theoretical framework we propose, placebo lead-in periods might be entirely counter-productive. Placebo responders are the participants most likely to believe they have been given the active treatment and by Equation 14 are the very ones that we would like to retain for the control group ($B_c = 1$). Conversely, placebo non-responders may have deduced that the placebo is in fact placebo ($B_c = -1$). Hence, this would end up maximising the difference between $B_c$ and $B_t$, which is the opposite of the desired belief situation.
SEQUENTIAL PARALLEL COMPARISON DESIGN (SPCD) - NOT RECOMMENDED
The SPCD is somewhat similar to using a placebo lead-in period. The SPCD RCT (Figure) consists of two intervention phases. Participants are allocated to either drug-drug, placebo-drug or placebo-placebo groups. Responders are removed after Phase 1, meaning that approximately half of placebo non-responders will then be switched to the active condition. Similar to the placebo lead-in period, this is thought to reduce the placebo response, but as argued for the placebo lead-in trial, the same issue with unequal belief attribution across groups arises. Moreover, psychedelic RCTs are mostly "one-shot" interventions rather than the long medication trials for which the SPCD is designed. There may, however, be a role for the SPCD RCT in investigating the therapeutic effects of psychedelic microdosing.
CROSSOVER TRIALS - NOT RECOMMENDED
In our view, crossover trials such as the 2 x 2 crossover (Figure) are not ideal for psychedelic RCTs where treatment response is the primary outcome being investigated. Not only is there potential for carryover in the primary outcome measure; just as problematically, participants' beliefs about which intervention they have received could change as they move through trial periods, and even within the first washout phase, as intuitions around efficacy develop. Moreover, such a change could depend on which sequence arm they are allocated to. For brevity, we do not provide a full exposition of this, which would require a lengthy introduction to the statistical approach used for crossover designs. However, the complex interactions between carryover, belief allocation, efficacy, arms and sequence are likely to counteract any statistical efficiencies derived from the repeated-measures nature of crossover trials.
DELAYED TREATMENT DESIGN - NOT RECOMMENDED
In the delayed treatment design (Figure), all participants are told they will receive either drug or placebo, but the timing of the active treatment is not disclosed. Some participants receive active treatment immediately while the rest receive placebo. After a delay, the active intervention is introduced into the placebo group. For this design to work in a psychedelic RCT, an active comparator would need to be used to achieve some kind of belief allocation in the delayed treatment group. However, it would be very difficult in practice to balance the two groups in terms of number of interventions and visits, and with an active comparator the design turns into an incomplete crossover design. A delayed treatment design (with no placebo intervention) has previously been used in an RCT of psilocybin for depression. While this design does allow excellent characterisation of pre-treatment symptom stability in the delayed treatment group, it does not address masking issues.
PARALLEL GROUPS DESIGN WITH ACTIVE COMPARATOR
A standard parallel groups design with an active comparator (Figure) may still be appropriate provided that the active comparator has convincing enough psychedelic properties that a reasonable patient given placebo would believe they had been allocated to the treatment. For example, take a trial where the treatment is ketamine and midazolam is the active placebo. Pre-testing might indicate that all participants in the ketamine group correctly guess their treatment arm, while half of those given midazolam believe they had been administered ketamine. In the RCT, participants could be allocated in a 1:2 ratio favouring placebo, although it is not necessary to state the allocation ratio to participants; indeed there may be advantages to mild deception for this factor. As per Section Assessment of Masking Success in Clinical Trials, soon after the intervention, but prior to outcome assessment, belief about the allocated intervention (B) can be measured. Only those participants in the control group who believe they are in the active group ($B_c = 1$) will be included in the statistical comparison to the active group in order to assess ATE (Equation). As noted above, sophisticated approaches are possible where there is a greater uncertainty in B, as long as the uncertainty is not informative about actual treatment. One potential issue with this design is that, because the stratification on B is performed post-randomisation, one can no longer assume that pre-trial covariates are balanced.
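The allocation logic in the ketamine and midazolam example can be illustrated with a minimal Python sketch. It assumes, as in the hypothetical pre-testing scenario, that every ketamine participant believes they received ketamine and that half of the midazolam participants do so. The function names, sample size and outcome values are invented for illustration and are not taken from any trial.

```python
from statistics import mean

def expected_believers(n_total, ratio_active, ratio_control,
                       p_believe_active, p_believe_control):
    """Expected number per arm who believe they received the active drug."""
    n_active = n_total * ratio_active / (ratio_active + ratio_control)
    n_control = n_total - n_active
    return n_active * p_believe_active, n_control * p_believe_control

def belief_matched_difference(records):
    """Difference in mean outcome between arms, restricted to B = 1.

    records: list of (arm, belief, outcome) tuples, with arm in
    {"active", "control"} and belief coded 1 (believes active) or -1.
    """
    active = [y for arm, b, y in records if arm == "active" and b == 1]
    control = [y for arm, b, y in records if arm == "control" and b == 1]
    return mean(active) - mean(control)

# 90 participants allocated 1:2 gives 30 active and 60 control;
# with belief rates of 1.0 and 0.5 this yields 30 believers per arm.
believers = expected_believers(90, 1, 2, 1.0, 0.5)

# Hypothetical improvement scores (higher = more improvement).
toy = [
    ("active", 1, 12.0), ("active", 1, 10.0),
    ("control", 1, 7.0), ("control", 1, 5.0),
    ("control", -1, 2.0), ("control", -1, 1.0),  # excluded: B = -1
]
diff = belief_matched_difference(toy)
```

Under these assumptions the 1:2 ratio is what equalises the expected number of believers across arms. The sketch does not address the caveat raised in the text: because selection on B happens after randomisation, the compared subgroups may differ on pre-trial covariates.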
DOSE-RESPONSE PARALLEL GROUPS DESIGN
In some implementations this can be conceptually similar to the parallel groups design described above, where the active comparator is a low dose of the treatment. For example, the trial by Gasser et al. investigating LSD for the treatment of anxiety used 200 μg as the active dose and 20 μg as an active comparator, although in that trial masking was not maintained. The dose-response approach (Figure) depends on the validity of the assumption that the low dose is not in fact therapeutic. Given the increasing interest in microdosing of psychedelics for the treatment of mental health conditions, this may be of concern. As a cautionary tale, consider the open-label trial of psilocybin for depression conducted by Carhart-Harris et al. After the first "small" 10 mg dose of psilocybin, HAM-D scores dropped from 21.4 to 10.7, and after the full dose (25 mg) these dropped to 7.4. That is, 76% of the therapeutic effect on the HAM-D was elicited by the small dose. If a dose-response RCT were perfectly masked, then ATE might be underestimated if the low dose is eliciting a treatment effect. Nevertheless, given that we have argued all estimates of ATE in psychedelic trials to date are overestimated, an underestimated yet statistically significant ATE would be convincing evidence. One potential feature that we have not seen implemented for the dose-response design is that it may be possible, by deception or omission, to not inform participants that the trial in fact contains two or more dose level groups. Obviously, this would require extra discussion with Ethics Committees to implement. It is noteworthy that the parallel group design with a different drug comparator would generally require, for ethical reasons, that participants be informed of the different drugs that they may receive. However, not informing a participant that they may in fact receive a lower dose may be permitted. An open-label extension may ameliorate issues related to clinical equipoise.
There would need to be great care in the debrief given to participants (particularly making sure that future participants are not informed of the hidden trial contingencies) to avoid violating the SUTVA assumption. Probing participants' "belief" around allocation would also require care, as participants would not be aware that there were different arms in the trial.
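The 76% figure quoted for the open-label psilocybin trial follows directly from the three HAM-D scores reported in the text. The short Python check below reproduces the arithmetic; the function name is ours and purely illustrative.

```python
def share_of_total_change(baseline, after_low_dose, after_full_dose):
    """Fraction of the total score reduction already seen after the low dose."""
    low_dose_change = baseline - after_low_dose   # 21.4 - 10.7
    total_change = baseline - after_full_dose     # 21.4 - 7.4
    return low_dose_change / total_change

share = share_of_total_change(21.4, 10.7, 7.4)
print(f"{share:.0%}")  # prints 76%
```

Because the trial was open-label with no control arm, this ratio describes the observed change in scores only; it cannot separate the pharmacological effect of the low dose from expectancy or other non-specific effects.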
PRE-TREATMENT PARALLEL GROUPS DESIGN
This design is similar to the parallel groups design, but here active treatment is preceded by a pharmacological antagonist (Figure). For example, the opioid receptor antagonist naltrexone has been found to block the antidepressant effects of ketamine [137], while lamotrigine blocks some of ketamine's psychedelic properties. For classical psychedelics, the 5-HT2A receptor antagonist ketanserin has been shown to block the psychological effects of both LSD [140] and psilocybin in a dose-dependent manner, with 20 mg ketanserin leading to partial blockade and 40 mg to near-complete blockade. In a recent dose-response crossover design study of LSD in healthy volunteers by Holze et al., participants were given LSD doses of 25 µg, 50 µg, 100 µg or 200 µg, or 200 µg with ketanserin (40 mg). Participants were asked to retrospectively identify these conditions after each session and at the end of the study. Accuracy was generally high, indicating de-masking, although notably the +ketanserin condition was most commonly mistaken for 50/25 µg of LSD. For a potential therapeutic RCT using pre-treatment, a design similar to the dose-response design could be used, with dose modification of the pre-treatment rather than the active drug. In that way all participants would receive the full dose but not all would receive its full effect. We note that while participants may need to be informed that a pre-treatment is being used, it may be ethically permissible to not disclose its experimental purpose. For example, ketanserin could be accurately described as "controlling blood pressure" during the intervention while concealing its real purpose: to limit the psychedelic experience. One advantage of the pre-treatment design is that it may allow for the examination of the impact of specific receptors on ATE generation. For example, in the case of LSD and psilocybin, ketanserin allows specific neurobiological inferences around activity at the 5-HT2A receptor.
BALANCED PLACEBO FACTORIAL DESIGN
The balanced placebo design is a 2 x 2 factorial design (Figure) with one factor of intervention (given drug | given placebo) and one factor of instructions, that is, what drug participants are told they will receive (told drug | told placebo). The four conditions are therefore a) the participant is given the treatment and is told they are receiving the treatment, b) the participant is given the treatment and is told they are receiving placebo, c) the participant is given the placebo and is told they are receiving placebo and d) the participant is given the placebo and is told they are receiving the treatment. Conditions b) and d) involve explicit use of deception. The balanced placebo design was devised to examine interactions between drug effects and non-specific effects such as expectancy. In the context of psychedelic RCTs, if coupled with an active comparator, the use of deceptive (or ambiguous) instructions may help to manipulate the participants' beliefs around which intervention they have received, when asked after the intervention and before outcome assessment. In particular, for group b) it may allow $\bar{B}_t < 1$, which might be helpful if the active comparator alone is insufficient to create $\bar{B}_c = 1$. In analysis, participants could be selected to allow $B_t = B_c$. The design itself is somewhat inefficient in the context of psychedelic RCTs as group c) is not particularly useful. That said, it is not necessary that a 1:1:1:1 allocation ratio be used. With regard to ethical issues around the use of deception, authorised deception can potentially be used where participants are informed that incomplete information is being provided. Open-label extensions could also lessen the ethical issues. One downside of factorial designs is that the extra conditions mean considerable numbers of patients would need to be recruited. Adequately powering such a study may be beyond the resources available to a single study site.
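To make the structure of the four conditions concrete, the sketch below computes the conventional 2 x 2 factorial contrasts (drug main effect, instruction main effect and their interaction) from cell means. This is the textbook factorial decomposition rather than an analysis specified in the paper, and the cell means are hypothetical numbers chosen only so the arithmetic is easy to follow.

```python
def factorial_contrasts(cell_means):
    """Main effects and interaction from the four cell means of a 2x2 design.

    cell_means maps (given, told) -> mean outcome, where each factor is
    coded "drug" or "placebo".
    """
    gd_td = cell_means[("drug", "drug")]        # condition a
    gd_tp = cell_means[("drug", "placebo")]     # condition b (deception)
    gp_tp = cell_means[("placebo", "placebo")]  # condition c
    gp_td = cell_means[("placebo", "drug")]     # condition d (deception)

    # Effect of what was given, averaged over instructions.
    drug_effect = (gd_td + gd_tp) / 2 - (gp_td + gp_tp) / 2
    # Effect of what was told, averaged over what was given.
    instruction_effect = (gd_td + gp_td) / 2 - (gd_tp + gp_tp) / 2
    # How much the drug effect differs between told-drug and told-placebo.
    interaction = (gd_td - gp_td) - (gd_tp - gp_tp)
    return drug_effect, instruction_effect, interaction

# Hypothetical mean improvement scores per cell.
hypothetical = {
    ("drug", "drug"): 10.0,
    ("drug", "placebo"): 6.0,
    ("placebo", "placebo"): 1.0,
    ("placebo", "drug"): 4.0,
}
drug, told, inter = factorial_contrasts(hypothetical)
```

With these invented values the instruction contrast is of similar size to the drug contrast, which is the kind of pattern the design is intended to reveal. With unequal allocation ratios, as the text permits, the cell means are unaffected but their precision differs across cells.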
ENRICHMENT FACTORIAL DESIGN
A somewhat similar design to the balanced placebo design is the 2 x 2 enrichment design (Figure) proposed by Carhart-Harris et al. In this factorial design, one factor is the intervention (drug | placebo), and one factor is environmental enrichment (enriched | unenriched). This is an excellent design for examining which environmental factors might be used to enhance the effectiveness of psychedelic therapy. Taken alone, though, issues of masking still remain. If combined with an active comparator, it may be possible to achieve $B_t = B_c$, and the design could then be a more elaborate version of the parallel groups with active comparator design, the dose-response design or the pre-treatment parallel groups design as described. As with the balanced placebo design, adequately powering such a study may be challenging.
STATISTICAL MODELS AND ASSUMPTIONS
In this section we consider a sequence of progressively more realistic scenarios and the analyses they imply. It will typically not be possible to estimate causal parameters precisely by relying simply on randomisation, but it is reasonable to expect the combination of randomisation and modelling to lead to reduced bias.
1. If Equation 14 is satisfied for all participants, meaning that blinding is achieved by experimental design, the simple difference in means estimates an average treatment effect and the standard analysis is valid. When B is measured it will still often be of interest to estimate ATE(b) and examine the extent to which the treatment effect varies according to which treatment the participants believe they have received. Simplifying to a linear model (for exposition), the observed data can be modelled as $E[Y \mid S, B] = \delta S + \alpha B + \beta B S$, and ATE(b) estimated as $\delta + 2\beta b$.
2. If Equation 14 is satisfied only for a subset, confounding has been induced by conditioning on the observed value of B, which is affected by randomisation. Specifically, typical participants will have $B_t = 1$, and so $B = 1$ in the treatment group, but only the more easily influenced participants will have $B_c = 1$, and so $B = 1$ in the control group. We have only the same statistical model to estimate ATE(b), but a causal interpretation is no longer guaranteed simply by randomisation. It is still possible to average ATE(b) over different values of b, but weights for averaging must be chosen based on the distribution of B in the treatment or control group or a combination of the two.
3. To the extent that $B_c$ for an individual is dependent on factors such as expectancy and suggestibility, it will be possible to replace the conditioning on the post-randomisation B with conditioning on pre-randomisation measurement of these factors (E).
A linear estimation model would then be: This model conditions on post-randomisation B, which will typically induce bias, but the association between Y and B will be weaker in this model than without E.
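The averaging step described in scenario 2 can be sketched in Python as follows. This is a minimal illustration, not the authors' analysis: it estimates ATE(b) non-parametrically as a within-stratum difference in means rather than through the linear model, it assumes S is coded 1 (treatment) or 0 (control) and B is coded 1 or -1, and the function names and data are hypothetical.

```python
from collections import Counter
from statistics import mean

def stratum_ate(records, b):
    """Treated-minus-control difference in mean outcome among those with B = b.

    Requires at least one treated and one control record in the stratum.
    """
    treated = [y for s, belief, y in records if s == 1 and belief == b]
    control = [y for s, belief, y in records if s == 0 and belief == b]
    return mean(treated) - mean(control)

def averaged_ate(records, weight_from="treated"):
    """Average ATE(b) over b, weighted by the distribution of B in one group.

    weight_from: "treated", "control", or "pooled" (both groups combined).
    """
    if weight_from == "treated":
        beliefs = [b for s, b, _ in records if s == 1]
    elif weight_from == "control":
        beliefs = [b for s, b, _ in records if s == 0]
    else:
        beliefs = [b for _, b, _ in records]
    counts = Counter(beliefs)
    total = sum(counts.values())
    return sum(counts[b] / total * stratum_ate(records, b) for b in counts)

# Hypothetical records: (S, B, Y)
toy = [
    (1, 1, 10.0), (1, 1, 12.0), (1, 1, 11.0), (1, -1, 6.0),
    (0, 1, 7.0), (0, 1, 5.0), (0, -1, 2.0), (0, -1, 4.0),
]
by_treated = averaged_ate(toy, "treated")
by_control = averaged_ate(toy, "control")
by_pooled = averaged_ate(toy, "pooled")
```

In this toy data the stratum estimates are 5.0 for B = 1 and 3.0 for B = -1, so the averaged value shifts with the choice of weights. As the text cautions, none of these averages carries a guaranteed causal interpretation once B has been affected by randomisation.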
CONCLUSIONS
In summary, this paper has attempted to outline the fundamental logic for how RCTs are used to establish causal relationships between treatments and outcomes using the formalism of the Rubin causal model, and how standard RCT designs may be inappropriate in trials of psychedelic treatments. During these trials, a number of factors potentially allow participants to experience strong expectancy regarding intervention effects. When these strong expectancy effects are coupled with de-masking, the potential for treatment effects (ATE) to be over-estimated in psychedelic RCTs is substantially heightened. To counter this problem, we have suggested several approaches that could improve the evidence for causation. In particular, psychedelic RCTs should include measurement of expectancy using tools such as the CEQ or SETS and assessment of masking following psychedelic interventions using standard questions and indexes, and, where relevant, patient-therapist alliance should be measured throughout the trial. Out of the nine RCT designs considered, five showed promise for use (some with modification) in psychedelic trials. Special care should be given to the instructions and information provided to participants, which can shape their expectancy and perception of masking during the trial. We suggest augmenting current reporting standards to include factors such as the culture/context of the trial, along with publishing PIS/advertising materials and any other instructions that are provided to participants. Finally, more psychedelic RCTs should employ full pre-publication of trial protocols. This paper has limited its scope to the role that RCTs might play in establishing a causal role for psychedelics in treating psychiatric diseases. However, it is important to acknowledge that other forms of evidence are used in medicine to establish causation.
These could be considered in future work; for example, the application of the Bradford-Hill criteria or the potential for surrogate biomarkers (such as blood or neuroimaging biomarkers) to be used in order to strengthen evidence for causality. Indeed, one shortcoming of relying solely on clinical trials and ATE estimation from a scientific perspective is the lack of any consideration for mechanistic credibility. The philosophers write "We suppose that causes do not produce their effects by accident, at least not if you are to be able to make reliable predictions about what will happen if you intervene. Rather, if a cause produces an effect, it does so because there is a reliable, systematic connection between the two, a connection that is described in a causal principle" (Ch2, pg 8). Psychedelic science is making rapid advances in understanding the mechanistic basis by which psychedelics exert effects at both the molecular and systems levels. In the future, integrating mechanistic information into purely statistical considerations of treatment effects may help to establish a firm scientific foundation for if, and how, psychedelics exert therapeutic action.
Study Details
- Study Type: meta
- Population: humans
- Characteristics: literature review