Before and After Evaluation
What is before and after evaluation?
A before and after evaluation compares a group of participants' summary scores on an outcome measure, collected before and after they have used or taken part in a service. This usually involves using descriptive statistics (such as a mean or median value) to summarise scores for the whole group of participants.
Why use before and after evaluation?
Whilst an implementation evaluation focuses on how an intervention was delivered (e.g., fidelity, reach), a before and after evaluation describes participants' outcomes. A before and after evaluation can tell you:
- Whether your selected outcome measures are acceptable and appropriate for further evaluation, such as whether participants complete all the items on your selected measures, and whether there is some variation in participants' scores on the measures.
- Whether the intervention has possible evidence of promise. This is evidence that suggests that an intervention may plausibly be working effectively to improve outcomes.
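As a rough illustration of the first check, the sketch below counts how often each item was answered and looks at the spread of raw totals. The participant IDs, the four-item questionnaire and the responses are all made up for illustration, and the raw totals here ignore any scoring rules your chosen measure may have.

```python
# Hypothetical item responses for a 4-item questionnaire.
# None marks an item the participant left blank.
responses = {
    "P01": [3, 0, 2, 1],
    "P02": [1, None, 2, 3],
    "P03": [2, 2, 2, 2],
    "P04": [0, 1, None, None],
}
n_items = 4

# Proportion of participants who answered each item.
item_completion = [
    sum(r[i] is not None for r in responses.values()) / len(responses)
    for i in range(n_items)
]

# Participants who completed every item.
complete = [pid for pid, r in responses.items() if None not in r]

# Spread of raw totals among complete responders: if everyone has the
# same total, the measure may not be picking up any variation.
totals = [sum(responses[pid]) for pid in complete]
spread = (min(totals), max(totals))

print("Completion rate per item:", item_completion)
print("Participants with no missing items:", complete)
print("Lowest and highest raw total:", spread)
```

Low completion on a particular item, or totals that barely differ between participants, may be a sign that the measure needs a closer look before further evaluation.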
NOTE: A before and after evaluation cannot tell you if the intervention is having a causal positive impact on participants' outcomes.
This is because it is impossible to know what would have happened to participants without the intervention taking place. If you see an improvement in scores, it is possible that average scores would have improved naturally over time even without the intervention. If scores stay the same, it is possible that they would have worsened without the intervention. To truly know if an intervention is having a causal positive impact, an effectiveness evaluation with a control group is needed.
How to conduct before and after evaluation in-house
The first step for a before and after evaluation is to select a validated measure of the outcome that you are attempting to target with your service (see video ‘Evaluation Questions and Outcomes’ from 15mins for more on valid outcome measures). A measure of an outcome usually has multiple items (questions) regarding a particular construct, and scores are usually added together from these items to create a summary score.
You could use a measures database to find your outcome measure. Some examples of measures databases are the Education Endowment Foundation’s Early Years measures database and the Child Outcomes Research Consortium. If you are working with an external researcher, they may be able to support you to select an outcome measure.
Although it may be tempting to create your own outcome measure to administer to participants, we do not advise this, as the development of valid and reliable outcome measures requires knowledge and expertise in psychometric theory and statistics. Outcome measures should also not be significantly adapted or changed, such as dropping or rewording questions, as this will compromise the meaning of those outcome measures.
Once you have administered your outcome measures to participants before and after the service, you can create summary scores from the questionnaire for each participant. You should follow the scoring rules provided by the outcome measure's developers or in its manual, paying attention to any reverse-scored items, and carefully considering how you handle missing data on items in the questionnaires. This work should be done using appropriate software and without changing the source data, as described in interpreting and analysing basic data, so that you can check your original data against the new summary scores that you have created.
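The scoring step might look something like the sketch below. Everything in it is an assumption for illustration: the four-item questionnaire scored 0 to 3, which items are reverse-scored, and the rule of allowing one missing item and prorating the total. The real rules should come from your measure's developers or manual.

```python
from typing import Optional, Sequence

# Hypothetical questionnaire: 4 items, each scored 0-3.
ITEM_MIN, ITEM_MAX = 0, 3
REVERSE_SCORED = {1, 3}  # zero-based positions of reverse-scored items (assumed)
MAX_MISSING = 1          # assumed rule: allow at most one missing item


def summary_score(responses: Sequence[Optional[int]]) -> Optional[float]:
    """Return one participant's summary score, or None if too many items are missing."""
    scored = []
    for position, value in enumerate(responses):
        if value is None:
            continue  # missing item: leave it out rather than guess a value
        if position in REVERSE_SCORED:
            value = ITEM_MAX + ITEM_MIN - value  # flip the scale, e.g. 0 <-> 3
        scored.append(value)

    n_missing = len(responses) - len(scored)
    if n_missing > MAX_MISSING or not scored:
        return None

    # Assumed proration rule: mean of answered items scaled up to all items.
    return sum(scored) / len(scored) * len(responses)


# Source data stays untouched; summary scores go into a new structure.
raw = {
    "P01": [3, 0, 2, 1],
    "P02": [1, None, 2, 3],
    "P03": [None, None, 0, 1],
}
summary = {pid: summary_score(items) for pid, items in raw.items()}
print(summary)
```

Keeping `raw` separate from `summary` means the original responses can always be checked against the calculated scores.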
You can then summarise scores for your group of participants before and after the service by using descriptive statistics. We suggest the mean as a summary statistic for many measures, but you may wish to consider whether the mode or median score is more appropriate depending on your selected outcome. You can summarise any difference in those scores by taking the group's ‘after’ score away from its ‘before’ score. The example below shows where we have summarised scores in two measures relating to mental health, before and after participation in a service.
The example above describes the result as being statistically significant. A test of statistical significance, usually derived by estimating a statistical model of your data, gives you a p-value, which is the probability of observing the sample data, or more extreme data, assuming the null hypothesis is true. The null hypothesis is a model of the data that is expected if there is no effect. This test of statistical significance is not necessary for a before and after evaluation. We strongly suggest involving an external researcher if you would like to conduct any statistical significance tests or models of your data, as deriving and interpreting them requires statistical knowledge and expertise.
The example also shows how comparison to clinically meaningful differences can be helpful for contextualising results. The example finds differences of 1.3 and 1.1 points in our measures, and we compare these to the clinically meaningful difference for these measures, which previous research has shown to be 2 points. Both observed differences are smaller than this threshold. Clinically meaningful differences will not exist for all outcome measures, but they can provide useful context when available.
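As a minimal sketch of these last two steps, the code below summarises a group's scores before and after a service and compares the mean difference with a clinically meaningful difference. The participant IDs and scores are made up, the measure is assumed to be one where lower scores are better, and the 2-point threshold simply mirrors the figure quoted above. Your own measure may have a different threshold or none at all.

```python
from statistics import mean, median

# Made-up summary scores for one measure (lower = better is assumed).
before = {"P01": 12, "P02": 15, "P03": 9, "P04": 14}
after = {"P01": 10, "P02": 13, "P03": 9, "P04": 11, "P05": 8}

# Assumed threshold, taken from the 2-point figure quoted in the text.
CLINICALLY_MEANINGFUL = 2.0

# Only compare participants who have both a 'before' and an 'after' score.
paired = sorted(before.keys() & after.keys())
before_scores = [before[p] for p in paired]
after_scores = [after[p] for p in paired]

# 'After' taken away from 'before': positive values mean scores went down.
mean_difference = mean(before_scores) - mean(after_scores)
meets_threshold = mean_difference >= CLINICALLY_MEANINGFUL

print(f"n = {len(paired)}")
print(f"Before: mean {mean(before_scores):.2f}, median {median(before_scores):.2f}")
print(f"After:  mean {mean(after_scores):.2f}, median {median(after_scores):.2f}")
print(f"Mean difference: {mean_difference:.2f}")
print(f"At least {CLINICALLY_MEANINGFUL} points: {meets_threshold}")
```

Restricting the comparison to participants with both scores keeps the two means comparable. Here P05 is left out because there is no 'before' score.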
Remember that even if you find a statistically significant and clinically meaningful difference, this does not mean we can conclude the service is effective or ineffective. An effectiveness evaluation with a control group would be necessary to make this conclusion.