When we (rigorously) measure effectiveness, what do we find? Initial results from an Oxfam experiment.

Guest post from ace evaluator Dr Karl Hughes (right, in the field. Literally.)

Just over a year ago now, I wrote a blog featured on FP2P – Can we demonstrate effectiveness without bankrupting our NGO and/or becoming a randomista? – about Oxfam's attempt to up its game in understanding and demonstrating its effectiveness. Here, I outlined our ambitious plan of 'randomly selecting and then evaluating, using relatively rigorous methods by NGO standards, 40-ish mature interventions in various thematic areas'. We have dubbed these 'effectiveness reviews'. Given that most NGOs are currently grappling with how to credibly demonstrate their effectiveness, our 'global experiment' has grabbed the attention of some eminent bloggers (see William Savedoff's post for a recent example). Now I'm back with an update.

The first thing to say is that the effectiveness reviews are now up on the web. There you will find introductory material, a summary of the results for 2011/12, and some glossy (and hopefully easy-to-read) two-page summaries of each effectiveness review, as well as the full reports. (You may not want to download and print off the full technical reports for the quantitative effectiveness reviews unless you know what a p-value is. With the statistically challenged in mind, we have kindly created summary reports for these reviews, complete with traffic lights.) Eventually, all the effectiveness reviews we carry out or commission will be available from this site, unless there are good reasons why they cannot be publicly shared, e.g. security issues.

Plug over, I can now give you the inside scoop. In the first year (2011/12) we aimed to do 30 effectiveness reviews, and we managed to pull off 26. Not bad, but our experience in the first year made us realise that our post-first-year target of 40-ish reviews per year was perhaps a bit overly ambitious.
We have now scaled down our ambitions to 30-ish, both to avoid overburdening the organisation and to enable better quality control. The issue of quality control, in particular, is critical because there are certainly opportunities to strengthen the effectiveness reviews, particularly in terms of rigour.

Currently, there is considerable interest in how to evaluate the impact of interventions that don't lend themselves to statistical approaches, such as those seeking to bring about policy change (aka 'small n' interventions) – see a recent paper by Howard White and Daniel Phillips. We have attempted to address this by developing an evaluation protocol based on process tracing, a methodology used by some case study researchers. However, we are struggling to ensure consistent application of this protocol. Time and budgetary constraints, as well as the inaccessibility of certain data sources, are no doubt key factors working against us. Nevertheless, we aim to improve things this year by overseeing the researchers' work more tightly, coupled with more detailed guidelines and templates so they better understand what is expected.

While in no way perfect, we have perhaps had more success with the reviews of our 'large n' interventions, i.e. those targeting large numbers of people. This is, at least in part, because we are directly involved in setting up the data collection exercises, and we carry out the data analysis in-house. The key to their success is capturing quality data on plausible comparison populations and on the factors that influence programme participation, and this has worked out better in some cases than in others. We are also attempting to measure things that just aren't easy to measure, e.g. women's empowerment and 'resilience'. We are modifying our approaches and seeking to collaborate with academia to get better at this.
Despite their shortcomings, at £10,000-ish a pop (excluding staff time), we believe these exercises deliver pretty good value for money.

Humanitarian programming is not my thing, but I am particularly pleased with the humanitarian effectiveness reviews, which critically look at adherence to recognised quality standards. While some methodological tweaks are needed here and there, the cohort of reviews presents an impartial and critical assessment of Oxfam's performance and identifies key areas that need to be strengthened, e.g. gender mainstreaming.

So what do the effectiveness reviews reveal about Oxfam's effectiveness? While the sample of projects is too small to draw any firm conclusions, the results for this particular cohort of projects are – as one might expect – mixed. For most projects, there is evidence of impact for some measures but none for others.

There are, no question, some clear success stories, such as a disaster risk reduction (DRR) project in Pakistan's Punjab Province. Here, the intervention group reported receiving, on average, about 48 hours of advance warning of the devastating floods that hit Pakistan in the late summer of 2010, as compared with only 24 hours for the comparison group. Having had more time to prepare is one possible explanation for why the intervention households reported losing significantly less livestock and fewer other productive assets. Oxfam's research team is in the process of commissioning qualitative research to drill down on this project to better understand what made it work.

Given Oxfam's size and capacity to mobilise and make noise, it is no surprise that there is reasonably reliable evidence that many of the campaign projects have brought about at least some positive and meaningful changes, despite falling short of fully realising their lofty aims. However, the results for several of the sampled livelihoods and adaptation and risk reduction projects are, quite frankly, disappointing.
Figuring out why these particular projects have not worked is just as critical for learning as figuring out why the Pakistan one did.

Whether their findings are positive or negative, I have to admit that I am impressed with how seriously the effectiveness reviews are being taken by senior management. A management response system has been set up and embedded into the management line, where country teams formally commit themselves to taking action on the results.

That being said, the effectiveness reviews are in no way immune from internal controversy. The random nature of project selection is perhaps the biggest sticking point. While we do this to avoid 'cherry picking', inevitably some of the projects selected are small-scale and have little strategic relevance to the countries and regions concerned. Some are also concerned about how much time and resources the effectiveness reviews are sucking up.

We know that what we are attempting to pull off can be improved on a number of fronts, in terms of rigour, learning, and the engagement and ownership of country teams. The good thing is that we are able to modify and improve things as we go along. So any constructive criticism, advice, etc. is most welcome.



19 Responses to “When we (rigorously) measure effectiveness, what do we find? Initial results from an Oxfam experiment.”
  1. John Magrath

    Karl, this is a great initiative and fascinating. The 2-page summaries – I particularly like the “traffic light” system – are excellent. Hopefully this will inspire others in the sector, not only to put evaluations out there more systematically, but to make them more usable (are you listening e.g. EuropAid…..?)

  2. ex aid worker

    The European Commission (EuropeAid) had a similar system called ROM, Results-oriented monitoring, which monitors a large proportion of projects every year. It would be interesting to compare the two.

  3. Thanks for sharing all this information on the Oxfam experience in implementing effectiveness reviews. Developing reviews that are useful and cost-effective to implement is hard. Making your reviews and thoughts about reviewing available for others to learn from contributes significantly to improvements in this area.

  4. James Stevenson

    Hats off Karl. This must have taken a tremendous amount of hard work and careful “upward management” to get this institutionalised. This is a big breakthrough in terms of increasing the % of projects that are evaluated to a reasonable standard of rigour.
    Wonk digression… the propensity score matching (somewhat) circumvents the need to spend money on expensive baseline surveys. However, selection on unobservables might be a nagging concern for certain projects. But you seem to have all of that on your radar. Impressive.

  5. Thanks all for the comments and tweets so far. And James, the points you raise are spot on: for the quant. evaluations, the econometric techniques we use are no magic bullet. If we had decent baseline data on these projects from both the intervention and comparison groups, we could integrate PSM, regression, etc. with the difference-in-differences design. This would probably be the next best thing to having experimental data. But in almost all cases we are not in a position to compute proper dif-in-dif estimates.
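    For readers less familiar with the design mentioned above, a minimal sketch of the difference-in-differences logic, using made-up numbers (the data and units are purely illustrative, not from any Oxfam review):

    ```python
    # Difference-in-differences: compare the change over time in the
    # intervention group with the change over time in the comparison group,
    # netting out any trend shared by both groups.

    def diff_in_diff(treat_pre, treat_post, comp_pre, comp_post):
        """Return the DiD impact estimate from four lists of outcomes (group means)."""
        def mean(xs):
            return sum(xs) / len(xs)
        return (mean(treat_post) - mean(treat_pre)) - (mean(comp_post) - mean(comp_pre))

    # Hypothetical household outcome data (e.g. income in some currency unit):
    treat_pre = [100, 120, 110]   # intervention group at baseline
    treat_post = [150, 170, 160]  # intervention group at endline
    comp_pre = [105, 115, 110]    # comparison group at baseline
    comp_post = [125, 135, 130]   # comparison group at endline

    estimate = diff_in_diff(treat_pre, treat_post, comp_pre, comp_post)
    print(estimate)  # 30.0: a 50-unit gain in the treated group minus a 20-unit shared trend
    ```

    In practice one would run this as a regression with controls (and, as discussed, combine it with PSM), but the identifying logic is just this double subtraction.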
    However, we do try to get respondents to recall baseline data, particularly data we assume can be reliably recalled. Asset ownership and other indicators of household wealth status, e.g. floor type, are probably our best example of this. We ask respondents, for example, whether they have a bicycle both now and at baseline. (We try to jog their memory with historical markers, e.g. an election that took place the same year the project started.) This allows us to construct a baseline household wealth index, thereby enabling us to balance the two groups on baseline poverty status. We further difference each asset/indicator by time period and then run principal component analysis on the differences. We then get an estimate of how the poverty status of both the intervention and comparison groups has changed since the baseline. We have to be careful, however, that this change is not simply being driven by things that have been handed out by the project, e.g. farming tools or livestock.
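    As a rough illustration of the PCA step described above – all data here are invented, and the real analysis involves many more indicators and households – the differenced asset indicators can be collapsed into a single wealth-change score per household by taking the first principal component:

    ```python
    import numpy as np

    # Hypothetical asset-change matrix: rows are households, columns are asset
    # indicators differenced over time (owned at endline minus recalled baseline).
    asset_change = np.array([
        [1, 0, 1, 0],   # e.g. this household gained a bicycle and improved its floor
        [0, 0, 0, 0],
        [1, 1, 1, 1],
        [0, 1, 0, 0],
        [1, 1, 0, 1],
    ], dtype=float)

    # Standardise each indicator, then use the leading eigenvector of the
    # covariance matrix as the index weights (the first principal component).
    z = (asset_change - asset_change.mean(axis=0)) / asset_change.std(axis=0)
    cov = np.cov(z, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    weights = eigvecs[:, -1]                 # first principal component
    index = z @ weights                      # one wealth-change score per household

    print(index.round(2))
    ```

    Comparing the distribution of these scores between intervention and comparison households gives the kind of change-in-poverty-status estimate described above.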
    Now if we could just get programme teams to collect quality baseline data on both intervention and appropriate comparison populations, we would not have to go through all this fuss! But, as you are well aware, this is more easily said than done. In the meantime, we are trying to push the boundaries of what is possible in the context of a single-difference impact evaluation design. And, while the results may not be watertight enough to inform policy and/or further scientific knowledge, in most cases they are pretty good for organisational learning. What we are doing is meant as a complement to, not a replacement for, the more involved (and costly!) impact evaluation work à la 3ie, JPAL, IPA, etc.

  6. We at Room to Read have begun an evaluation using the process tracing methodology – inspired by the Howard White and Daniel Phillips paper. The jury is still out for me as to whether an effect can be attributed to programs with this methodology. Would love to learn more about your experience with this methodology and where the challenges are. We are learning that the data collection required is quite involved, so it’s understandable that you have found this difficult.

  7. B. Osborn Daponte

    It is good to see an attempt at rigor. I have a few general comments/suggestions, which you probably have already considered internally.
    Rather than “randomly” selecting interventions to evaluate, perhaps be more strategic about the selection. Consider a stratified random sample. What the stratification criteria would be is TBD, but consider things like the importance of the project, the likelihood that it may be replicated, budget size, project complexity, ….
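    To make the suggestion concrete – the strata and project names below are entirely hypothetical, and the real stratification criteria would be Oxfam's to choose – a stratified random selection might look like:

    ```python
    import random
    from collections import defaultdict

    # Hypothetical project register, stratified here by thematic area.
    projects = [
        {"name": "P1", "theme": "livelihoods"},
        {"name": "P2", "theme": "livelihoods"},
        {"name": "P3", "theme": "DRR"},
        {"name": "P4", "theme": "DRR"},
        {"name": "P5", "theme": "campaigns"},
        {"name": "P6", "theme": "campaigns"},
    ]

    def stratified_sample(projects, per_stratum, seed=0):
        """Randomly pick `per_stratum` projects from each thematic stratum."""
        rng = random.Random(seed)
        strata = defaultdict(list)
        for p in projects:
            strata[p["theme"]].append(p)
        sample = []
        for members in strata.values():
            sample.extend(rng.sample(members, min(per_stratum, len(members))))
        return sample

    picked = stratified_sample(projects, per_stratum=1)
    print(sorted(p["theme"] for p in picked))
    ```

    Each theme is guaranteed representation while selection within themes stays random, preserving the "no cherry picking" property.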
    Second, it seems that this approach does not allow for mid-course corrections, thus decreasing a potential value of evaluating the intervention. Consider whether creating a quasi-experimental design could also be done at stages of the project other than the final stage.
    I applaud your efforts and willingness to share with the evaluation community!

  8. Martin

    This is a great initiative, well done to those involved.
    I may have missed it, but I did not see reference to the costs of the projects in the evaluations. Was there any cost/benefit analysis undertaken?

  9. Mary Healy

    Very interesting, thanks for sharing. I’d love to see the total costs presented, INCLUSIVE of staff time, as this is the real cost to Oxfam and to any possible donors.

  10. First, let me say that this is brilliant. I like the focus on reporting both effective and ineffective results. Further, the format of short reports backed by longer and more rigorous reports makes the information both digestible and trustable – a rare combination in this arena.
    A quick piece of feedback. It would be very useful in the 2-page summaries to have a bit more information about the project – specifically the budget and timeline. For instance, the project review for the Copperbelt Livelihood Project in Zambia (http://ow.ly/eoTKe) reports that: “Women in the intervention village were found more likely to own at least one strategic asset.” To properly interpret this it would be very helpful to know, at a glance, what the duration of the project was, and what the elapsed time was between the end of the project and the evaluation.
    Regardless, this is a very good start, and thanks again for sharing publicly.

  11. Luc Lapointe

    Thank you for taking on this initiative and I look forward to reading the report.
    In an era where we are trying to break silos and where more actors are engaged in aid delivery / development, I hope that we will soon see efforts to measure collective impact. The success of your programs depends on a multitude of factors and organizations that are most likely not associated with you. Your success (impact/outcomes) will greatly increase when we measure the collective impacts in the context of cooperation/collaboration and strengthening partners for greater effectiveness.

  12. Lee

    Interesting stuff. In line with B. Osborn Daponte’s comments on using a stratified sample rather than a totally random sample – you might want to consider some kind of filter.
    DFID issued guidance recently which lays out some criteria for which projects should be evaluated:
    “It is not necessary, or desirable, to evaluate every programme, but the decision needs to be clearly justified. Many DFID country offices have developed evaluation strategies identifying criteria for deciding if an evaluation is needed. Common criteria include:
    – a weak evidence base,
    – a contentious intervention,
    – stakeholder interest,
    – an innovative or pilot programme, and
    – a high financial value.”

  13. Great to see Oxfam trialling new approaches and doing it so transparently. Like other NGOs, we’ve a lot to learn from your method and findings. Thank you for being so open.
    I’ve just blogged about the reviews: http://ngoperformance.org/2012/10/17/hats-off-to-oxfam-but-are-they-asking-the-right-question/
    I strongly support your experiment and transparency. The blog mentions six main reactions and a couple of questions: (a) which decisions & actions will be most influenced by the reviews, and (b) how do the reviews help field staff do their jobs better?
    I’m looking forward to continuing to work on these issues together – they are such a high priority for the sector as a whole.

  14. Hats off indeed! Hats off also to Alex Jacobs, whose post raises the questions that have gone through my mind. I have an extra question: Why does Oxfam select the programmes to be evaluated by random sampling? One could argue that it is the most “neutral” or easiest option – you don’t need to think about the criteria you would use to select interventions. But if the key purpose is to generate learning within the organisation and within the wider sector, wouldn’t it be more cost-effective to select the projects for evaluation in a more targeted manner? For instance, if you wish to find out about the effectiveness of a specific theory of change in a specific sector, you could organise a set of case studies in a range of countries. A more purposive approach would also enable you to limit RCTs to those kinds of projects where they make sense. That is not just a matter of numbers – many development interventions are just too complex for RCTs to yield meaningful insights. I have posted a couple of interesting presentations on that issue on my blog: http://www.developblog.org/2012/10/evidence-of-what-worked-at-some-point.html and http://www.developblog.org/2012/05/participatory-statistics-and-more.html
    Meanwhile, many thanks to Oxfam for advancing the debate and research on evaluation in development by making the full reports publicly accessible!

  15. Congratulations, this is a truly huge piece of work. At World Vision International we have also been experimenting with different approaches to tackle the effectiveness question and the evaluation challenge. We are taking a different approach – a three pronged one – so I was really interested to read about Oxfam’s approach to this Effectiveness Review, and your successes and challenges.
    Firstly we are piloting ‘Annual Summary Reports’ with country offices, based on existing M&E data (we do collect baseline data). These are intended to be simple reports with strategic relevance, focused on internal learning for improving decision making at country office level for programme effectiveness.
    Secondly, at the global level, to produce a meta-review of these reports, but quality issues are also a challenge. The first pilot report was released internally in September. I hope that by next year we will also be releasing externally as Oxfam has.
    Thirdly, we are working with an academic institution on an impact study with comparison groups in three countries, focused on health and nutrition. We are hoping this will tell us more about ‘impact’, but again, results will take time.
    I am particularly impressed with a few less obvious points in your work: overcoming internal controversy and challenges (no small feat!); posting the management responses online; and the overall 2-page summary. Simplicity is the ultimate sophistication. Again, congratulations on this milestone. It has given us more food for thought. I look forward to next year’s report!

  16. Daniel

    How can you say that 40 x £10,000 was good value for money without knowing what action the country teams actually took? That defines the value of the investment – i.e., the objective of all this effort.