Prediction & Predictability

Racehorse Handicapping: Predicting the Unpredictable?

The role of a horseracing handicapper is to ensure that each horse in a race is carrying enough weight to offset their differing capabilities and their varying levels of form.  It’s seen as a vital task because it means that, in theory at least, champion horses in the peak of their form are matched more evenly with their less illustrious competitors, ensuring a more tightly contested, less predictable race. 

Taking the logic to its natural conclusion, the handicapper will only have done their job correctly if all horses in a race cross the line at the same time.  While it’s possible (but still unusual) to have a dead heat in a two-horse or, in extremely rare cases, in a three-horse race, it’s functionally impossible for this ever to happen in a race involving a larger number of horses.  

Famously, the Grand National is never a close race, using the definition of closeness as the difference between first and last places – indeed many horses fail to complete the course each year and the favourite rarely wins.  There are just too many horses, too many obstacles, there is too much distance and arguably, there is too much that is unusual about the preparation to ever confidently hope to call a winner, let alone be able to harmonise the finish across the whole field.  In probability terms, there are simply far too many unknown variables to trust any form of predictive modelling that would ever enable a handicapper to achieve the ‘Holy Grail’ of all horses crossing the finish line at the same time.  In the face of such overwhelming statistical evidence to suggest its basic futility, why is handicapping necessary?

The answer is that ensuring a dead heat is not the point of handicapping at all.  Handicapping is there to offset perceived differences in horses’ abilities and form.  It acts as a regulator for betting, ensuring that favourites will not be favoured by the betting public by as wide a margin and that ‘dark horses’ will be viewed less darkly than they would be without handicapping.  It serves the industry behind the sport, not the sport itself.  There is no handicapping in Athletics purely because the sport exists primarily as a discipline to discern which athlete is the fastest (and by how much).  Only the overlay of betting leads to the necessity of handicapping – something which many might see as a perversion of the conventions of pure sport.

Uncovering the ‘real’ reason behind handicapping seems rather dull, irrelevant and perhaps even a little dispiriting, but the subject is still of value: it acts as an interesting analogy for what can and what perhaps can’t be predicted – and for the extent to which the distinction between the two states may become blurred.

Direct Marketing & Parallels with Racehorse Handicapping

The role of a Direct Marketer is to predict, accurately, the event of each customer choosing to make a purchase from an offering in a given time-frame – or not, as the case may be.  As with handicapping, various models exist to discern the factors that most affect future behaviour.  As with handicapping, these models are widely accepted as being able to predict the general level of behaviour more reliably than would otherwise be the case.  As with handicapping, there are far too many variables to translate such improvements to the individual level.  At this point, even the offer to give away £1,000 of vouchers with every £10 order will still only yield a certain percentage of response – it will not motivate every customer into action, often for a variety of what appear to be illogical reasons.

It may be suggested that the ‘Holy Grail’ of Direct Marketing is just as simple and just as unobtainable as the race where all horses cross the line together: an activity segmented using a profile that selects only those customers who will order.

In reality, for this to occur, not only must this segmentation yield a 100% activation rate for the successful segment, but it must also be shown that all other segments will always yield a 0% activation rate – a practical impossibility.  

Just as a handicapper may occasionally achieve a 2-way dead heat, a Marketer may occasionally achieve a 100% activation in a segment with a very small sample.  In that circumstance, the Direct Marketer’s expectation is always that the offer, made more broadly, must be transferable to other segments, uplifting their performance.  The activity is then repeated through various other segments with the expectation that it keeps performing profitably until it fails.  In short, the ultimate goal state of a Marketer can therefore never happen, as another sale can always be found.

Even if a model existed to find only those people who would ever respond to a given stimulus, it would still be akin to declaring “this is all the sales you can ever make”.  It would be perfectly efficient, of course, but it doesn’t necessarily mean that revenue is increased by all that much.  It just clarifies when to stop chasing the extra sales.

In reality, this is a problem we’re highly unlikely ever to face.  Customers are people and people are (at the individual level) incredibly difficult to predict.  The ‘Holy Grail’ state just shows us what a perfect level of predictability would look like, which is useful when it comes to comparing and evaluating our own methods.

Applying a Predictive Model in Direct Marketing

As a contrast to the imaginary problem above, real-world examples of response rates across the segments of an activity tend to adhere to a more familiar principle: the law of diminishing returns.  

This is taken from campaign data from a previous Spring/Summer campaign, using segments driven by our prior ‘Points Analysis’ method of segmentation and recorded from response codes given during telephone orders.  For this reason (as it therefore ignores web orders from that campaign), the percentages are not relevant here, just the shape of the curve.  

As with the ‘Holy Grail’ curve above, it starts off steeply, implying that this is a clear way to predict the responsiveness of one group over another.  However, as the trendline (I’ve used a logarithmic trendline, by the way) progresses along the segments, it flattens so that by the lower segments, it almost represents an admission that the model can’t really say if the second to last segment contains significantly more predictable customers than the last segment.

Using the ‘revenue-building’ logic discussed above, this uncertainty can be (and often is) presented as a positive feature.  As long as the responsiveness is at a profitable level, this ‘long tail’ becomes something of an asset, as it assures the Marketer that more sales can be added, with a positive ROI until the point on the axis where the curve touches the break-even point of response.  The fact that these sales happen to come with decreasing levels of efficiency may be seen as a price worth paying.

One rather fundamental problem in the collation of the above chart was that the response metric was based on order-level, not customer-level responses.  At this point, we need to be rather pedantic: the issue of predictiveness relies ultimately on the response of an individual to a stimulus, which is then grouped by the segments of similar individuals.  Using the principles of RFM (the categorisation of customers by Recency, Frequency and Monetary Value), order-level analysis conflates the effects of both R and F, when we require them to be viewed in isolation. To illustrate this point, consider that one hundred orders from a given segment may imply one hundred responding customers but it could in reality translate to just one very responsive customer – or any combination of factors in between.
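The conflation described above can be made concrete with a small sketch (the figures are invented for illustration):

```python
# Two scenarios producing the identical order-level count: 100 orders can
# come from 100 customers or from a single very responsive one, and an
# order-level metric cannot tell them apart.
orders_scenario_a = [f"cust_{i}" for i in range(100)]  # 100 customers, 1 order each
orders_scenario_b = ["cust_0"] * 100                   # 1 customer, 100 orders

for name, orders in [("A", orders_scenario_a), ("B", orders_scenario_b)]:
    print(name, "orders:", len(orders), "responding customers:", len(set(orders)))
# A: 100 orders, 100 responding customers
# B: 100 orders, 1 responding customer
```

Only the distinct-customer count separates the two cases, which is precisely the view that order-level reporting discards.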

Since then, we’ve adopted the more standard Binary segmentation model, which ensures the monitoring is at the customer-level, preferring the percentage metric ‘Activation’ (customers who ordered in a given season as a percentage of customers stimulated, by category) over the more traditional, order-level metric ‘Response Rate’ (orders received using a given response code as a percentage of catalogues circulated with that media code).  The uncertainty factor of one customer ordering a hundred times versus a hundred customers ordering once each has subsequently been removed.  We can now monitor precisely how many customers have ordered, as well as the number of orders those customers have placed, collectively and individually.
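A minimal sketch of the two metrics, using invented counts, shows how they can diverge for the same campaign:

```python
# Hypothetical campaign figures: Response Rate is order-level,
# Activation is customer-level, as defined in the text above.
orders_received = 450          # orders carrying the campaign's response code
catalogues_circulated = 10000
customers_ordered = 380        # distinct customers who ordered in the season
customers_stimulated = 10000

response_rate = orders_received / catalogues_circulated   # 0.045
activation = customers_ordered / customers_stimulated     # 0.038
print(f"Response Rate: {response_rate:.1%}  Activation: {activation:.1%}")
# Response Rate: 4.5%  Activation: 3.8%
```

The gap between the two figures is exactly the repeat-ordering effect that Activation is designed to strip out.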

The Activation performance of the Binary list for the most recent Spring/Summer campaign, expressed for each group, shows a similar curve, implying the same adherence to the law of diminishing returns as the older Points Analysis-derived curve above.

Once again, the asymptotic (flattening) curve implies a longer tail beyond the limits of the mailing list, which, using the methodology of the Binary process (with its allocation of decreasing points for customers ordering increasingly further back in time), also implies that further revenue can only be attracted at a less efficient rate.  In effect, it’s almost telling us that after a certain point, we can mail anyone using this rationale and we’ll probably get the same return, whatever it is.  This is hardly what you would call a predictive model.

All this is implied but none of it can be taken for granted, just as no segment that yields 100% Activation ever implies that the ‘Holy Grail’ has been achieved – there is always the question “what further potential is there?” to answer.  It’s clear that we need other means of predictiveness to unlock the secrets of the deeper recesses of our mailing list.

The Limitations of the Binary System

Largely as a result of the paranoia/healthy scepticism (call it what you will) of putting all our eggs in the basket that is Binary segmentation, we have, since adopting Binary, also endeavoured to add a wider pool of customers to our recent mailings selections than merely those segments suggested by that system.  It’s not unusual or ground-breaking to do so; it’s a practice that’s routinely done by even the most faithful proponents of Binary segmentation and it’s called deep-diving.

Using our previous (semi-proven) Points Analysis system as our deep-dive axis, we mailed representative samples from these deeper segments of customers and named them groups -1 to -5, in accordance with the Binary nomenclature.

What we found was that a huge proportion of the -1 group customers were activated (far more than we had anticipated), the equivalent of the Group 12 Binary segment, i.e. the best segment of the ‘Good’ portion of the list.  Thereafter, the activation rate dropped massively for the -2 segment and continued to tail off gradually through to the -5 segment. 

Perhaps it should come as no real surprise that there is a significant increase in activation in any Binary analysis from the 1 segment to anything that is essentially the ‘best of the rest’.  I have to presume that a known increase in activation at this point in the list is not only common but probably also a phenomenon that is to be expected.  Conversely, I have no idea if the level of disparity at this point is generally as great as we have found it to be.  I rather suspect it isn’t.

There are two benefits to this figure being so notably high, which represent the twin roles of predictive segmentation I have already outlined.  Firstly and most prosaically, it represents almost 7,000 activated customers and almost £300,000 of additional revenue.  Secondly, it gives us a definition of customer type that we know we can continue to stimulate efficiently and it strongly indicates at what point this metric provides segments that are inefficiently stimulated.  It also calls into question the wider viability of a system that seems to ignore a cohort of customers who are capable of yielding half as many activations as those it selects.

Ordinarily, as the Direct Marketing wheel turns and the results of one campaign’s test shape the standard practice in the next campaign, thoughts turn to the question of what methodology to test next.  With such a statistical disparity as this, it’s also difficult to escape the conclusion that the Binary model as it stands may not be wholly suitable for our requirements.  This is not to say that the practice hasn’t been worthwhile or indeed that the notion of measuring campaign performance at the customer level isn’t of value.  In fact, the opposite is true: with ever more ordering methods, media codes as a means of recording performance are dying and, even if we could resurrect them, we would return to the same non-relational order-level analysis that tells us nothing about the customers on whom our business depends.

I would always advocate a customer-level metric, even if I might always wish for a method of segmentation that is more clearly suited to our list profile.  The reporting disciplines required and indeed the limitations that customer-centricity can have on budgeting for additional in-season activities are all, in my view, a small price to pay for the insight the analysis can give to actual customers.  As we move inexorably to a more sophisticated multi-channel interaction-based data model which encompasses customers’ web visits, email responses, retail transactions and even social media activity, it is clear that our basic ‘currency’, the only differentiating factor we have, to analyse anything of significance will eventually (and then always) be at the customer level.  

Having said that, if we’re at the point of re-drawing the boundaries of what constitutes ‘very good’ customers from ‘good’ and so on, we can also have an eye on what shape of curve we’d like it to produce, based on recent customer behaviour tracked against information known about customers before that activity occurred.  As I have already outlined, the process of measuring the performance of an activity has two basic roles: to assess both its magnitude and its efficiency.  A curve that simply emphasises the magnitude of success is too steep and does little to imply where further success might be found.  A curve which concentrates too much on efficiency tends to be too horizontal and can very quickly become practically non-predictive.

Obviously, there will always be customers who are more responsive than others in any database, so it’s true to say that any curve will show degradation.  In fact, as it’s a symptom of a correct profiling methodology, activation curves should slope steadily downward from the customers predicted to be the most responsive.  It’s also fair to presume that if you measure a list against any given single metric, there will always be a ‘best of the rest’, chosen using a different metric, which may out-perform the usual list, so at some point a secondary or even a tertiary segmentation metric should be considered.  A problem can occur if those segments suggested by other metrics out-perform the primary-metric segments by too much.  This may imply that a better, more appropriate primary profile would have included those names in the first place, something which would ensure the risk of missing such customers from a future campaign is minimised.

An Easy Win: Challenging the Timeframe

One way to improve the primary metric we have (Binary) may be to re-define the timescale of the selection.  The version of Binary that we’ve adopted is based on Yes/No (or 1/0, hence the name ‘Binary’) classifications for a customer’s ordering profile over each of the last four six-month seasons.  It is entirely predicated on the fairly standard assumption that a customer is a customer from the date of their first order until exactly two years beyond the date of their last order.  By extension, anyone on the list who hasn’t ordered for over two years must be considered a lapsed customer and is removed from the house list.  They may continue to be contacted, but only as part of a reactivation programme.

The fact that the Binary system is based on a two-year model and the fact that it was adopted by ‘mainstream’ catalogue operators such as Littlewoods and La Redoute are, I suspect, closely connected.  I have always been (and remain) dubious that the simplistic ‘two year rule’ applies as strongly in a niche market such as our own.  As a ‘safety net’ against pinning our performance on adhering to it, I ensured that our mailings included a ‘best of the rest’ deep-dive, based on high point-scoring customers (who would therefore have been mailed under our previous segmentation model), who, being outside of the Binary segments, would therefore have been inactive for over two years.

As we have seen from the most recent data, this 30,000-deep segment yielded a response (and therefore a Return on Investment) performance similar to the ‘12’ group in the standard 4-season Binary model.  Evidently, our less Recent, more Frequent and/or higher Monetary-value individuals were able to outperform most of their more Recent counterparts.  The cut-off at two years has always seemed arbitrary and inappropriate for us – and these figures appear to support that position.  Recency is therefore not necessarily ‘king’ in a niche market, even if it may be considered as such by more mainstream operators.

To corroborate this view, perhaps it’s helpful to contrast the characteristics of a mainstream proposition and a mainstream customer with those propositions in a more niche market context.  

Mainstream v Niche: Some Observations

Mainstream catalogue companies have tended to define their core markets more by the way they choose to buy (i.e. by choosing not to walk into a shop) far more than by the type of products they buy.  They are in competition with a far wider section of the market, selling standard products to a broad section of the public.  Light fittings, pyjamas, holiday footwear and all the other day-to-day offerings were always generally available on any high street or in a plethora of other catalogues or websites, in which there is usually massive competition.  It is therefore difficult for them to create a sense of what their brand represents beyond their pricing, the quality of their merchandise and their service – certainly no-one can define their range as a whole as representing and supporting a ‘lifestyle choice’.  Even before the further commodification of retail by search engine and affiliate sites, their offering was often close to being commodified by the presence of so much competition.

It is easy (and perhaps fair) to conclude that they must therefore adopt a ‘plenty more fish in the sea’ approach, favouring customer acquisition over retention.  If customers are that easily acquired, and if retention can prove so difficult, it follows that it is seen as far easier to entice a new customer than it is to win back one who has been away for even a relatively short amount of time.  It’s dangerous to suggest they acted arbitrarily in arriving at two years as the determinant of dormancy; it seems reasonable to expect that it was driven by their data, suggesting a parameter that was appropriate for their purposes.

Conversely, niche market businesses tend to define their customers by a specific activity or affinity, which is to a greater or lesser extent important to all of their customers.  They may find that the percentage of customers willing to buy remotely in that market is far higher than in general (historically) because of the relative lack of credible alternatives.  Broader ranges of products that appertain to that activity or affinity may be more difficult to build, depending on the obscurity or the scope of that activity or affinity.  Wider competition will always be present but, at their strongest, these niche markets are filled with customers who define their interest as a ‘lifestyle choice’.  These brands do not just purvey goods, they represent or even define a lifestyle. 

In a niche market, almost by definition, there aren’t quite so many ‘other fish in the sea’ and even customers who have been lapsed for a number of years are a far greater prospect to approach once more than any attempt to trawl for a fresh batch.  If customers are not so easily acquired, and if retention proves less difficult than in the mainstream sector, it follows that it is disproportionately easier to entice an older customer than it is to acquire a new one. It seems clear that these markets inherently find the mainstream parameter of dormancy at two years to be inappropriate for their purposes.

Extending the Binary System from Two Years to Three

The Binary system’s strengths are its customer-centricity, its ability consistently to predict the difference in response between more regular-and-recent and less regular-and-recent customers, and its scalability.  Its weakness is the fact that we can prove it has omitted perfectly responsive customers.  Perhaps this can be corrected by using its scalability to ensure that they are re-admitted into the process.

Under a four-season (two-year) standard model, the categories are defined by fifteen groups: the number of permutations of order activity (or inactivity) across four seasons, excluding the single permutation with no orders in any season.  One point is awarded for the least recent reported season (four seasons ago), two points for an order three seasons ago, four points for orders from the penultimate season and eight for the last season.  The number of points awarded doubles the more recent the season, which seems like an arbitrary system but is actually an ingenious mechanism to ensure that every single permutation is represented by a different number of points.
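The doubling-points mechanism can be sketched in a few lines (the function name is mine; the weighting follows the scheme described above, and the same function extends naturally to six seasons):

```python
def binary_group(seasons):
    """Map a per-season order history to its Binary group number.

    `seasons` lists order activity (1 = ordered, 0 = not), from least
    recent to most recent.  Points double each season (1, 2, 4, 8, ...),
    so every permutation yields a unique score: 1-15 for four seasons,
    1-63 for six.
    """
    return sum(flag << i for i, flag in enumerate(seasons))

# Four-season examples:
print(binary_group([1, 0, 0, 0]))  # 1  (ordered only four seasons ago)
print(binary_group([0, 0, 0, 1]))  # 8  (ordered only last season)
print(binary_group([1, 1, 1, 1]))  # 15 (ordered in every season)

# Six-season model: all six seasons active gives the top group.
print(binary_group([1] * 6))       # 63
```

The doubling is what guarantees uniqueness: it is simply binary place-value, so no two distinct order histories can ever share a score.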

In this way, we may contend that Recency is a vital factor in predictive modelling whilst also expecting to target customers who are patently less Recent in profile.  The crucial point is that we have evidence suggesting we cut off responsive customers too readily by adhering to a ‘two-year rule’. By re-introducing segments of longer-dormant customers, we become able to evaluate their relative value – and therefore the predictiveness of this wider flavour of Binary analysis. As with the current four-season model, there’s also the thorny issue of how many high-performing customer segments even this extended model may continue to ignore.

We can’t turn back time but we can simulate the conditions of a six-season Binary selection.  It is possible to re-order the customers we may have selected for the current Autumn/Winter campaign using a six-season Binary model.  From there, we can identify not only which customers were mailed but also which customers placed an order in the current campaign and compare them with the equivalent responses using the usual 15-point, four-season Binary model.  With six seasons, the number of permutations of orders increases from 15 to 63.  

This hugely increases the level of granularity that the list analysis can give and will also help to establish the importance of the 5th-last and 6th-last season on predictiveness for a forthcoming campaign. Using four seasons (two years), the Binary graph for Activations from the current Autumn/Winter campaign to late November looks like this:

The same response data under a six-season (three year) Binary grouping shows a similar degradation but with more definition between high-performing and low-performing segments.

The added granularity helps to provide more evidence of predictiveness at each end of the Binary spectrum.  Almost two hundred more customers are classified in groups which yielded an Activation rate of over 40% than in the 15-point model, and over six hundred more customers are classified in groups which yielded less than a 10% Activation rate.  If a 10% rate were shown to be the break-even point for inclusion, then this information would identify names for whom the Binary model does not predict a sufficient response.  If no other justification could be found to mail those names, then that information could demonstrate a saving of unnecessary expenditure.
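As a sketch of that saving calculation (the segment counts, activation rates and cost per catalogue are all invented for illustration; only the 10% break-even level comes from the discussion above):

```python
# Hypothetical six-season segment results: (names in segment, activation rate).
segments = {
    "group_3": (1200, 0.22),
    "group_2": (800, 0.12),
    "group_1": (650, 0.07),   # below break-even
}
break_even_activation = 0.10  # the 10% threshold discussed in the text
cost_per_catalogue = 0.85     # assumed print + postage cost, GBP

# Names predicted to fall short of break-even, and the mailing cost avoided.
cut = sum(n for n, act in segments.values() if act < break_even_activation)
print(f"Names cut: {cut}, expenditure saved: £{cut * cost_per_catalogue:.2f}")
# Names cut: 650, expenditure saved: £552.50
```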

A ‘Health Warning’ for Any Model of Segmentation with a Single Axis

As we’ve already seen, demonstrating a suitably stratified segmentation model is only the first requirement of achieving a fully-optimised list.  We must also ensure that no other potentially responsive segments are omitted.  I’ve also highlighted the almost inevitable need for some subsequent segmentation criteria to exist beyond the reaches of the primary (in this case, Binary) model.  Not only should this process stimulate as many as possible of the remaining responsive segments (a ‘best-of-the-rest’ group), it should also seek to test other responsive techniques beyond that. 

A good example of that methodology would be the segmentation of customers, irrespective of Binary and Points, who have previously ordered during a Sale, for a mailing of a Sale Catalogue.  This is based on a given principle (that a customer is a known Sale responder). In the field of probability, this is known as Conditional Probability: where a given condition is already known to hold, related outcomes carry a higher degree of probability and therefore predictiveness.  The methodology appears sound; the results may or may not agree, but either way, the results of that decision will shape our future selections.
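A toy calculation, with invented figures, illustrates the conditional-probability effect being described:

```python
# Hypothetical counts: conditioning on "known Sale responder" lifts the
# probability of ordering well above the base rate across the whole list.
customers = 50000
ordered = 3000                    # ordered from the Sale catalogue
past_sale_responders = 8000       # the conditioning segment
ordered_and_past_responder = 1200

p_order = ordered / customers                                   # P(order)
p_order_given = ordered_and_past_responder / past_sale_responders  # P(order | responder)
print(f"P(order) = {p_order:.0%}, P(order | past Sale responder) = {p_order_given:.0%}")
# P(order) = 6%, P(order | past Sale responder) = 15%
```

The segment is worth mailing precisely because the conditional rate is a multiple of the unconditional one; if the two rates were equal, the condition would carry no predictive information.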

Currently, our preferred secondary metric is the Points from our long-standing ‘PointsAnalysis’ table, which was created for our previous segmentation technique, where customers accrue 100 points every time they order, gain 1.5 points for every pound they spend and lose a point for every day that passes without an order.
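The scoring rule translates directly into a few lines (a sketch; the function name is mine and the formula simply follows the description above):

```python
from datetime import date

def points(order_count, total_spend, last_order_date, as_of):
    """Points Analysis score: 100 points per order, 1.5 points per pound
    spent, minus one point per day elapsed since the last order."""
    days_dormant = (as_of - last_order_date).days
    return 100 * order_count + 1.5 * total_spend - days_dormant

# A customer with 3 orders totalling £180, last ordering 200 days ago:
# 300 + 270 - 200 = 370 points.
print(points(3, 180.0, date(2024, 1, 1), as_of=date(2024, 7, 19)))  # 370.0
```

Note how the daily decay makes Points a transient variable, which is exactly why the snapshot-at-selection write-back discussed below matters.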

In order to pursue this line of segment development, we will need to more clearly record what segments were used and on what basis.  Where transient variables such as Points are used, the figures at the date of segmentation need to be written back to the database to enable better, easier analysis and cross-reference between the segments used and their eventual performance.

As part of my reconstruction of a six-season Binary in Excel, I have been able to identify customers mailed with and responding to the Spring/Summer Deep Dive catalogues.  I have also been able to reverse engineer their historic Points level at around January 15th, based on their November Points and their known activity since January.  This graph is what that analysis suggests.  I can’t guarantee perfect accuracy within each Points band but I can say that the totals for each group match those given by a report for the activity of groups -1 to -5.

These responses strongly suggest that there are responsive customers to be found outside of the 4-season Binary model we employed in that season, bearing in mind that the Activation level for Binary group 1 was 4.6% and the “-1” Deep Dive group yielded around 18% Activation.


There’s nothing wrong with mailing across multiple axes of segmentation, as long as the hierarchy is established (if a customer qualifies for a segment in each method, which one wins and which method is left with the rest?) and as long as each segment is performing well.  Curves which become too horizontal may still be predictive at the level of each category but also show that the method itself has begun to lose its predictiveness at that point.  Thought should be given to the point in the list/on the axis at which one model is abandoned and another is given free rein to replace it.
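The hierarchy question becomes mechanical once a winner is declared; a minimal sketch (customer names invented) where the primary axis wins any tie:

```python
# Two selection axes for the same mailing: the primary (Binary) axis keeps
# every name it qualifies; the secondary (Points deep-dive) axis is left
# with only the names the primary did not already claim.
binary_selection = {"anne", "bob", "carol"}   # primary axis
points_selection = {"bob", "dave", "erin"}    # secondary axis

deep_dive = points_selection - binary_selection  # Binary wins the overlap
print(sorted(deep_dive))  # ['dave', 'erin']
```

Declaring the hierarchy up front also keeps the performance reporting honest: each activated customer is attributed to exactly one axis, so neither method's curve is flattered by the other's names.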

For example, using the derived ‘live’ data for the current campaign below, comparing six-season Binary with Deep-Dive, based on Points, it may be concluded that, long tail or not, groups 1-4 should not be mailed but those quantities replaced by the best of the rest on the Points scale.

There are of course too many variables to merely prescribe a ‘one-size-fits-all’ answer here.  Issues of quantities of names available within each group together with associated AOVs and break-even Activation levels all play a part.  The main issue at this stage is that we give ourselves the impetus, and the tools, to break away from a single system of segmentation, as long as our focus remains at the customer level. 

Whatever we do, it should be a far more scientific process than simply betting on the horses…