Thursday 25 October 2012

Why 'everything' is a database


There was a character in the late 90s sketch show ‘Goodness Gracious Me’ who kept annoying his son by claiming that everyone of note came from India:
Da Vinci? Indian. The Queen? Indian. Picasso? Indian.

I have a similar trait to that character except my ubiquitous reference is ‘Database’:
Google? Database. Facebook?  Database. Twitter? Database.

Ultimately all big organisations are doing the same thing, just in slightly different ways. They all collect huge amounts of data; the difference is how they pass it back to users, and the key is how they store, manipulate and disseminate it.

What’s all this got to do with football?  Well, looking at the MCFC Analytics data, I was struck by how similar it is to the kind of data you might see in a normal customer database. The data is provided at a level of one record per player per match, which can be thought of as items in an order: each order (match) has multiple items, and each customer (team) has multiple orders.

From here the natural step is to turn the raw data into summary views, which provide the starting point for any analysis. In database marketing terms these would be:

Single Team View – One record per Team
Single Match View – One record per Match
Single Player View – One record per Player
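
As a rough illustration of how these views fall out of the player-per-match records, here is a minimal Python sketch. The field names (team, match_id, player, goals, shots) and the sample rows are made up for the example and are not the actual MCFC Analytics column headings. The match view is keyed by team and match, so each game gives one record per side; that is a choice made for this sketch.

```python
from collections import defaultdict

# Hypothetical player-match rows; names and numbers are illustrative only.
rows = [
    {"team": "A", "match_id": 1, "player": "Smith", "goals": 1, "shots": 3},
    {"team": "A", "match_id": 1, "player": "Jones", "goals": 0, "shots": 2},
    {"team": "A", "match_id": 2, "player": "Smith", "goals": 2, "shots": 4},
    {"team": "B", "match_id": 1, "player": "Brown", "goals": 1, "shots": 5},
]


def single_view(rows, keys, measures=("goals", "shots")):
    """Total the measures so there is one record per distinct key."""
    totals = defaultdict(lambda: dict.fromkeys(measures, 0))
    for row in rows:
        key = tuple(row[k] for k in keys)
        for measure in measures:
            totals[key][measure] += row[measure]
    return dict(totals)


team_view = single_view(rows, ["team"])                # one record per team
match_view = single_view(rows, ["team", "match_id"])   # one per team per match
player_view = single_view(rows, ["team", "player"])    # one record per player
```

The same function produces all three views; only the grouping key changes, which is the database equivalent of a GROUP BY at different levels.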

The insight usually comes not just from aggregating the raw data but from manipulating it to create extra variables which give a greater depth of understanding beyond just totals and averages.

The first of these I have put together is the single team view. The main part of this is just totalling the details of the individual players (along with the own goals data), with other details about each team added in.

This produces a table of nearly 200 columns, which is fine as a data source, but looking at it for any length of time will give you a headache.  The job of any analyst should be to take this and make something more user friendly.

To that end I have produced a summary dataset called single team view summary.xls, with one record for each team. As well as the usual goals scored/conceded, it has some other information which I think is pretty interesting.

Much has been made of Newcastle possibly punching above their weight (i.e., being lucky) and possibly being in store for a more average season this time.  It’s certainly true that a number of stats suggest they overperformed:
  • Newcastle had more shots than the opposition in only 15 of their 38 games, around half the number managed by the teams around them in the table.
The top 4 (plus Chelsea and Liverpool) had more shots than the opposition in the majority of their matches
  • They conceded roughly three ‘Big Chances’ for every two they created (a ratio of 0.67 Big Chances created per Big Chance conceded), where a 'Big Chance' is an opportunity from which a goal would be expected.  Chelsea are the only other top-half team where the ratio is less than 1.
For this metric, the top 4 (plus Everton, Liverpool and Fulham) are the only sides to create more 'Big Chances' than they concede
  • For the majority of their games, Newcastle had fewer passes and fewer final third passes than their opponents, whereas the rest of the top 6 dominated.
The traditional 'Big Six' were the teams that tended to dominate passing (especially final third passes), with Swansea and Stoke being outliers.
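
For anyone wanting to derive this kind of figure themselves, the sketch below shows one way the two types of metric above could be calculated from a team's match-level records: a count of games in which the team out-did the opposition on a stat, and a created-per-conceded ratio across the season. The field names and the three sample games are invented for illustration and are not taken from the spreadsheet.

```python
# Hypothetical records: one per game for a single team, with the opponent's
# figures alongside under an "opp_" prefix. Numbers are made up.
matches = [
    {"shots": 10, "opp_shots": 14, "big_chances": 1, "opp_big_chances": 2},
    {"shots": 12, "opp_shots": 9, "big_chances": 2, "opp_big_chances": 1},
    {"shots": 8, "opp_shots": 8, "big_chances": 1, "opp_big_chances": 3},
]


def games_ahead(matches, stat):
    """Count the games where the team had strictly more of `stat` than the opposition."""
    return sum(1 for m in matches if m[stat] > m["opp_" + stat])


def created_per_conceded(matches, stat="big_chances"):
    """Season total created divided by season total conceded."""
    created = sum(m[stat] for m in matches)
    conceded = sum(m["opp_" + stat] for m in matches)
    return created / conceded if conceded else float("nan")


shots_games = games_ahead(matches, "shots")   # level games do not count
ratio = created_per_conceded(matches)         # 4 created / 6 conceded
```

Note that the ratio is taken over season totals rather than averaged game by game, so a single one-sided match does not distort it; whether that is the better choice depends on the question being asked.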

Liverpool were arguably the opposite of Newcastle: they dominated games but did not see it returned in points.  Luck may play some part in results, but the ability to be clinical in front of goal (Newcastle: 11.5% of shots were goals) or not (Liverpool: 7%) is not some random event.  It is arguably something a manager has little control over on the day itself, but does influence through signings and selection.
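
The conversion figures are simply goals as a percentage of shots. A minimal sketch is below; the goal and shot totals are invented, chosen only so that they reproduce the two percentages quoted, and are not the clubs' real season totals.

```python
def conversion_rate(goals, shots):
    """Goals as a percentage of shots; 0.0 when no shots were taken."""
    if shots == 0:
        return 0.0
    return 100.0 * goals / shots


# Made-up totals over the same number of shots, for comparison only.
clinical = conversion_rate(23, 200)
wasteful = conversion_rate(14, 200)

# Extra goals the more clinical side gets from the same 200 shots.
extra_goals = 23 - 14
```

Holding the shot count equal makes the point of the comparison clearer: the gap in conversion alone is worth a handful of goals over a season, before any difference in chances created.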

Other things of interest: Swansea made more passes than the opposition in 33 of their 38 games, but more final third passes in only 9.  Stoke were the opposite, with just 3 games where they made more passes but 12 where they made more final third passes.

There is an almost infinite number of ways of reformatting the MCFC Analytics dataset, and the output above is only the tip of the iceberg.  Given the amount of data involved, collaboration and sharing of datasets may be the fastest way to gain an overall understanding of the data.

The spreadsheet behind the figures above (which contains a number of other derived metrics including home/away splits) is available at: https://skydrive.live.com/redir?resid=A1BA00769DC2D906!105 along with the Own Goals data and other Premier League related output.

Dan Barnett
Director of Analytics

