Showing posts with label MCFCAnalytics. Show all posts
Showing posts with label MCFCAnalytics. Show all posts

Thursday, 25 October 2012

Why 'everything' is a database


There was a character in the late 90s sketch show ‘Goodness Gracious Me’ who kept annoying his son by claiming that everyone of note came from India:
Da Vinci? Indian. The Queen? Indian. Picasso? Indian.

I have a similar trait to that character except my ubiquitous reference is ‘Database’:
Google? Database. Facebook?  Database. Twitter? Database.

Ultimately all big organisations are doing the same thing, just in slightly different ways: they all collect huge amounts of data with the difference being how they pass that back to users with they key being how they store, manipulate and disseminate.

What’s all this got to do with football?  Well, looking at the MCFC Analytics data I was struck by the similarities between this and the kind of data you might see within a normal customer database, the data is provided at a level of one record per player per match which could be considered to be like items from an order, each order has multiple items and each customer (Team) has multiple orders.

From here the natural step is to turn a load of data into summary views which would provide the starting point of any analysis which in database marketing terms would be:

Single Team View – One record per Team
Single Match View – One record per Match
Single Player View – One record per Player

The insight usually comes not just from aggregating the raw data but from manipulating it to create extra variables which give a greater depth of understanding beyond just totals and averages.

The first one of these I have put together is the single team view, the main part of this is just totalling the details of the individual players (along with the own goals data) but also adding other details added in around each team.

This produces a table of nearly 200 hundred columns, so is fine as a data source but looking at it for any length of time will give you a headache.  The job of any analyst should be to be able to take this and make something more user friendly.

To that end I have produced a summary dataset called single team view summary.xls which is one record for each of the teams which as well as having the usual goals scored/conceded also has some other information which I think is pretty interesting.

Much has been made about Newcastle possibly punching above their weight (i.e., lucky) and possibly in store for a more average season this time.  It’s certainly true that there are a number of stats which suggest they over performed:
  • Newcastle only had more shots than the opposition in 15 of their 38 games around half of the number of teams around them in the table.
The top 4 (plus Chelsea and Liverpool) had more shots than the opposition in the majority of their matches
  • They conceded 2 ‘Big Chances’ for every one ‘Big Chance’ they had (ratio of 0.67 Big Chances created per Big Chance conceded), Chelsea are the only other top half team where the ratio is less than 1.  Where a 'Big Chance' is described as an opportunity where a goal would be expected.
For this metric, the top 4 (plus Everton, Liverpool and Fulham) are the only sides to create more 'Big Chances' than they concede
  • For the majority of their games, Newcastle had fewer passes and fewer final third passes than their opponents where the rest of the top 6 dominated.
The traditional 'Big Six' were the teams that tended to dominate passing (especially final third passes), with Swansea and Stoke being outliers.

Liverpool were arguably the opposite of Newcastle in terms of dominating games but not seeing it returned in points but although luck may play some part in results, the ability to be clinical in front of goal (Newcastle:11.5% of shots were goals) or not (Liverpool: 7%) is not some random event but is arguably something a manager may have little control over on the day itself but does in terms of signings and selection.

Other things of interest were Swansea making more passes than the opposition in 33 of their 38 games, but only more final third passes in 9 games with Stoke being the opposite, having just 3 games where they made more passes but 12 where they made more final third passes.

There are an almost infinite number of ways of reformatting the MCFC Analytics dataset and the output above is only the tip of the iceberg.  Given the amount of data involved it may be that collaboration and sharing of datasets is the fastest way to gain an overall understanding of the data.

The spreadsheet behind the figures above (which contains a number of other derived metrics including home/away splits) is available at: https://skydrive.live.com/redir?resid=A1BA00769DC2D906!105 along with the Own Goals data and other Premier League related output.

Dan Barnett
Director of Analytics


Wednesday, 12 September 2012

MCFC Analytics - Some thoughts about data


The release by Opta/Manchester City of player data for the 2011/12 season is something that could potentially open up a whole new area of analytics to the wider public which previously would have been restricted to those working at the clubs.

More details are available here but essentially what has been released is a dataset of one record, per player per match for all games of the 2011/12 season with details such as goals scored, passes attempted/made etc.,

This post is more concerned with the data aspect of the project than with the practical application of the data (which will come in later posts and on my Swansea City blog www.wearepremierleague.com).

The raw supplied Excel file contains 10,369 records (excluding column headings) and 210 columns, so even though it’s just a summary of a players activity within a game it’s still a sizeable file.

Most of the initial toying of the data I have done with Excel (in particular pivot tables), but I’m using Access for the more detailed manipulation as often easier to manipulate data in a database rather than a spreadsheet. 

Below are a few details around changes and derivations I have made from the initial file.  Apparently over 5,000 people have requested the file.  This shows a huge level of interest but also means that without a sufficiently quick feedback loop for the data, there will be a lot of people doing the same sort of processing that could have just been done once and also leaves the data open to different interpretations rather than one true set of metrics.
Own Goals
An example of this is that the dataset doesn’t directly contain information on own goals, it would be an easy mistake (as I did initially) to think you could just sum the total of the ‘Goals’ column to get total goals scored by Team.

What the data does have however is the total goals conceded by the Goalkeeper on the pitch at that time so if you know the number of goals the opposition team has conceded in a game and the number of goals ‘your’ team has scored then:

Goals for your team coming from opposition own goals = Total Goals Conceded by Opposition in match – Total Goals Scored by your team

To do this I have created a summary table of one record per team per match from the initial data, with the image below showing some of the fields for the Swansea – Chelsea game where Neil Taylor of Swansea scored an own goal:



Where Total_Goals_exc_own_goals is the sum of the ‘Goals’ field in the raw dataset.

I then created a second table which has details of goals conceded per team:



From these two tables you can see that no Chelsea player scored a goal in this game but that Swansea conceded one.

I then updated a ‘Total Goals Scored’ field in the first table by matching the ‘Team’ in the original table to the ‘Opposition’ in the second table and also the ‘Opposition’ in the first table with the ‘Team’ in the second.  

As a extra measure in case at some point in the future the data has more than one match where Swansea were home to Chelsea I also matched on date.

This then gives the following information:





From this 1 record per team per match summary, you can then create an overall summary of goals scored:





















This is interesting as much was made of Liverpool's 'bad luck' in hitting the woodwork so many times last season, but not heard as much about their 'good luck' regarding own goals.

Derived Fields
In addition to the fields supplied by Opta, it’s likely that you’d want a number of extra derived fields added, as mentioned previously it could be beneficial to have a process where there is a latest approved version of the file available for people to use that has a number of agreed extra fields to avoid everyone having to create these themselves.

One example of this would be having a ‘Total Shots’ field as with the raw data there is no total but only the constituent parts (On Target/Off Target/Blocked).  

As with anything of this nature there’s the balance between everyone using consistent definitions/data and the fact that extra fields means bigger file sizes.

Another example of a derived field might be having a standardised name format: If you want to be able to filter by name, it makes more sense to have the name in a single field rather than having forename and surname separately.  It also removes the strange anomaly in the data that ‘Adam Johnson’ is listed in the surname field rather than ‘Adam’ as forename and ‘Johnson’ as Surname.

There are a few cases where a player is genuinely only known by one name (e.g., Alex at Chelsea) so using excel/access we can create a Player Name field by taking Forename and Surname where both supplied or just Surname where only Surname supplied.

This however doesn’t create a unique field to filter on as there was a Paul Robinson at both Bolton and Blackburn last season and also cases where the same player played for multiple clubs; the easiest way around this is to add the players club on to the name when creating the field that will be used for filtering by name.  This gives the option to then filter by person or by person at a specific club.

Also, the raw data contains a player ID which you can use to differentiate between where it’s the same person for two teams or two different players.

Metadata
The raw dataset contains only instances where a player gets on the pitch so needs a bit of rejigging to fill in any potential blanks.

I’ve done this by using the raw data to create a summary table of all matches (grouping the table by Team/Opposition/Venue/Date e.g.




This then gives a list of all the fixtures for all the teams (20 teams playing 38 matches = 760 records).

The next step was to create a deduplicated list of all players for each team e.g., Joshua (Josh) McEachran has an entry for both Chelsea and Swansea.  This gives a list of 561 players, matching this to the fixture table (matching by team) gives a total of 21,318 records (561 players for each of 38 matches).

This gives the ability to create a dataset which includes details of where a player takes no part in a game such as the chart below showing shots by game for Wayne Rooney.  The blanks are where he didn’t play (as opposed to the zero values which are where he had no shots).














All of the above is still just scratching the surface (even before the more detailed release of within game player actions) but hopefully begins to make the point about creating open source (and approved) modified datasets to avoid large scale duplication of work as well as issues around differing definitions.

Update - 14th Sep.

I have now created a spreadsheet in the same format as the original dataset which has 40 records containing own goals data.  If you add this to the original spreadsheet then the 'Goals' and 'Goals Conceded' totals will now tally.

Spreadsheet is available at: https://skydrive.live.com/redir?resid=A1BA00769DC2D906!105

Dan Barnett

Director of Analytics