Wednesday 12 September 2012

MCFC Analytics - Some thoughts about data


The release by Opta/Manchester City of player data for the 2011/12 season is something that could potentially open up a whole new area of analytics to the wider public which previously would have been restricted to those working at the clubs.

More details are available here but essentially what has been released is a dataset of one record, per player per match for all games of the 2011/12 season with details such as goals scored, passes attempted/made etc.,

This post is more concerned with the data aspect of the project than with the practical application of the data (which will come in later posts and on my Swansea City blog www.wearepremierleague.com).

The raw supplied Excel file contains 10,369 records (excluding column headings) and 210 columns, so even though it’s just a summary of a players activity within a game it’s still a sizeable file.

Most of the initial toying of the data I have done with Excel (in particular pivot tables), but I’m using Access for the more detailed manipulation as often easier to manipulate data in a database rather than a spreadsheet. 

Below are a few details around changes and derivations I have made from the initial file.  Apparently over 5,000 people have requested the file.  This shows a huge level of interest but also means that without a sufficiently quick feedback loop for the data, there will be a lot of people doing the same sort of processing that could have just been done once and also leaves the data open to different interpretations rather than one true set of metrics.
Own Goals
An example of this is that the dataset doesn’t directly contain information on own goals, it would be an easy mistake (as I did initially) to think you could just sum the total of the ‘Goals’ column to get total goals scored by Team.

What the data does have however is the total goals conceded by the Goalkeeper on the pitch at that time so if you know the number of goals the opposition team has conceded in a game and the number of goals ‘your’ team has scored then:

Goals for your team coming from opposition own goals = Total Goals Conceded by Opposition in match – Total Goals Scored by your team

To do this I have created a summary table of one record per team per match from the initial data, with the image below showing some of the fields for the Swansea – Chelsea game where Neil Taylor of Swansea scored an own goal:



Where Total_Goals_exc_own_goals is the sum of the ‘Goals’ field in the raw dataset.

I then created a second table which has details of goals conceded per team:



From these two tables you can see that no Chelsea player scored a goal in this game but that Swansea conceded one.

I then updated a ‘Total Goals Scored’ field in the first table by matching the ‘Team’ in the original table to the ‘Opposition’ in the second table and also the ‘Opposition’ in the first table with the ‘Team’ in the second.  

As a extra measure in case at some point in the future the data has more than one match where Swansea were home to Chelsea I also matched on date.

This then gives the following information:





From this 1 record per team per match summary, you can then create an overall summary of goals scored:





















This is interesting as much was made of Liverpool's 'bad luck' in hitting the woodwork so many times last season, but not heard as much about their 'good luck' regarding own goals.

Derived Fields
In addition to the fields supplied by Opta, it’s likely that you’d want a number of extra derived fields added, as mentioned previously it could be beneficial to have a process where there is a latest approved version of the file available for people to use that has a number of agreed extra fields to avoid everyone having to create these themselves.

One example of this would be having a ‘Total Shots’ field as with the raw data there is no total but only the constituent parts (On Target/Off Target/Blocked).  

As with anything of this nature there’s the balance between everyone using consistent definitions/data and the fact that extra fields means bigger file sizes.

Another example of a derived field might be having a standardised name format: If you want to be able to filter by name, it makes more sense to have the name in a single field rather than having forename and surname separately.  It also removes the strange anomaly in the data that ‘Adam Johnson’ is listed in the surname field rather than ‘Adam’ as forename and ‘Johnson’ as Surname.

There are a few cases where a player is genuinely only known by one name (e.g., Alex at Chelsea) so using excel/access we can create a Player Name field by taking Forename and Surname where both supplied or just Surname where only Surname supplied.

This however doesn’t create a unique field to filter on as there was a Paul Robinson at both Bolton and Blackburn last season and also cases where the same player played for multiple clubs; the easiest way around this is to add the players club on to the name when creating the field that will be used for filtering by name.  This gives the option to then filter by person or by person at a specific club.

Also, the raw data contains a player ID which you can use to differentiate between where it’s the same person for two teams or two different players.

Metadata
The raw dataset contains only instances where a player gets on the pitch so needs a bit of rejigging to fill in any potential blanks.

I’ve done this by using the raw data to create a summary table of all matches (grouping the table by Team/Opposition/Venue/Date e.g.




This then gives a list of all the fixtures for all the teams (20 teams playing 38 matches = 760 records).

The next step was to create a deduplicated list of all players for each team e.g., Joshua (Josh) McEachran has an entry for both Chelsea and Swansea.  This gives a list of 561 players, matching this to the fixture table (matching by team) gives a total of 21,318 records (561 players for each of 38 matches).

This gives the ability to create a dataset which includes details of where a player takes no part in a game such as the chart below showing shots by game for Wayne Rooney.  The blanks are where he didn’t play (as opposed to the zero values which are where he had no shots).














All of the above is still just scratching the surface (even before the more detailed release of within game player actions) but hopefully begins to make the point about creating open source (and approved) modified datasets to avoid large scale duplication of work as well as issues around differing definitions.

Update - 14th Sep.

I have now created a spreadsheet in the same format as the original dataset which has 40 records containing own goals data.  If you add this to the original spreadsheet then the 'Goals' and 'Goals Conceded' totals will now tally.

Spreadsheet is available at: https://skydrive.live.com/redir?resid=A1BA00769DC2D906!105

Dan Barnett

Director of Analytics