Showing posts with label Opta. Show all posts

Friday, 4 April 2014

Football: Big Data and Small Data problems

Last week I attended the Sports Analytics Innovation Summit held at the Emirates which had a range of speakers talking about the use of data in a variety of sports and different areas such as performance, psychology and fitness. 

In his review Sky Sports' Adam Bate pretty much hits the nail on the head in that you need to actually apply the data not just collect it for the sake of it. There's no doubt that the use (or at least collection) of data in football is becoming increasingly prevalent but it's the 'So What' factor that's critical. 

I liked the story from a few months ago of Forest Green Rovers, a Conference club, ditching Prozone, not least for the use of the word 'malarkey' in the local newspaper headline (I imagine them outside the ground with a 'Down with this sort of thing' placard against the use of modern technology in sport).
Top 2 results for 'Forest Green Prozone' on Google: a 'professional step' in Feb '13, but binned by the new boss in December after manager Dave Hockaday left 'by mutual consent' in October.
Some of the quotes by the new manager Ady Pennock make him look like quite a traditionalist:

"I am a great believer in what I see and my eyes don’t lie, so I don’t need a bit of paper...The most important stat is the scoreline and I don’t want Prozone for the sake of having it"
It might seem a bit backward but I'd rather someone had the courage of their convictions than spent money on a product that gathers dust just because 'it's what the elite clubs do'.

Clubs are at the stage now where they are facing both 'Big Data' and 'Small Data' problems and it's how they deal with these that will determine the level of advantage they get over their peers.  

Analytics is just another enabler like better training facilities, diet, sleep patterns etc. There's no magic solution, but small gains can make big differences to final outcomes.

Big Data Problems
As a Data Analyst, I probably hear or read the phrase 'Big Data' a dozen times a day (almost as many times as I have to watch a presentation that has a YouTube clip of Moneyball included) and a lot of the time it's used like its predecessor CRM (Customer Relationship Management) as a buzzword to try and sell you something you don't really need or to make something that's relatively mundane sound a bit more interesting.

Ultimately it comes down to the fact that it costs very little to store information and it has become increasingly easy to capture it, so there is the desire to capture as many things as possible as frequently as possible, regardless of whether it has any real value.

The most obvious example of 'Big Data' in football would be Prozone, where each player (and the ball, I'm assuming) is tracked 10 times a second (some systems in sports such as the NBA track 25 times a second), so a 95 minute match including injury time gives around 1.3m records.  It's not small but it's nothing compared to what a web company may store.
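As a rough check on that figure, the arithmetic can be sketched as below. The count of 23 tracked objects (22 players plus the ball) is an assumption for illustration; it ignores substitutes and officials.

```python
# Back-of-envelope estimate of tracking records per match.
# Assumption: 22 players plus the ball are tracked; substitutes and officials are ignored.
SAMPLES_PER_SECOND = 10
MATCH_MINUTES = 95
TRACKED_OBJECTS = 22 + 1


def records_per_match(samples_per_second, minutes, tracked_objects):
    """Return the number of positional records captured in one match."""
    return samples_per_second * minutes * 60 * tracked_objects


total_records = records_per_match(SAMPLES_PER_SECOND, MATCH_MINUTES, TRACKED_OBJECTS)
print(f"{total_records:,}")  # 1,311,000
```

At 25 samples a second the same sum comes to roughly 3.3m records per match.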

Even the most data-savvy manager is not going to want to wade through that much data, so it'll be the job of Performance Analysts (and Prozone themselves) to try and gain insight from it. Naturally the first stop will be top level metrics like top speed, distance run, #sprints etc., but it'll be the ability to go beyond this and interrogate the data in more detail that'll make the difference. That is probably where the new Forest Green Rovers manager is coming from: if you haven't got the resource to even scratch the surface of what the data could tell you, what's the point in having it?

Similarly for training data, you may have GPS data, heart rate, saliva, sleep diaries but you need to be able to go from a bunch of data to something that can change what you do with players.

Small Data Problems
Football also suffers from 'Small Data' problems in terms of small sample sizes both in terms of number of matches and number of players involved.  If one player scores 10 goals in a season and another 13 which one is the better one?  Even if you factor in things like expected goals (chance of any shot being scored so an effort from 6 yards is different to one from 40 yards), you're still going to be left with a fair amount of doubt as to which will perform better next season even if any estimate is far better than just guesswork.
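To put a rough number on that doubt, here is a minimal sketch that assumes goals in a season follow a Poisson distribution and that both players have exactly the same underlying scoring rate. The rate of 11.5 goals a season is a made-up figure chosen to sit between 10 and 13, not something taken from real data.

```python
import math


def poisson_pmf(k, rate):
    """Probability of exactly k goals when goals follow a Poisson distribution."""
    return math.exp(-rate) * rate ** k / math.factorial(k)


def prob_gap_at_least(gap, rate, max_goals=60):
    """P(|X - Y| >= gap) for two independent players sharing the same scoring rate."""
    pmf = [poisson_pmf(k, rate) for k in range(max_goals + 1)]
    return sum(
        pmf[x] * pmf[y]
        for x in range(max_goals + 1)
        for y in range(max_goals + 1)
        if abs(x - y) >= gap
    )


# Hypothetical: both players 'really' average 11.5 goals a season.
p_gap = prob_gap_at_least(3, 11.5)
print(round(p_gap, 2))
```

Under these assumptions a gap of three or more goals between two equally good players turns up more often than not, which is the 'Small Data' problem in a nutshell.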

Where there is doubt, there is the overwhelming temptation to not even try and be scientific and just go on 'gut feel' which comes back to the '..my eyes don't lie' comment, even though it's incredibly difficult to be 100% objective.

One obvious issue is that of confirmation bias.  As a Swansea fan, a good example for me is Dwight Tiendalli, but more well known examples would be Tom Cleverley or Martin Demichelis: they are expected to be terrible, so every bad pass or missed tackle is seen as confirmation of this and any good play is conveniently ignored.  This isn't to say that people's opinions are necessarily wrong overall, just that in any given match, the presumption of failure is already present before kick-off.


There's been plenty of talk recently about over-playing players and injury. A couple of weeks ago I had a look at the link between playing time and hamstring injuries for a few high profile players; one of the charts looks at Mesut Ozil's playing minutes and injuries:
Ozil's playing minutes over the previous 7/14/21 days along with injury activity, was overwork after return from injury in Feb responsible for injury in March?
The problem here is that there is generally too little data (especially publicly) to have any real knowledge as to cause and effect (there's always the risk you're looking backwards for possible factors once someone is injured, ignoring the times they or others exhibited the same activity but didn't get injured).

You may have a small pool of players who have started playing again after a relatively minor injury, but how many of them then follow the same playing schedule as Ozil, with a similar playing style in terms of distance run, sprints etc. and a similar physique?  It was interesting to see some of the doctors presenting at the conference talking about pooling data (anonymously), which would improve the situation, and this is taking place at some levels within UEFA.

There's also the issue around short-termism: it may well be that a particular strategy/approach is the best over a longer period of time (e.g., limiting a player's match time) but a lot of the time anything more than a fortnight away might be classified as 'long-term planning'.

If you imagine Wenger deciding whether or not to start Ozil against Bayern on the Tuesday after playing him for 90 minutes against Everton the previous Saturday: at what level of likelihood of injury would he decide not to play him? 5%, 10%, 20%, 50%? And when he does get injured, is that particular instance bad luck or bad planning?

Just buying a piece of software won't solve your problems, but just as surely it's impossible for any one person to be able to collate, retain, process and analyse all information that may be useful in creating better performance.

Analytics is taken most seriously at the elite clubs but I'd argue that the greatest incremental benefit would be for Championship (or ambitious League 1) clubs: there's enough money there for it to be worthwhile and you're also more likely to be doing something different to your peers.

Overall, clubs are faced with the dual issues of having 'too much' data in some instances and 'not enough' in others. This is where the skill of an analyst comes in who can process the data, find the insights and present back in a way that actions are actually taken off the back of the data.  They'll also know the difference between what's interesting, what's actually important and what is just noise.

For me, a performance analyst and a statistician need to work together: one brings the technical/programming/statistical skills needed to work the data, the other is more closely involved in applying any findings (if you can find someone who can do both, fantastic, make sure they never leave).

I'm naturally biased given my background, but the obvious solution for me is for clubs to bring in Data Analysts for the off-field work (Season Ticket Analysis, Club Shop, Social Media Analysis) but to free up some of their time to look at on-field data.  This way you have someone 'paying their way' even before they get to the football data.  I'd argue they'd probably contribute a greater amount to the bottom line if working fully on football data but that may be too much of a leap of faith for some clubs, at least for now.  

Other Posts:
Match Predictions: Are you Smarter than Lawro 

--
Dan Barnett

Director of Analytics
Analysis Marketing Ltd


Twitter: @analysismktg 

Wednesday, 23 January 2013

Comparison of Opta stats providers

It may just be self-selection based on the kind of things I read, but there seems to be an ever growing interest in data in football and the subject appears to be moving away from the niche into the mainstream with increasing mentions in the press such as a recent article in The Guardian.

This is partly due to more and more sites making use of data in football, in particular from Opta.  In this post I'll look at the pros/cons of a number of sites/apps that use Opta data and their comparative strengths and weaknesses.

When Swansea City reached the Premier League with promotion in May 2011, I decided to set up the blog www.wearepremierleague.com to combine my interest in stats with that of the Swans. Generally speaking there is a paucity of (publicly available) data around activity in lower leagues - although credit must go to Ben Mayhew for his attempt to rectify this at Experimental 361. The level of detail publicly available for the top leagues in Europe however is still far beyond that in the Championship and below.

Guardian Chalkboards
When I started out, this was one of the few resources about and had the advantage of being free and web (not app) based.  I won't go into too much detail about it as it's sadly no more (possibly ahead of its time?) but the thing I liked most about it was being able to visualise the activity with regard to where on the pitch it took place.

The image below shows a Swansea goal against Blackburn where every Swansea player touched the ball in the move.
The addition of squad numbers to activity gives a level of detail not available anywhere else I've looked
Stats Zone
The demise of Guardian Chalkboards a couple of months into the season was the nudge I needed to get an iPod Touch to be able to use the Stats Zone app.

Stats Zone is great for both looking at the top level stats (e.g., Shots per Team) or delving in to the detail of a particular match (e.g., Long passes by a particular player).
Example of a Stats Zone Screen shot, in this case comparing the Aerial Duel activity of Peter Crouch with the Stoke team as a whole
Stats Zone is produced in conjunction with FourFourTwo magazine and their website includes blogs produced by Opta and Zonal Marking and others.

I combined my interest in football with that of data analysis in the creation of a Premier League Review dashboard, which is an interactive presentation taking a number of images from Stats Zone.

WhoScored.com
Whenever Swansea are linked with a particular player (usually from La Liga), WhoScored is the first site I go to as it has in-depth details for any player across the major European leagues:
WhoScored has details both on overall activity for that season as well as the ability to drill down in to activity for a particular game
WhoScored also has a fairly comprehensive list of stats for any particular match, with the ability to order ascending/descending on these metrics for each player within a team (Long Balls, Chances Created etc.), and also blogs from a number of respected writers.

Squawka Sports
Squawka.com is to some extent a cross between Stats Zone and WhoScored in that you can look at activity of individual players across the season as a whole, but also look at specific types of actions graphically for a specific player in a particular match e.g., Canas' passes vs. Malaga
Squawka goes for a dashboard approach for presenting a lot of its data
EPLIndex.com
The level of detail available in the sites/app mentioned above will be enough for the majority of people but for those wanting even more, there is the pay-for site EPLIndex.com (£3.95 a month/£40 a year) which has even more detail.

Where WhoScored for example might have total passes and pass accuracy, EPL Index will break this down even further e.g., Passes/Accurate passes in Own Half/Attacking Half/Final Third:
EPL Index Screenshot - huge amount of data across numerous tabs
The level of detail of this data is pretty much the same as the release from Opta/Manchester City of the summary stats for the 2011/12 season, just not in a single spreadsheet.

One of the other advantages of EPL Index is that it has data for multiple seasons making comparisons such as one I did recently comparing Danny Graham and Kenwyne Jones possible:
Example of the kind of thing it's possible to collate using data supplied by EPL Index
As well as the option of subscribing to stats, for those who just want to read about stats and football the site has an ever growing number of authors who use the data to write and publish their own analysis, to a depth that is arguably rarely seen anywhere else.

Relative Strengths and Weaknesses of each source

Stats Zone - Strengths:
  • Ability to visualise activity e.g., location of Shots/Interceptions etc., 
  • Includes simple top level summaries e.g., total tackles made ordered by all players not split by team as is the case in the other sources
  • Ability to drill into data within the game e.g., compare first 62 minutes with last 28
  • Ability to create bespoke comparisons across matches/teams e.g., Chances made by John Walters in first 30 minutes vs. Aston Villa compared to Chances made by Stoke vs. West Brom  
Stats Zone - Weaknesses:
  • Apple devices only - no Android or Web version
  • Lacks ability to see multiple stats simultaneously e.g., Tackles/Passes/Shots per player
  • Doesn't have stats collated across a season

WhoScored - Strengths:
  • Includes data on all major European Leagues and Champions League
  • Easiest site to navigate around between stats for Team/Player/Match
  • Best for comparing statistics across teams, form/shots per game
WhoScored - Weaknesses:
  • Little visualisation of data - there is a nice image of shot areas but not the chalkboards such as those from Stats Zone/Squawka
  • No ability to analyse activity within a game e.g., compare 1st and 2nd half stats

Squawka - Strengths:
  • Has ability to easily track metrics for a team or player for a single match or across season
  • Includes heat maps of activity by player/team
  • Ability to drill down within part of the game (currently 5 minute intervals)
  • Lots of charts as well as raw data, multiple options for visualising the same data
Squawka - Weaknesses:
  • Doesn't have the same level of detail of stats readily available as other sites although only likely to bother the really in-depth user
  • Good to have charts but some could be better e.g., if a player has played in 15 of 22 league games only stats for those 15 shown.  Personally would like to see the blanks to know where over the season that player hasn't featured
  • Stats Zone plots chalkboards from the point of view of the team you're analysing attacking from left to right, Squawka plots them with the Home team playing from left to right, which can be annoying when trying to compare areas of attack/passing

EPL Index - Strengths:
  • Most in-depth of any of the data sources
  • Has league data going back to 2008/9 season
  • Top-Stats feature gives ability to find best players across a range of metrics with ability to filter by those playing at least x minutes in a game or total minutes across a season (e.g., avoids problem of someone coming top in pass completion % with 1 pass from 1 attempt)
EPL Index - Weaknesses:
  • Pay-for site
  • No ability to analyse activity within a game
  • Generally best thought of as a source of data from which you create something yourself 

Turning Data into Insight
Although each of these companies is taking the same (or at least similar) data from Opta, it can be seen that they have each used it in different ways and are all still improving as time goes on. Eventually I'd imagine that one of these sites (or a newer entrant such as Sky) will bring all these parts together, possibly also including video for a complete experience.

As an example, a lot has been made recently about David de Gea pushing balls back into dangerous areas when he makes a save. The raw data will only tell you so much, but being able to view all his saves, or just the saves where there is a goal in the subsequent 10 seconds, would give an even more detailed picture.
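The 'goal in the subsequent 10 seconds' filter could look something like the sketch below. The event feed, its field layout and the timings are all hypothetical; the summary data discussed in this post doesn't include within-game timestamps.

```python
# Hypothetical event feed for one match: (seconds_elapsed, team, event_type).
events = [
    (310, "Man Utd", "save"),
    (316, "Opposition", "goal"),
    (1200, "Man Utd", "save"),
    (2400, "Opposition", "goal"),
    (3000, "Man Utd", "save"),
    (3011, "Opposition", "goal"),
]


def saves_followed_by_goal(events, keeper_team, window_seconds=10):
    """Return the times of saves after which the other team scores within the window."""
    goals_against = [t for t, team, kind in events if kind == "goal" and team != keeper_team]
    return [
        t
        for t, team, kind in events
        if kind == "save"
        and team == keeper_team
        and any(0 < goal_time - t <= window_seconds for goal_time in goals_against)
    ]


costly_saves = saves_followed_by_goal(events, "Man Utd")
print(costly_saves)  # [310]
```

The resulting list of timestamps is exactly what you'd need to pull the matching video clips.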

TV rights are far too precious to be given away but the ability to create your own highlights package (e.g., All chances created by Pablo Hernandez, with approx 10 seconds of footage per chance created) could take interactive entertainment to a new level.

Other Posts:  Man City and Twitter, Twitter and Bookies - A Case Study, Premier League Weekly Review

Thursday, 25 October 2012

Why 'everything' is a database


There was a character in the late 90s sketch show ‘Goodness Gracious Me’ who kept annoying his son by claiming that everyone of note came from India:
Da Vinci? Indian. The Queen? Indian. Picasso? Indian.

I have a similar trait to that character except my ubiquitous reference is ‘Database’:
Google? Database. Facebook?  Database. Twitter? Database.

Ultimately all big organisations are doing the same thing, just in slightly different ways: they all collect huge amounts of data, with the difference being how they pass that back to users and the key being how they store, manipulate and disseminate it.

What’s all this got to do with football?  Well, looking at the MCFC Analytics data I was struck by the similarities between this and the kind of data you might see within a normal customer database. The data is provided at a level of one record per player per match, which could be considered to be like items from an order: each order (Match) has multiple items and each customer (Team) has multiple orders.

From here the natural step is to turn a load of data into summary views which would provide the starting point of any analysis which in database marketing terms would be:

Single Team View – One record per Team
Single Match View – One record per Match
Single Player View – One record per Player
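The three views above are all the same roll-up applied at a different level. A minimal sketch, assuming rows shaped like the raw file (one record per player per match); the field names and the figures in the sample rows are made up for illustration:

```python
from collections import defaultdict

# Hypothetical rows: one record per player per match, like items on an order.
player_match_rows = [
    {"team": "Swansea", "match": "SWA-CHE", "player": "A", "goals": 1, "passes": 40},
    {"team": "Swansea", "match": "SWA-CHE", "player": "B", "goals": 0, "passes": 55},
    {"team": "Swansea", "match": "SWA-STK", "player": "A", "goals": 2, "passes": 38},
    {"team": "Stoke", "match": "SWA-STK", "player": "C", "goals": 0, "passes": 20},
]


def summarise(rows, key_fields, value_fields=("goals", "passes")):
    """Roll rows up to one record per distinct combination of key_fields."""
    totals = defaultdict(lambda: dict.fromkeys(value_fields, 0))
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        for field in value_fields:
            totals[key][field] += row[field]
    return dict(totals)


single_team_view = summarise(player_match_rows, ["team"])
single_match_view = summarise(player_match_rows, ["match"])
single_player_view = summarise(player_match_rows, ["team", "player"])
print(single_team_view[("Swansea",)])  # {'goals': 3, 'passes': 133}
```

In Access the equivalent would be a GROUP BY query; the point is only that one raw table feeds all three summaries.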

The insight usually comes not just from aggregating the raw data but from manipulating it to create extra variables which give a greater depth of understanding beyond just totals and averages.

The first one of these I have put together is the single team view. The main part of this is just totalling the details of the individual players (along with the own goals data), with other details added in around each team.

This produces a table of nearly 200 columns, so is fine as a data source but looking at it for any length of time will give you a headache.  The job of any analyst should be to take this and make something more user friendly.

To that end I have produced a summary dataset called single team view summary.xls which is one record for each of the teams which as well as having the usual goals scored/conceded also has some other information which I think is pretty interesting.

Much has been made about Newcastle possibly punching above their weight (i.e., lucky) and possibly in store for a more average season this time.  It’s certainly true that there are a number of stats which suggest they over performed:
  • Newcastle had more shots than the opposition in only 15 of their 38 games, around half the number managed by the teams around them in the table.
The top 4 (plus Chelsea and Liverpool) had more shots than the opposition in the majority of their matches
  • They conceded around three ‘Big Chances’ for every two they created (a ratio of 0.67 Big Chances created per Big Chance conceded); Chelsea are the only other top half team where the ratio is less than 1.  A 'Big Chance' is defined as an opportunity where a goal would be expected.
For this metric, the top 4 (plus Everton, Liverpool and Fulham) are the only sides to create more 'Big Chances' than they concede
  • For the majority of their games, Newcastle had fewer passes and fewer final third passes than their opponents where the rest of the top 6 dominated.
The traditional 'Big Six' were the teams that tended to dominate passing (especially final third passes), with Swansea and Stoke being outliers.

Liverpool were arguably the opposite of Newcastle, dominating games but not seeing it returned in points. Luck may play some part in results, but the ability to be clinical in front of goal (Newcastle: 11.5% of shots were goals) or not (Liverpool: 7%) is not some random event. A manager may have little control over it on the day itself but does have control through signings and selection.

Other things of interest were Swansea making more passes than the opposition in 33 of their 38 games, but only more final third passes in 9 games with Stoke being the opposite, having just 3 games where they made more passes but 12 where they made more final third passes.
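Counts like these are simple to derive once there is one record per team per match. A minimal sketch, using made-up pass totals rather than figures from the dataset:

```python
def games_with_more(match_pairs):
    """Count matches where the team's value exceeded the opposition's."""
    return sum(1 for own, opposition in match_pairs if own > opposition)


# Hypothetical (team passes, opposition passes) for four matches; not real data.
example_passes = [(520, 310), (480, 495), (610, 402), (350, 350)]
more_passes_count = games_with_more(example_passes)
print(more_passes_count)  # 2
```

Note that a tie counts as 'not more', which matters for low-count metrics such as shots or Big Chances.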

There are an almost infinite number of ways of reformatting the MCFC Analytics dataset and the output above is only the tip of the iceberg.  Given the amount of data involved it may be that collaboration and sharing of datasets is the fastest way to gain an overall understanding of the data.

The spreadsheet behind the figures above (which contains a number of other derived metrics including home/away splits) is available at: https://skydrive.live.com/redir?resid=A1BA00769DC2D906!105 along with the Own Goals data and other Premier League related output.

Dan Barnett
Director of Analytics


Wednesday, 12 September 2012

MCFC Analytics - Some thoughts about data


The release by Opta/Manchester City of player data for the 2011/12 season is something that could potentially open up a whole new area of analytics to the wider public which previously would have been restricted to those working at the clubs.

More details are available here but essentially what has been released is a dataset of one record per player per match for all games of the 2011/12 season, with details such as goals scored, passes attempted/made etc.

This post is more concerned with the data aspect of the project than with the practical application of the data (which will come in later posts and on my Swansea City blog www.wearepremierleague.com).

The raw supplied Excel file contains 10,369 records (excluding column headings) and 210 columns, so even though it’s just a summary of a player’s activity within a game it’s still a sizeable file.

Most of the initial toying with the data I have done in Excel (in particular pivot tables), but I’m using Access for the more detailed manipulation as it’s often easier to manipulate data in a database than in a spreadsheet.

Below are a few details around changes and derivations I have made from the initial file.  Apparently over 5,000 people have requested the file.  This shows a huge level of interest but also means that without a sufficiently quick feedback loop for the data, there will be a lot of people doing the same sort of processing that could have just been done once and also leaves the data open to different interpretations rather than one true set of metrics.
Own Goals
An example of this is that the dataset doesn’t directly contain information on own goals. It would be an easy mistake (one I made initially) to think you could just sum the ‘Goals’ column to get total goals scored by Team.

What the data does have however is the total goals conceded by the Goalkeeper on the pitch at that time so if you know the number of goals the opposition team has conceded in a game and the number of goals ‘your’ team has scored then:

Goals for your team coming from opposition own goals = Total Goals Conceded by Opposition in match – Total Goals Scored by your team

To do this I have created a summary table of one record per team per match from the initial data, with the image below showing some of the fields for the Swansea – Chelsea game where Neil Taylor of Swansea scored an own goal:



Where Total_Goals_exc_own_goals is the sum of the ‘Goals’ field in the raw dataset.

I then created a second table which has details of goals conceded per team:



From these two tables you can see that no Chelsea player scored a goal in this game but that Swansea conceded one.
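The own goals formula applied to this game can be written as a small function. The function and parameter names are only illustrative; the two input values are the ones quoted above (no Chelsea player scored, Swansea conceded one).

```python
def own_goals_in_favour(goals_conceded_by_opposition, goals_scored_by_own_players):
    """Goals credited to a team that came from opposition own goals."""
    return goals_conceded_by_opposition - goals_scored_by_own_players


# Chelsea's view of the Swansea game: Swansea conceded 1, Chelsea players scored 0.
chelsea_own_goals_for = own_goals_in_favour(1, 0)
print(chelsea_own_goals_for)  # 1
```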

I then updated a ‘Total Goals Scored’ field in the first table by matching the ‘Team’ in the original table to the ‘Opposition’ in the second table and also the ‘Opposition’ in the first table with the ‘Team’ in the second.  

As an extra measure, in case at some point in the future the data has more than one match where Swansea were home to Chelsea, I also matched on date.
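I did the matching in Access, but the same logic can be sketched in a few lines. This is an illustration only: the field names are invented, and the date is a placeholder string rather than the real match date.

```python
# Hypothetical one-record-per-team-per-match tables.
scored = [
    {"team": "Chelsea", "opposition": "Swansea", "date": "date-1", "goals_exc_own_goals": 0},
]
conceded = [
    {"team": "Swansea", "opposition": "Chelsea", "date": "date-1", "goals_conceded": 1},
]


def add_total_goals(scored_rows, conceded_rows):
    """Match Team to Opposition (and vice versa) plus date, then copy goals conceded across."""
    lookup = {
        (row["opposition"], row["team"], row["date"]): row["goals_conceded"]
        for row in conceded_rows
    }
    return [
        {
            **row,
            # None when the opponent's row is missing, rather than a misleading zero.
            "total_goals_scored": lookup.get((row["team"], row["opposition"], row["date"])),
        }
        for row in scored_rows
    ]


team_match_summary = add_total_goals(scored, conceded)
print(team_match_summary[0]["total_goals_scored"])  # 1
```

Keying on all three fields is what protects against the same fixture appearing twice in a larger dataset.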

This then gives the following information:





From this 1 record per team per match summary, you can then create an overall summary of goals scored:





















This is interesting as much was made of Liverpool's 'bad luck' in hitting the woodwork so many times last season, but I've not heard as much about their 'good luck' regarding own goals.

Derived Fields
In addition to the fields supplied by Opta, it’s likely that you’d want a number of extra derived fields added. As mentioned previously, it could be beneficial to have a process where there is a latest approved version of the file available for people to use that has a number of agreed extra fields, to avoid everyone having to create these themselves.

One example of this would be having a ‘Total Shots’ field as with the raw data there is no total but only the constituent parts (On Target/Off Target/Blocked).  
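As a sketch, the derived field is just the sum of the three parts. The field names below are placeholders and the raw file's actual column headings may well differ:

```python
def total_shots(record):
    """Derive Total Shots from its constituent parts (On Target/Off Target/Blocked)."""
    return record["shots_on_target"] + record["shots_off_target"] + record["blocked_shots"]


# Hypothetical player-match record with made-up values.
example_record = {"shots_on_target": 2, "shots_off_target": 3, "blocked_shots": 1}
print(total_shots(example_record))  # 6
```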

As with anything of this nature there’s a balance between everyone using consistent definitions/data and the fact that extra fields mean bigger file sizes.

Another example of a derived field might be having a standardised name format: If you want to be able to filter by name, it makes more sense to have the name in a single field rather than having forename and surname separately.  It also removes the strange anomaly in the data that ‘Adam Johnson’ is listed in the surname field rather than ‘Adam’ as forename and ‘Johnson’ as Surname.

There are a few cases where a player is genuinely only known by one name (e.g., Alex at Chelsea), so using Excel/Access we can create a Player Name field by taking Forename and Surname where both are supplied, or just Surname where only Surname is supplied.

This however doesn’t create a unique field to filter on, as there was a Paul Robinson at both Bolton and Blackburn last season and also cases where the same player played for multiple clubs; the easiest way around this is to add the player’s club on to the name when creating the field that will be used for filtering by name.  This gives the option to then filter by person or by person at a specific club.

Also, the raw data contains a player ID which you can use to differentiate between where it’s the same person for two teams or two different players.
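The naming rules above translate into something like the following sketch. The function names and the 'Name (Club)' format are illustrative choices, not something defined by the dataset:

```python
def player_name(forename, surname):
    """Forename and Surname where both are supplied, otherwise just Surname."""
    forename = (forename or "").strip()
    surname = (surname or "").strip()
    return f"{forename} {surname}" if forename else surname


def filter_key(forename, surname, club):
    """Append the club so that namesakes at different clubs stay distinct."""
    return f"{player_name(forename, surname)} ({club})"


print(player_name("", "Alex"))                      # Alex
print(filter_key("Paul", "Robinson", "Bolton"))     # Paul Robinson (Bolton)
print(filter_key("Paul", "Robinson", "Blackburn"))  # Paul Robinson (Blackburn)
```

Because the ‘Adam Johnson’ record has the full name in the surname field and no forename, it comes out as 'Adam Johnson' without any special handling.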

Metadata
The raw dataset contains only instances where a player gets on the pitch so needs a bit of rejigging to fill in any potential blanks.

I’ve done this by using the raw data to create a summary table of all matches (grouping the table by Team/Opposition/Venue/Date), e.g.:




This then gives a list of all the fixtures for all the teams (20 teams playing 38 matches = 760 records).

The next step was to create a deduplicated list of all players for each team e.g., Joshua (Josh) McEachran has an entry for both Chelsea and Swansea.  This gives a list of 561 players, matching this to the fixture table (matching by team) gives a total of 21,318 records (561 players for each of 38 matches).

This gives the ability to create a dataset which includes details of where a player takes no part in a game such as the chart below showing shots by game for Wayne Rooney.  The blanks are where he didn’t play (as opposed to the zero values which are where he had no shots).
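The fill-in-the-blanks step is effectively every player crossed with every fixture for their team, with the raw values attached where they exist. A minimal sketch with made-up opponents and shot counts; a real version would also key fixtures on venue and date, since teams meet twice a season:

```python
# Hypothetical miniature of the raw data: rows exist only when a player got on the pitch.
appearances = [
    {"team": "Man Utd", "player": "Rooney", "opposition": "Team A", "shots": 3},
    {"team": "Man Utd", "player": "Rooney", "opposition": "Team C", "shots": 0},
]
fixtures = [
    {"team": "Man Utd", "opposition": "Team A"},
    {"team": "Man Utd", "opposition": "Team B"},
    {"team": "Man Utd", "opposition": "Team C"},
]


def build_scaffold(appearances, fixtures):
    """One row per player per fixture of their team; shots is None where they didn't play."""
    squad = sorted({(row["team"], row["player"]) for row in appearances})
    played = {(r["team"], r["player"], r["opposition"]): r["shots"] for r in appearances}
    return [
        {
            "team": team,
            "player": player,
            "opposition": fixture["opposition"],
            "shots": played.get((team, player, fixture["opposition"])),
        }
        for team, player in squad
        for fixture in fixtures
        if fixture["team"] == team
    ]


scaffold = build_scaffold(appearances, fixtures)
shots_by_game = [row["shots"] for row in scaffold]
print(shots_by_game)  # [3, None, 0]
```

Keeping None separate from 0 is what lets a chart show a blank for 'didn't play' and a zero for 'played but had no shots'.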














All of the above is still just scratching the surface (even before the more detailed release of within game player actions) but hopefully begins to make the point about creating open source (and approved) modified datasets to avoid large scale duplication of work as well as issues around differing definitions.

Update - 14th Sep.

I have now created a spreadsheet in the same format as the original dataset which has 40 records containing own goals data.  If you add this to the original spreadsheet then the 'Goals' and 'Goals Conceded' totals will now tally.

Spreadsheet is available at: https://skydrive.live.com/redir?resid=A1BA00769DC2D906!105

Dan Barnett

Director of Analytics