
Wednesday, 23 January 2013

Comparison of Opta stats providers

It may just be self-selection based on the kind of things I read, but there seems to be an ever-growing interest in data in football. The subject appears to be moving from niche to mainstream, with increasing mentions in the press such as a recent article in The Guardian.

This is partly due to more and more sites making use of data in football, in particular from Opta.  In this post I'll look at the pros/cons of a number of sites/apps that use Opta data and their comparative strengths and weaknesses.

When Swansea City reached the Premier League with promotion in May 2011, I decided to set up the blog www.wearepremierleague.com to combine my interest in stats with that of the Swans. Generally speaking there is a paucity of (publicly available) data around activity in lower leagues - although credit must go to Ben Mayhew for his attempt to rectify this at Experimental 361. The level of detail publicly available for the top leagues in Europe however is still far beyond that in the Championship and below.

Guardian Chalkboards
When I started out, this was one of the few resources about and had the advantage of being free and web (not app) based.  I won't go into too much detail about it as it's sadly no more (possibly ahead of its time?) but the thing I liked most about it was being able to visualise the activity with regard to where on the pitch it took place.

The image below shows a Swansea goal against Blackburn where every Swansea player touched the ball in the move.
The addition of squad numbers to activity gives a level of detail not available anywhere else I've looked
Stats Zone
The demise of Guardian Chalkboards a couple of months into the season was the nudge I needed to get an iPod Touch to be able to use the Stats Zone app.

Stats Zone is great both for looking at the top level stats (e.g., Shots per Team) and for delving into the detail of a particular match (e.g., Long passes by a particular player).
Example of a Stats Zone Screen shot, in this case comparing the Aerial Duel activity of Peter Crouch with the Stoke team as a whole
Stats Zone is produced in conjunction with FourFourTwo magazine and their website includes blogs produced by Opta and Zonal Marking and others.

I combined my interest in football with that of data analysis in the creation of a Premier League Review dashboard, which is an interactive presentation taking a number of images from Stats Zone.

WhoScored.com
Whenever Swansea are linked with a particular player (usually from La Liga), WhoScored is the first site I go to as it has in-depth details for any player across the major European leagues:
WhoScored has details both on overall activity for that season as well as the ability to drill down into activity for a particular game
WhoScored also has a fairly comprehensive list of stats for any particular match, with the ability to sort ascending/descending on these metrics for each player within a team (Long Balls, Chances Created etc.), and also blogs from a number of respected writers.

Squawka Sports
Squawka.com is to some extent a cross between Stats Zone and WhoScored in that you can look at activity of individual players across the season as a whole, but also look at specific types of actions graphically for a specific player in a particular match e.g., Canas' passes vs. Malaga
Squawka goes for a dashboard approach for presenting a lot of its data
EPLIndex.com
The level of detail available in the sites/app mentioned above will be enough for the majority of people, but for those wanting more there is the pay-for site EPLIndex.com (£3.95 a month/£40 a year).

Where WhoScored for example might have total passes and pass accuracy, EPL Index will break this down even further e.g., Passes/Accurate passes in Own Half/Attacking Half/Final Third:
EPL Index Screenshot - huge amount of data across numerous tabs
The level of detail of this data is pretty much the same as the release from Opta/Manchester City of the summary stats for the 2011/12 season, just not in a single spreadsheet.

One of the other advantages of EPL Index is that it has data for multiple seasons making comparisons such as one I did recently comparing Danny Graham and Kenwyne Jones possible:
Example of the kind of thing it's possible to collate using data supplied by EPL Index
As well as the option of subscribing to stats, for those who just want to read about stats and football the site has an ever-growing number of authors who use the data to write and publish their own analysis, arguably to a depth rarely seen anywhere else.

Relative Strengths and Weaknesses of each source

Stats Zone - Strengths:
  • Ability to visualise activity e.g., location of Shots/Interceptions etc., 
  • Includes simple top level summaries e.g., total tackles made ordered by all players not split by team as is the case in the other sources
  • Ability to drill into data within the game e.g., compare first 62 minutes with last 28
  • Ability to create bespoke comparisons across matches/teams e.g., Chances made by John Walters in first 30 minutes vs. Aston Villa compared to Chances made by Stoke vs. West Brom  
Stats Zone - Weaknesses:
  • Apple devices only - no Android or Web version
  • Lacks ability to see multiple stats simultaneously e.g., Tackles/Passes/Shots per player
  • Doesn't have stats collated across a season

WhoScored - Strengths:
  • Includes data on all major European Leagues and Champions League
  • Easiest site to navigate around between stats for Team/Player/Match
  • Best for comparing statistics across teams, form/shots per game
WhoScored - Weaknesses:
  • Little visualisation of data - there is a nice image of shot areas but not the chalkboards such as those from Stats Zone/Squawka
  • No ability to analyse activity within a game e.g., compare 1st and 2nd half stats

Squawka - Strengths:
  • Has ability to easily track metrics for a team or player for a single match or across season
  • Includes heat maps of activity by player/team
  • Ability to drill down within part of the game (currently 5 minute intervals)
  • Lots of charts as well as raw data, multiple options for visualising the same data
Squawka - Weaknesses:
  • Doesn't have the same level of detail of stats readily available as other sites although only likely to bother the really in-depth user
  • Good to have charts but some could be better e.g., if a player has played in 15 of 22 league games, only stats for those 15 are shown.  Personally I would like to see the blanks to know where over the season that player hasn't featured
  • Stats Zone plots chalkboards from the point of view of the team you're analysing, attacking from left to right; Squawka plots them with the Home team playing from left to right, which can be annoying when trying to compare areas of attack/passing

EPL Index - Strengths:
  • Most in-depth of any of the data sources
  • Has league data going back to 2008/9 season
  • Top-Stats feature gives ability to find best players across a range of metrics with ability to filter by those playing at least x minutes in a game or total minutes across a season (e.g., avoids problem of someone coming top in pass completion % with 1 pass from 1 attempt)
EPL Index - Weaknesses:
  • Pay-for site
  • No ability to analyse activity within a game
  • Generally best thought of as a source of data from which you create something yourself 
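
The minimum-minutes filter described under the Top-Stats feature above can be sketched in a few lines. This is only an illustration of the idea, not how EPL Index implements it: the `PlayerSeason` fields, the 900-minute threshold and all of the figures are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class PlayerSeason:
    name: str              # hypothetical player label
    minutes: int           # total minutes played across the season
    passes_attempted: int
    passes_completed: int


def top_pass_completion(players, min_minutes):
    """Rank players by pass completion %, ignoring anyone below min_minutes."""
    eligible = [
        p for p in players
        if p.minutes >= min_minutes and p.passes_attempted > 0
    ]
    return sorted(
        eligible,
        key=lambda p: p.passes_completed / p.passes_attempted,
        reverse=True,
    )


# Made-up figures purely for illustration.
players = [
    PlayerSeason("Late substitute", 2, 1, 1),               # 1 pass from 1 attempt
    PlayerSeason("Regular midfielder", 2700, 1800, 1620),   # 90% completion
    PlayerSeason("Regular defender", 3000, 1200, 1020),     # 85% completion
]

unfiltered = top_pass_completion(players, min_minutes=0)
filtered = top_pass_completion(players, min_minutes=900)
print([p.name for p in unfiltered])
print([p.name for p in filtered])
```

Without the threshold the one-pass substitute tops the list; with it, only players with a meaningful sample are ranked.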

Turning Data into Insight
Although each of these companies is taking the same (or at least similar) data from Opta, it can be seen that they have each used it in different ways and are all still improving as time goes on. Eventually I'd imagine that one of these sites (or a newer entrant such as Sky) will bring all these parts together, possibly also including video for a complete experience.

As an example, a lot has been made recently about David de Gea pushing balls back into dangerous areas when he makes a save. The raw data will only tell you so much, but being able to view all his saves, or the saves where there is a goal in the subsequent 10 seconds, would give an even more detailed picture.

TV rights are far too precious to be given away, but the ability to create your own highlights package (e.g., all chances created by Pablo Hernandez, with approx 10 seconds of footage per chance created) could take interactive entertainment to a new level.

Other Posts: Man City and Twitter, Twitter and Bookies - A Case Study, Premier League Weekly Review

Monday, 21 January 2013

Manchester City and Twitter

Following on from previous posts looking at football and Twitter, this current post looks at some of the activity on Twitter from Manchester City. Man City are current Premier League holders and arguably the richest club in the world so to some extent getting an extra few retweets isn't the most important thing for the club.

That said, with Financial Fair Play being introduced (where clubs have to sort-of be self-sufficient, although as it involves UEFA who knows how it'll actually play out), the club needs to maximise its off-field revenues and Social Media will naturally be part of that.

Man City currently have almost 680k followers and have the fourth highest following in the League behind Arsenal, Chelsea and Liverpool (Manchester United are currently not on Twitter, which is presumably a strategic decision, but would likely gain several million followers within a matter of hours if they ever decided to join, such is their global reach).

In a previous blog I wrote about 'Sweating the Assets': an organisation such as Man City will have huge amounts of interesting content, but because people are wary of repeating a message for fear of annoying followers, quite often the value isn't maximised.

Man City have in part taken this on board, with the example of promoting a video of one of their players, John Guidetti, who has just started playing again after injury:
Man City tweeting a link to the same article 3 times over the space of a few hours
As seen previously when looking at response to tweets from other accounts, response drops off dramatically within a matter of minutes, so there's no reason to worry too much about over-promoting when mentioning the same thing 3 times over the space of 9 hours. There's a drop-off in the number of retweets, but not at the levels you would see if you were repeat mailing the same people.  It also makes sense for an organisation that's looking to promote itself globally to take different time-zones into consideration.

The one thing I would suggest however is adding a dummy part to the URL so the shortlinks produced for each of the tweets are unique, to give a true understanding of response per tweet.  There are more details on creating unique links in this blog.

Although the example above shows that they are considering multiple posts, generally speaking items get posted only once. Below are a couple of examples of recent tweets that have sufficient interest to be reposted but which were only posted once:
Recent Man City tweets showing the rapid decay in click-through, the Gallery tweet generating almost as much response in the first 7 minutes as the subsequent hour
Twitter is very much a 'blink and you'll miss it' medium and generally speaking most users won't have set up lists of key accounts such as the football team they support so there is real value in repetition along with considering peak usage times across the globe.

Friday, 7 December 2012

How BBC Breaking News Track Individual Tweets

In a previous blog I mentioned how it's possible to separately track the activity of numerous tweets that all point to the same page by adding dummy text to the web address e.g., www.analysismarketing.com/#blogtest

This means when you use a link shortening service (either directly within Twitter or using a tool such as Bit.ly) each unique address e.g., /#01, /#02 will be given a different shortened link.  Without this, every time you tweet a link to the same page it generates the same shortened link, making it difficult to attribute activity to individual Tweets.
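
As a rough sketch of the idea, the snippet below builds several addresses for the same page that differ only in the dummy text after the #. The `tagged_links` helper and the `tweet01`-style labels are invented for this example; any unique text would do.

```python
from urllib.parse import urlsplit, urlunsplit


def tagged_links(url, count, prefix="tweet"):
    """Return `count` copies of `url`, each with a distinct #fragment."""
    parts = urlsplit(url)
    return [
        urlunsplit(parts._replace(fragment=f"{prefix}{n:02d}"))
        for n in range(1, count + 1)
    ]


links = tagged_links("http://www.analysismarketing.com/", 3)
for link in links:
    print(link)
```

Each of these addresses can then be passed to the shortening service separately, so that each tweet ends up with its own short link and its own click count.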

A good example of an organisation that puts this into practice is the BBC with the @BBCBreaking account.  Depending on how big a story is, there might be several tweets linking to the same article.  An example of this was the arrest of Max Clifford.

@BBCBreaking sent out two tweets, one at 1:11pm and one at 1:19pm on Thursday 6th Dec.  Both were to the same webpage but as there were dummy details added, they have separate short-codes:
Click-Through volumes from the two tweets.  The first is around 40% higher overall but the response curves have similar patterns
The two tweets have different addresses: http://www.bbc.co.uk/news/uk-20627765#TWEET424921 and http://www.bbc.co.uk/news/uk-20627765#TWEET424943 but go to the same physical page (i.e., everything from the # onwards gets ignored).
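
A quick way to check this is to split each address at the # and compare what is left. The sketch below uses Python's standard `urldefrag` function on the two BBC addresses quoted above; the variable names are just for illustration.

```python
from urllib.parse import urldefrag

tweet_links = [
    "http://www.bbc.co.uk/news/uk-20627765#TWEET424921",
    "http://www.bbc.co.uk/news/uk-20627765#TWEET424943",
]

for link in tweet_links:
    page, tag = urldefrag(link)   # page = address without the #..., tag = the dummy text
    print(page, tag)

# One underlying page, two distinct tracking tags.
pages = {urldefrag(link).url for link in tweet_links}
tags = [urldefrag(link).fragment for link in tweet_links]
```

Both links resolve to a single page address, while the two tags stay distinct, which is what allows a shortener to treat them as separate links.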

If you're promoting the same article/page on numerous occasions, this method is a way of tracking the individual piece of activity that has driven that click.  This principle works wherever you put the weblink (Twitter/Facebook/LinkedIn/Email etc.) and gives you a more granular level of detail than would otherwise be available.


Wednesday, 28 November 2012

Twitter and Bookies - Case Study

My interest in numbers probably started from a very early age working out the expected winnings from my Auntie's accumulator or working out the number of possible doubles and trebles from horses picked to be able to calculate stake money.

I didn't realise it at the time, but working out the return from a 50p double on a 6/4 and 15/8 shot was a great intro into probability and ever since I've had a huge interest in betting.  I rarely bet however partly because I know the bookie marks up odds to give a total probability of over 100% so I would expect to lose money in the long run (unless I take the view that I'm better placed to allocate odds than the bookmaking industry, which I'm sure has led to the bankrupting of many an 'expert').
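
For anyone wanting to check the arithmetic, here is a small sketch of the calculations mentioned above: the return on a 50p double at 6/4 and 15/8, the number of doubles and trebles from a set of picks, and how a book adds up to more than 100%. The helper names, the choice of four horses and the three-runner market are assumptions made up for the example.

```python
from fractions import Fraction
from math import comb


def decimal_odds(fractional):
    """Convert fractional odds such as 6/4 into a total-return multiplier."""
    return fractional + 1


def accumulator_return(stake, selections):
    """Total return (stake included) if every selection wins."""
    total = stake
    for odds in selections:
        total *= decimal_odds(odds)
    return total


def implied_probability(fractional):
    """Probability implied by fractional odds, before any margin is removed."""
    return 1 / decimal_odds(fractional)


# 50p double on a 6/4 shot and a 15/8 shot.
double = accumulator_return(Fraction(1, 2), [Fraction(6, 4), Fraction(15, 8)])
print(float(double))  # total return in pounds, stake included

# Number of doubles and trebles available from 4 picked horses.
print(comb(4, 2), comb(4, 3))

# Hypothetical three-runner market: the implied probabilities add up to
# more than 1, and the excess is the bookmaker's margin.
book = [Fraction(6, 4), Fraction(15, 8), Fraction(9, 4)]
overround = sum(implied_probability(o) for o in book)
print(float(overround))
```

The double returns £3.59 (0.50 x 2.5 x 2.875, stake included), and four picks give 6 doubles and 4 trebles. In the made-up market the implied probabilities sum to roughly 105.6%, which is the kind of mark-up that makes long-run losses the expected outcome.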

Due to this interest, I've used Bookmakers as my most recent example of organisations using Twitter. I've put together a case study that looks at the Twitter following of a number of major accounts involved in the betting industry.

Some of the main points are:

  • Huge variation in overall followers from some big accounts such as Paddy Power (@paddypower) which is now around 100k followers down to Bet Victor (@betvictorfans) which has only a few thousand.
Twitter follower volumes at time of analysis (9th-11th Nov)
  • Average (median) follower of one (or more) of these accounts follows 165 accounts and is followed themselves by 38.  There is a huge disparity in follower volumes between the Twitter influencers and the average user
  • Over three-quarters of the c268k people following one or more of the Twitter accounts we looked at followed only one of the accounts, which was surprising.  Where a serious punter might have several accounts to enable them to get the best odds on a market, it's arguable that an average punter might just stick with one.  This is probably reflected in the relatively generous free bet offers that bookies use to tempt new punters.  If you are a serious punter there's a big difference between 6/4 and 15/8, but if you're putting your money on markets such as first goalscorer, where the bookie's margin is far bigger, getting fractionally better odds is far less of an issue.
Of those following 1 or more of the accounts analysed, total number of those accounts followed
  • 'Come for the banter, stay for the odds' seems to be the motto of the more successful Twitter accounts with a number of the ones with bigger followings including jokes, pictures of WAGs in their underwear etc., as a means of getting their account visible beyond their own followers by attracting retweets (and hopefully as a result extra followers).
  • Twitter can be a lot of work for little reward. The sheer volume of tweets means that your message can easily get lost in the mass unless it has something that makes it stand out.  Analysing the bit.ly responses for some of the accounts often gave click-through values in single digits.

The full case study is available to download at the Analysis Marketing website.

Thursday, 25 October 2012

Why 'everything' is a database


There was a character in the late 90s sketch show ‘Goodness Gracious Me’ who kept annoying his son by claiming that everyone of note came from India:
Da Vinci? Indian. The Queen? Indian. Picasso? Indian.

I have a similar trait to that character except my ubiquitous reference is ‘Database’:
Google? Database. Facebook?  Database. Twitter? Database.

Ultimately all big organisations are doing the same thing, just in slightly different ways: they all collect huge amounts of data, with the difference being how they pass that back to users and the key being how they store, manipulate and disseminate it.

What’s all this got to do with football?  Well, looking at the MCFC Analytics data I was struck by the similarities between this and the kind of data you might see within a normal customer database. The data is provided at a level of one record per player per match, which could be considered to be like items from an order: each order has multiple items and each customer (Team) has multiple orders.

From here the natural step is to turn a load of data into summary views which would provide the starting point of any analysis which in database marketing terms would be:

Single Team View – One record per Team
Single Match View – One record per Match
Single Player View – One record per Player
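
As a minimal sketch of what building these views involves, the snippet below collapses player-per-match rows into one record per team, match or player. The field names and figures are invented for illustration and are not the actual MCFC Analytics column names; the real dataset has far more columns.

```python
from collections import defaultdict

# Hypothetical rows, one per player per match; field names are assumptions.
rows = [
    {"team": "Swansea", "match_id": 1, "player": "Player A", "goals": 1, "shots": 3, "passes": 40},
    {"team": "Swansea", "match_id": 1, "player": "Player B", "goals": 0, "shots": 1, "passes": 65},
    {"team": "Stoke", "match_id": 1, "player": "Player C", "goals": 0, "shots": 2, "passes": 20},
    {"team": "Swansea", "match_id": 2, "player": "Player A", "goals": 0, "shots": 2, "passes": 35},
]


def summarise(rows, key, metrics=("goals", "shots", "passes")):
    """Collapse player-match rows to one record per distinct value of `key`."""
    totals = defaultdict(lambda: dict.fromkeys(metrics, 0))
    for row in rows:
        for metric in metrics:
            totals[row[key]][metric] += row[metric]
    return dict(totals)


single_team_view = summarise(rows, "team")
single_match_view = summarise(rows, "match_id")
single_player_view = summarise(rows, "player")

print(single_team_view)
```

The same function produces all three views simply by changing the key it groups on, which mirrors the order-item to customer roll-up in a typical customer database.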

The insight usually comes not just from aggregating the raw data but from manipulating it to create extra variables which give a greater depth of understanding beyond just totals and averages.

The first one of these I have put together is the single team view. The main part of this is just totalling the details of the individual players (along with the own goals data), with some other details added in for each team.

This produces a table of nearly 200 columns, so it is fine as a data source but looking at it for any length of time will give you a headache.  The job of any analyst should be to be able to take this and make something more user friendly.

To that end I have produced a summary dataset called single team view summary.xls which is one record for each of the teams which as well as having the usual goals scored/conceded also has some other information which I think is pretty interesting.

Much has been made about Newcastle possibly punching above their weight (i.e., lucky) and possibly in store for a more average season this time.  It’s certainly true that there are a number of stats which suggest they over performed:
  • Newcastle only had more shots than the opposition in 15 of their 38 games, around half the number managed by the teams around them in the table.
The top 4 (plus Chelsea and Liverpool) had more shots than the opposition in the majority of their matches
  • They conceded roughly three ‘Big Chances’ for every two they created (a ratio of 0.67 Big Chances created per Big Chance conceded). Chelsea are the only other top half team where the ratio is less than 1.  A 'Big Chance' is described as an opportunity where a goal would be expected.
For this metric, the top 4 (plus Everton, Liverpool and Fulham) are the only sides to create more 'Big Chances' than they concede
  • For the majority of their games, Newcastle had fewer passes and fewer final third passes than their opponents where the rest of the top 6 dominated.
The traditional 'Big Six' were the teams that tended to dominate passing (especially final third passes), with Swansea and Stoke being outliers.
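
The comparisons in the bullets above are examples of derived variables that come from setting a team's figure against its opponent's match by match. A rough sketch, using made-up numbers and hypothetical field names rather than the real data, might look like this:

```python
# Hypothetical per-match totals for one team; all numbers are made up.
matches = [
    {"shots_for": 9, "shots_against": 14, "big_for": 1, "big_against": 2},
    {"shots_for": 15, "shots_against": 8, "big_for": 2, "big_against": 1},
    {"shots_for": 7, "shots_against": 12, "big_for": 1, "big_against": 3},
]


def games_won_on(matches, metric):
    """Count matches where the team's figure beat the opposition's."""
    return sum(1 for m in matches if m[f"{metric}_for"] > m[f"{metric}_against"])


def created_per_conceded(matches, metric):
    """Season ratio of a metric created to the same metric conceded."""
    created = sum(m[f"{metric}_for"] for m in matches)
    conceded = sum(m[f"{metric}_against"] for m in matches)
    return created / conceded


outshot_count = games_won_on(matches, "shots")
big_chance_ratio = created_per_conceded(matches, "big")
print(outshot_count, round(big_chance_ratio, 2))
```

In this toy example the team out-shoots the opposition in 1 of 3 games and creates 4 big chances while conceding 6, a ratio of 0.67. Swapping the metric name gives the equivalent counts for passes or final third passes.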

Liverpool were arguably the opposite of Newcastle, dominating games without seeing it returned in points. Luck may play some part in results, but the ability to be clinical in front of goal (Newcastle: 11.5% of shots were goals) or not (Liverpool: 7%) is not some random event. It is arguably something a manager has little control over on the day itself but does influence through signings and selection.

Other things of interest were Swansea making more passes than the opposition in 33 of their 38 games, but only more final third passes in 9 games with Stoke being the opposite, having just 3 games where they made more passes but 12 where they made more final third passes.

There are an almost infinite number of ways of reformatting the MCFC Analytics dataset and the output above is only the tip of the iceberg.  Given the amount of data involved it may be that collaboration and sharing of datasets is the fastest way to gain an overall understanding of the data.

The spreadsheet behind the figures above (which contains a number of other derived metrics including home/away splits) is available at: https://skydrive.live.com/redir?resid=A1BA00769DC2D906!105 along with the Own Goals data and other Premier League related output.

Dan Barnett
Director of Analytics


Friday, 19 October 2012

Twitter Analysis - Ben Goldacre

Previous posts have focused around the Twitter activity of journalists at The Times promoting their articles.  This blog looks at the activity of someone who appears to have a great understanding of making the most of Twitter.

Ben Goldacre is a doctor who is arguably best known for his Bad Science articles in the Guardian (and book of the same name). He has over 230k Twitter followers so must be doing something right.

The reason I have picked Ben for this blog is that he is a good example of someone who is willing to repeat his message (but not in a spammy way). A simple example of this is where he tweeted a link to his article about GlaxoSmithKline.

The tweets linking to the same article were sent out at 9:37pm and 10:57pm on the 11th Oct and also 10:36am on the 12th (oldest one displayed first):
The response by hour shows how the third Tweet has almost double the response of the initial tweet (they were sent at almost the same time past the hour so a pretty fair comparison can be made between the two).  It's possible that just after 10:30 on a Friday morning is the perfect time to hit people on a mid-morning break looking for something interesting to read to distract them from work.
  Response by Hour to the link mentioned in the Tweets

There are other tweets in between these so it is not as if Ben is just hammering home a single point with nothing else to say.

Another good thing that Ben does is not assume that anyone reading any single tweet will know the whole context of what he is saying. Rather than just linking to something once and then sending follow-up tweets talking about that subject, Ben includes the link for reference in each tweet (as seen below; again there will be tweets on other areas between these tweets).

A series of tweets around the same topic (most recent first), there's every chance that a follower could first be reading Ben's tweets on this subject at any point so the link helps to provide context (and drive activity).

If you're only following 10 people on Twitter then obviously this would be quite annoying but generally people are following 100+ accounts and not checking their timeline every 5 minutes so the risk of over-exposure is minimal even if I did see someone sarcastically tweet that they didn't realise Ben has a book out at the moment.

Find out more about how we can help you with data at www.analysismarketing.com

Dan Barnett
Director of Analytics


Thursday, 11 October 2012

It's not just what you say, it's how you say it

In previous blogs looking at the activity of The Times dropping the paywall an hour at a time for selected articles, I've looked at the value of resending the same/similar message.  In this example, I look at the fact that it's not just follower volumes that are important, it's relevance (and also the message itself).

In a piece on the recent sponsorship deal with Wonga for Newcastle United, George Caulkin rails against the increasingly depressing impact of business on football.  This was sent at 4pm on Tue 9th October with the article being free to view between 4pm and 5pm.
This was retweeted by a few other people but as of 4.30pm had only had a few hundred clicks even though George has over 34k followers (not a bad response for a tweet though).

By the end of the hour though, the link had been clicked over 2,600 times.

This was due in part to George promoting the article again with a follow up tweet:
This tweet was then retweeted at 4:42pm by Joey Barton who has 1.7m followers, creating the first of the two large spikes.  Joey had already promoted the article with the direct link (and had some Tweets back and fore with George).

The second spike was due in part to a tweet at 4.52pm from Mirror reporter Ollie Holt which both praised the article and also reminded people that there were only a few minutes to go before the article was no longer free.
Despite the fact that Ollie Holt with 154k followers has less than a tenth of the followers of Joey Barton, it would appear that Holt has generated a greater response.

This will be for a number of reasons: the piece is personally endorsed rather than just retweeted (where it will appear as coming from George Caulkin with just details of 'retweeted by Joey Barton' at the bottom) and there is also a direct call to action: 'Read it quickly. Only free until 5pm'.

As mentioned in other posts, the details above are for visits using the Bitly link mentioned in the tweets. There will be cases where people have found the article themselves or chosen to link directly without the Bitly link, so these figures show the impact of the initial tweets rather than the overall activity to that page.

It can sometimes seem like Social Media is a whole new world and all the rules of marketing have changed but that's often not the case.  As can be seen from the impact of Ollie Holt's enthusiastic endorsement combined with the time-limited call to action, a lot of the traditional methods of generating response are still valid.

Dan Barnett

Director of Analytics