Statistics to prove anything: 2010

Monday, October 25, 2010

How to Compare Boozer and Jefferson

I am a big fan of basketball, and my favorite team has always been the Utah Jazz. This past summer Carlos Boozer, the Jazz's starting power forward, left the team to join the Chicago Bulls via a sign-and-trade. This allowed Boozer to sign a larger contract and gave the Jazz a "trade exception." Basically, a trade exception allows a team to make a trade in which it receives players whose contracts are worth more than the contracts it sends away (see NBA Salary Cap: Other exceptions). The Jazz then turned around and, with the help of their exception, traded for Al Jefferson in exchange for Kosta Koufos, 2 draft picks, and a bag of potato chips (well, pretty much). At the end of the day, it's essentially as if the Jazz exchanged Boozer for Jefferson.

On July 7th, just days before the Jefferson trade, the Deseret News reported Utah Jazz fans have mixed reactions about Boozer, where Boozer's tenure in Utah is summarized as being "polarizing". Boozer was an All-Star caliber player, but missed a third of his games in Utah, and many fans felt that last year he didn't want to be back in Utah. Either way, losing Boozer was going to impact the future of the Jazz.

After the trade, however, the future of the Jazz seemed much brighter. If nothing else, Al Jefferson was excited to be in Utah, and who wouldn't be after being on the Timberwolves? I'm excited for Al Jefferson, and my friends who are fans of the Utah Jazz on Facebook seem to be as well. Bill Simmons is also excited for Jefferson; he writes in his must-see list column:

On a personal note, I loved Big Al on the Celtics, rooted for him in Minnesota, felt like a proud dad when his career was taking off in January '08, agonized for him when he blew out his knee that same month, then felt terrible last season as he struggled through one of the worst situations in recent memory. Seeing Jefferson get a chance to rejuvenate his career with a top-four point guard and a top-four coach -- on a good team, in a city that gives a crap -- is one of my five favorite things about this upcoming season. So there.

Simmons isn't the only one who thinks Jefferson has a chance to emerge as a star in Utah's system. Sekou Smith shares his opinion in his blog entry Big Als Time To Shine!

While debating the merits of Al Jefferson over Carlos Boozer as the low post catalyst for the Utah Jazz, someone informed me that Jefferson’s numbers on losing teams don’t compare the stats Boozer put up in a winning situation in Utah the last six years.
I snapped. Seriously, I lost it.
When told that solid numbers on a bad team mean nothing, I couldn’t hold my tongue. It’s the most ridiculous thing I’ve heard. Jefferson’s work during his final season in Boston and his three seasons in Minnesota all speak to the talent he has honed during his six NBA seasons.
It’s not his fault he’s been on bad teams in rebuilding situations at every stop. But to assume that his numbers (Jefferson was a 20-10 man during his three seasons in Minnesota and has career averages of 15.3 points, 8.7 rebounds and 1.2 blocks) are meaningless on a bad team is not only an insult to my basketball sensibilities, it’s also downright foolish.

I like what Simmons and Smith have to say about Jefferson. Simmons sees Jefferson's potential and intangibles and recognizes that he can excel in the new situation he has found himself in. Smith sees Jefferson's numbers as a representation of the type of player that he is, and argues that the Jazz will at least be as good as they were with Boozer. I think they both make excellent points.

David Berri disagrees, at least with Smith's posting. He responds to Smith's blog post on the website OpposingViews.com with his article Not Even Close: Carlos Boozer vs. Al Jefferson, where he uses the metric WP48 to refute Smith's opinion. Essentially, Berri uses this metric to show that Boozer contributes more to his team's success than Jefferson does.

I had never heard of the metric "WP48" before, so I took a look at the provided links. WP48 stands for "Wins Produced per 48 minutes" and is calculated through a series of adjustments for how teams play and how other players play at particular positions. Details are provided here. After reviewing the metric I decided that I'm not a fan, and I'd like to make a few points.

I find it indefensible that the defensive aspect of the metric is averaged over the entire team, for the entire game, even when the player isn't playing! That means that a player is penalized or rewarded not only for the other 4 players on the court with them, but for every player that comes off the bench. Furthermore, some teams get away with subpar defensive play from their guards by having defensive big men patrolling the paint, or some other combination of good and bad defenders on the court at the same time. Both Boozer and Jefferson are known as subpar defenders, but the Jazz certainly had a better defensive team as a whole than the Timberwolves. Jefferson is thus penalized more than Boozer for defensive play by the metric.

Despite the numerous adjustments made throughout the computation of the metric, it still seems most appropriate in comparing players on the same team, and at best it compares players that play similar styles on similar teams.

The Timberwolves ran an offense that was terrible for Jefferson; the triangle just didn't work for them. They also didn't have an All-Star point guard to get him the ball in good places. Boozer may have had a higher scoring efficiency, but that isn't surprising since he also played for a team with a more efficient offense.

Although the metric adjusts for playing time by expressing everything per 48 minutes, it still doesn't account for the spacing on the floor. What I mean specifically is that Kevin Love was one of the best rebounders in the league last year, and played in a way that was not cohesive with Al Jefferson being on the floor at the same time. The minutes they played together hurt Jefferson.

Also, I am not a fan of reducing data of this sort down to a single number. Don't get me wrong, I realize the importance of being able to reduce high-dimensional data to dimensions that are interpretable, but a single value in this case is just too much. I don't see why you can't compare different aspects of their games side by side and decide that one player is better in certain aspects.

I don't have a good answer as to whether Jefferson will be better for the Jazz than Boozer. I don't know if there are good metrics for determining how well a player will do in a new system. I'm reminded of a conversation I once had with my former roommate Eric. I told him about how I was going to model my predictions for March Madness and Eric said "I just watch a lot of basketball and pick the teams that I think are better." In this case I'm going to trust the guys that watch a lot of basketball.

I do have a method for comparing players though. A few weeks ago I made some biplots to help prepare myself for fantasy basketball. The center of the plot is the league average in each category for PFs, and the stats are "per game." Each player projects down onto the categories at a right angle. Boozer can be found just above DREB, and Al Jefferson is between DREB and OREB, a little more toward the center. You'll have to click on the image and zoom in to really read the names.

Biplot of Power Forwards 09-10 NBA Season

I love these plots; they summarize a lot of information. It can be seen that Boozer clearly had a better statistical year last year than Jefferson, but not by much, relatively speaking. Boozer looks like he puts up All-Star numbers, while Jefferson looks like he is a top ten power forward. And again, this data is all from last year and doesn't tell us how Jefferson will do this year.
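For anyone who wants to try this kind of plot, here is a minimal sketch of one way to build a biplot in R. It is not necessarily how the plot above was made: it assumes a principal components approach via prcomp() and biplot(), and the data are simulated for 20 hypothetical power forwards (the player labels, the PTS, AST and BLK columns, and every number are made up for illustration). Centering each column puts the "average" player at the origin, much like the league average sits at the center of the plot described above.

```r
set.seed(42)

# Simulated per-game stats for 20 hypothetical power forwards.
# These are NOT real NBA numbers; they only illustrate the mechanics.
n <- 20
stats <- cbind(
  PTS  = rnorm(n, 12, 4),
  DREB = rnorm(n, 5, 1.5),
  OREB = rnorm(n, 2, 0.7),
  AST  = rnorm(n, 1.5, 0.6),
  BLK  = rnorm(n, 0.8, 0.4)
)
rownames(stats) <- paste0("PF", seq_len(n))

# center = TRUE places the average player at the origin;
# scale. = TRUE puts the categories on a comparable scale
pca <- prcomp(stats, center = TRUE, scale. = TRUE)

# Players are drawn as labels, categories as arrows
biplot(pca, cex = 0.7, main = "Biplot of simulated power forwards")
```

Swapping the simulated matrix for a real table of per-game stats (players in rows, categories in columns) is all that should be needed to reuse the sketch.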

I hope the title of this blog posting didn't mislead you into thinking I had a good answer. I'll have to think about how to better model how well players would perform in different systems.

Monday, October 18, 2010

My Favorite Online R References

I don't by any means claim to be a great and efficient R programmer, but it is certainly something that I aspire to be. The most important thing I've found is that I don't have to use a lot of brain power trying to remember every R command created. I just have to remember that there are R commands that have the ability to do things I want, and then it just takes a moment or two to look up what I want.

From time to time I think I'll post some of my favorite functions on here that I always seem to forget about that make life easier. For example the "which" function, which can often be replaced by using brackets in a clever manner, but can be invaluable in other instances. Another is the "match" function, which is great when you have, say, a sub list of rownames from a full list that you want to do something with. Another interesting function that was useful last week was the "jitter" function, which adds a specified amount of random noise to the elements of a vector.
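Here is a quick sketch of those three functions in action. The vector, the matrix, and the row names are all hypothetical, made up just to show the calls.

```r
# Hypothetical data, made up purely for illustration
x <- c(a = 3, b = 8, c = 1, d = 9, e = 4)

# which() returns the positions where a condition is TRUE
big_idx <- which(x > 3)   # positions 2, 4, 5

# The bracket shortcut gives the values directly, without the positions
big_vals <- x[x > 3]

# match() finds where each element of a sub list sits in the full list
m <- matrix(1:8, nrow = 4,
            dimnames = list(c("r1", "r2", "r3", "r4"), c("u", "v")))
wanted <- c("r3", "r1")
rows <- match(wanted, rownames(m))   # 3 1
sub_m <- m[rows, ]                   # the wanted rows, in the requested order

# jitter() adds a small amount of random noise to each element;
# with amount = 0.1 every value moves by at most 0.1 in either direction
set.seed(1)
y <- rep(1:3, each = 4)
yj <- jitter(y, amount = 0.1)
```

The positions from which() are handy when the same subset needs to be pulled out of several objects, which is where the bracket shortcut alone falls short.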

Some of my favorite online references are listed below. When I want to know how to do something in R, before searching through Google, I'll often check these out first:

Quick R
A great site which has examples of using R for many different types of statistical analyses

R Graph Gallery

A great collection of graphs and plots

http://rgraphics.limnology.wisc.edu/index.php

Some basic plotting examples

One R Tip A Day

I haven't explored this site as much, but it has some nice things

When I find myself googling for specific examples it often results in searching through the R mailing lists archive. More likely than not it'll take a few rewordings of my question before I find exactly what I want. And I somehow have overlooked the obvious help files in R found by typing in ?function, as well as the great manuals found on The Comprehensive R Archive Network.

Sunday, October 3, 2010

Chart of different lty values in R

Here we can see the lty values available in R:

Only a few options, but using different colors and lwd sizes allows for many different options. Below is an example of how using different lty values helps distinguish between different types of data:

And the code:

plot(1, type="n", axes=F, xlab="", ylab="",xlim=c(0,10),ylim=c(0,20),
     main="List of lty values in R",cex.main=2)

lines(c(0,10),c(17,17),lwd=2,lty=1)

text(2,16,'lty = 1 (default)',pos=4)

lines(c(0,10),c(14,14),lwd=2,lty=2)

text(2,13,"lty = 2 -or- lty = 'dashed'",pos=4)

lines(c(0,10),c(11,11),lwd=2,lty=3)

text(2,10,"lty = 3 -or- lty= 'dotted'",pos=4)

lines(c(0,10),c(8,8),lwd=2,lty=4)

text(2,7,"lty = 4",pos=4)

lines(c(0,10),c(5,5),lwd=2,lty=5)

text(2,4,"lty = 5",pos=4)

lines(c(0,10),c(2,2),lwd=2,lty=6)

text(2,1,"lty = 6",pos=4)

plot(1,type='n',axes=T,xlab='',ylab='',xlim=c(0,10),ylim=c(1,10),main='Plot of Highway Data',cex.main=2)

abline(3,.75,lwd=2)

abline(1,.75,lwd=3,lty='dashed',col='yellow2')

abline(-1,.75,lwd=2)

lines(c(4,5),c(5,5.75),lwd=25,col='grey')

points(3.7,4.7,bg='red',pch=23,cex=2.5)

lines(c(3,3.5),c(2.1,2.475),lwd=20,col='blue')

points(3.2,2.25,cex=2,pch=19,col='skyblue4')

lines(c(8,8.3),c(8.2,8.45),lwd=20,col='green4')

points(8.19,8.35,cex=2,pch=23,bg='green3',col='green4')
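
The lty chart can also be drawn with a loop. This is only a sketch of an alternative, not the code that produced the image. It uses the character names R accepts for lty values 1 through 6, including 'dotdash', 'longdash' and 'twodash' for 4, 5 and 6; the variable names lty_names and y_pos are my own.

```r
# Character names that R accepts for lty = 1 through 6
lty_names <- c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash")

# Vertical position of each sample line: 17, 14, 11, 8, 5, 2
y_pos <- seq(17, 2, by = -3)

plot(1, type = "n", axes = FALSE, xlab = "", ylab = "",
     xlim = c(0, 10), ylim = c(0, 20),
     main = "List of lty values in R", cex.main = 2)

for (i in seq_along(lty_names)) {
  # Draw the sample line using the name rather than the number
  lines(c(0, 10), rep(y_pos[i], 2), lwd = 2, lty = lty_names[i])
  # Label it just underneath with both forms
  text(2, y_pos[i] - 1,
       sprintf("lty = %d -or- lty = '%s'", i, lty_names[i]), pos = 4)
}
```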

Friday, September 24, 2010

Charts of different pch values in R

I always forget what each pch value is, and I always end up going to this site to see the list: http://rgraphics.limnology.wisc.edu/pch.php. But I don't like having to click on the thumbnail image to then see the large version. Here is my version:

I don't understand why it has the value '0', and if there is a difference between 16 and 19 I have no idea what it is. And of course 20 could just be 19 with cex=.5.
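For what it's worth, the ?points help page does describe a difference: as I read it, 19 is drawn with a border line while 16 is not, so 19 looks a bit larger when lwd is large relative to cex, and 20 is a bullet about 2/3 the size of 19. Here is a small sketch to see it; the exaggerated cex and lwd values are arbitrary choices to make the effect visible, and pch_vals is just my name for the vector.

```r
# The three solid-circle symbols in question
pch_vals <- c(16, 19, 20)

plot(c(0.5, 3.5), c(0, 2), type = "n", axes = FALSE,
     xlab = "", ylab = "", main = "pch 16 vs 19 vs 20", cex.main = 2)

# A large lwd exaggerates the border that 19 has and 16 lacks
points(1:3, rep(1, 3), pch = pch_vals, cex = 3, lwd = 6)
text(1:3, rep(0.3, 3), pch_vals)
```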

Here are some other pch options, basically any character on the keyboard can be used:

Of course some of these are better than others. For example, if you are plotting financial data, it might be best to consider something like this:

Or if you have confusing data this may be appropriate:

And of course the purpose of using different pch characters is to easily distinguish different types of data on the same plot. For example, here we see the ^ characters representing teepees, and the ~ characters representing water:

To choose a particular pch for a plot, the code is either plot(x,y,pch=16) or plot(x,y,pch='~')

Source Code for above plots:
plot(c(.5,10),c(0,18),col='transparent',axes=F,
xlab='',ylab='',main='List of pch values in R',cex.main=2)
points(5,17,pch=0,cex=2)
text(5,16,'0')
points(c(1,3,5,7,9),c(rep(14,5)),pch=c(1:5),cex=2)
text(c(1,3,5,7,9),c(rep(13,5)),c('1','2','3','4','5'))
points(c(1,3,5,7,9),c(rep(11,5)),pch=c(6:10),cex=2)
text(c(1,3,5,7,9),c(rep(10,5)),c('6','7','8','9','10'))
points(c(1,3,5,7,9),c(rep(8,5)),pch=c(11:15),cex=2,bg='blue')
text(c(1,3,5,7,9),c(rep(7,5)),c('11','12','13','14','15'))
points(c(1,3,5,7,9),c(rep(5,5)),pch=c(16:20),cex=2)
text(c(1,3,5,7,9),c(rep(4,5)),c('16','17','18','19','20'))
points(c(1,3,5,7,9),c(rep(2,5)),pch=c(21:25),bg='lightblue',cex=2)
text(c(1,3,5,7,9),c(rep(1,5)),c('21','22','23','24','25'))

plot(c(.5,10),c(0,10),col='transparent',axes=F,
xlab='',ylab='',main='Some other pch values in R',cex.main=2)
points(c(1,3,5,7,9),c(rep(9,5)),pch=c('A','a','0','O','o'),cex=2)
points(c(1,3,5,7,9),c(rep(7,5)),pch=c('*','.','|','+'),cex=2)
points(c(1,3,5,7,9),c(rep(5,5)),pch=c('^','_','-','/','#'),cex=2)
points(c(1,3,5,7,9),c(rep(3,5)),pch=c('!','@','$','%','&'),cex=2)
points(c(1,3,5,7,9),c(rep(1,5)),pch=c('(',')','<','>','~'),cex=2)
mtext("Line 2", side=1, line=2, adj=0.0, cex=1, col="blue", outer=TRUE)

x<-seq(1,10,by=0.3333)
y<-10+.25*x+rnorm(length(x),0,.5)
plot(x,y,pch='$',axes=F,main='Plot of Financial Data',cex.main=2,cex.lab=1.5)
box()

x<-seq(1,10,by=0.3333)
y<-10+.25*x+rnorm(length(x),0,.5)
plot(x,y,pch='?',axes=F,main='Plot of Confusing Data',cex.main=2,cex.lab=1.5)
box()

x<-seq(2,7,by=.02)
y<-x+.25*x^3+rnorm(length(x),0,10)
x2<-seq(3,5,by=0.06)
y2<-22+10*x2+rnorm(length(x2),0,6)
plot(x,y,pch='~',main='Plot of Village by a River',col='blue',cex=2,cex.main=2,cex.lab=1.5)
points(x2,y2,pch='^',col='red',cex=2)