[Offtopic] R programming language

stephen at melbpc.org.au
Wed Jan 7 21:32:09 EST 2009


Data Analysts Captivated by R Power

By ASHLEE VANCE  www.nytimes.com  Published: January 6, 2009 

. R is the name of a popular programming language used by a growing 
number of data analysts inside corporations and academia. 

It is becoming their lingua franca partly because data mining has entered 
a golden age, whether being used to set ad prices, find new drugs more 
quickly or fine-tune financial models. Companies as diverse as Google, 
Pfizer, Merck, Bank of America, the InterContinental Hotels Group and 
Shell use it.

But R has also quickly found a following because statisticians, engineers 
and scientists without computer programming skills find it easy to use.

“R is really important to the point that it’s hard to overvalue it,” said 
Daryl Pregibon, a research scientist at Google, which uses the software 
widely. “It allows statisticians to do very intricate and complicated 
analyses without knowing the blood and guts of computing systems.”

It is also free. R is an open-source program, and its popularity reflects 
a shift in the type of software used inside corporations. Open-source 
software is free for anyone to use and modify ..

R is similar to other programming languages, like C, Java and Perl, in 
that it helps people perform a wide variety of computing tasks by giving 
them access to various commands. 

For statisticians, however, R is particularly useful because it contains 
a number of built-in mechanisms for organizing data, running calculations 
on the information and creating graphical representations of data sets. 

Some people familiar with R describe it as a supercharged version of 
Microsoft’s Excel spreadsheet software that can help illuminate data 
trends more clearly than is possible by entering information into rows 
and columns. 

What makes R so useful — and helps explain its quick acceptance — is that 
statisticians, engineers and scientists can improve the software’s code 
or write variations for specific tasks. Packages written for R add 
advanced algorithms, colored and textured graphs and mining techniques to 
dig deeper into databases. 

Close to 1,600 different packages reside on just one of the many Web 
sites devoted to R, and the number of packages has grown exponentially. 

One package, called BiodiversityR, offers a graphical interface aimed at 
making calculations of environmental trends easier. 

Another package, called Emu, analyzes speech patterns, while GenABEL is 
used to study the human genome. 

The financial services community has demonstrated a particular affinity 
for R; dozens of packages exist for derivatives analysis alone. 

“The great beauty of R is that you can modify it to do all sorts of 
things,” said Hal Varian, chief economist at Google. “And you have a lot 
of prepackaged stuff that’s already available, so you’re standing on the 
shoulders of giants.”

R first appeared in 1996, when the statistics professors Ross Ihaka and 
Robert Gentleman of the University of Auckland in New Zealand released 
the code as a free software package. 

According to them, the notion of devising something like R sprang up 
during a hallway conversation. They both wanted technology better suited 
for their statistics students, who needed to analyze data and produce 
graphical models of the information. Most comparable software had been 
designed by computer scientists and proved hard to use. 

Lacking deep computer science training, the professors considered their 
coding efforts more of an academic game than anything else. Nonetheless, 
starting in about 1991, they worked on R full time. “We were pretty much 
inseparable for five or six years,” Mr. Gentleman said. “One person would 
do the typing and one person would do the thinking.”

Some statisticians who took an early look at the software considered it 
rough around the edges. But despite its shortcomings, R immediately 
gained a following with people who saw the possibilities in customizing 
the free software. 

John M. Chambers, a former Bell Labs researcher who is now a consulting 
professor of statistics at Stanford University, was an early champion. 

At Bell Labs, Mr. Chambers had helped develop S, another statistics 
software project, which was meant to give researchers of all stripes an 
accessible data analysis tool. It was, however, not an open-source 
project. 

The software failed to generate broad interest and ultimately the rights 
to S ended up in the hands of Tibco Software. Now R is surpassing what 
Mr. Chambers had imagined possible with S. 

“The diversity and excitement around what all of these people are doing 
is great,” Mr. Chambers said.

While it is difficult to calculate exactly how many people use R, those 
most familiar with the software estimate that close to 250,000 people 
work with it regularly. 

The popularity of R at universities could threaten SAS Institute, the 
privately held business software company that specializes in data 
analysis software. SAS, with more than $2 billion in annual revenue, has 
been the preferred tool of scholars and corporate managers. 

“R has really become the second language for people coming out of grad 
school now, and there’s an amazing amount of code being written for it,” 
said Max Kuhn, associate director of nonclinical statistics at 
Pfizer. “You can look on the SAS message boards and see there is a 
proportional downturn in traffic.”

SAS says it has noticed R’s rising popularity at universities, despite 
educational discounts on its own software, but it dismisses the 
technology as being of interest to a limited set of people working on 
very hard tasks. 

“I think it addresses a niche market for high-end data analysts that want 
free, readily available code,” said Anne H. Milley, director of 
technology product marketing at SAS. She adds, “We have customers who 
build engines for aircraft. I am happy they are not using freeware when I 
get on a jet.”

But while SAS plays down R’s corporate appeal, companies like Google and 
Pfizer say they use the software for just about anything they can. 

Google, for example, taps R for help understanding trends in ad pricing 
and for illuminating patterns in the search data it collects. Pfizer has 
created customized packages for R to let its scientists manipulate their 
own data during nonclinical drug studies rather than send the information 
off to a statistician. 

The co-creators of R express satisfaction that such companies profit from 
the fruits of their labor and that of hundreds of volunteers. 

Mr. Ihaka continues to teach statistics at the University of Auckland and 
wants to create more advanced software. Mr. Gentleman is applying R-based 
software, called Bioconductor, in work he is doing on computational 
biology at the Fred Hutchinson Cancer Research Center in Seattle. 

“R is a real demonstration of the power of collaboration, and I don’t 
think you could construct something like this any other way,” Mr. Ihaka 
said. “We could have chosen to be commercial, and we would have sold five 
copies of the software.”

A version of this article appeared in print on January 7, 2009, on page 
B6 of the New York edition.
--

Cheers people
Stephen Loosley
Victoria, Australia

