Excel vs R

I made my first charts and did my first aggregations in high school and college, using Excel. As I went through college, grad school and ~7 years of work experience, I quickly picked up what I consider to be more advanced tools, like SQL, R, Python, Hadoop, LaTeX, etc.

We are interviewing for a data scientist position and one candidate advertises himself as a "senior data scientist" (a very buzzy term these days) with 15+ years of experience. When asked what his preferred toolset was, he responded that it was Excel.

I took this as evidence that he was not as experienced as his resume would claim, but wasn't sure. After all, just because it's not my preferred tool, doesn't mean it's not other people's. Do experienced data scientists use Excel? Can you assume a lack of experience from someone who does primarily use Excel?

-----------------------------

I'd say that Excel is useful for presenting results, not for doing the big data processing itself: your big data "customers" expect results to be presented the way they are used to. For example, it's not uncommon for me to export a resulting pandas data frame to Excel and then change only the layout of the resulting file ... to make the "customers" happy.

Do experienced data scientists use Excel?

I've seen some experienced data scientists who use Excel, either out of preference or because of the specifics of their workplace's business and IT environment (for example, many financial institutions use Excel as their major tool, at least for modeling). However, I think that most experienced data scientists recognize the need to use the tools that are optimal for particular tasks, and adhere to this approach.

Can you assume a lack of experience from someone who does primarily use Excel?

No, you cannot. This is a corollary of my thoughts above. Data science does not automatically imply big data: there is plenty of data science work that Excel can handle quite well. Having said that, if a data scientist (even an experienced one) does not have at least basic knowledge of modern data science tools, including big data-focused ones, it is somewhat disturbing. This is because experimentation is deeply ingrained in the nature of data science, with exploratory data analysis being an essential and even crucial part of it. Therefore, a person who does not have an urge to explore other tools within their domain could rank lower among candidates in overall fit for a data science position (of course, this is quite fuzzy, as some people are very quick at learning new material, and people might not have had an opportunity to satisfy their interest in other tools for various personal or workplace reasons).

Therefore, in conclusion, I think that the best answer an experienced data scientist might have to a question in regard to their preferred tool is the following: My preferred tool is the optimal one, that is the one that best fits the task at hand.


---
I would never fault someone for not knowing Hadoop, but even in small data situations I feel as if R is superior. There are simply a myriad of things you can do with R that you can't do with Excel. It concerns me that this individual has not "discovered" that in his 15+ years.
-----
Are you familiar with the term "good enough"? I'm also a big fan of R and would prefer it to many tools, Excel included, any day. However, the fact that R can do more doesn't imply that Excel (or any other tool suitable for a task) is inferior in a particular work context. So, while your concern is valid (that is what I meant by the word "disturbing"), it might be that the person hasn't had an opportunity or need to do that. Remember that you're talking about a time when R existed but was popular mostly in academia, and data science (termed data analysis or such) was not as hot as it is today.

Non-technical people often use Excel as a database replacement. I think that's wrong but tolerable. However, someone who is supposedly experienced in data analysis simply cannot use Excel as his main tool (excluding the obvious task of looking at the data for the first time). That's because Excel was never intended for that kind of analysis, and as a consequence it is incredibly easy to make mistakes in Excel (that's not to say that it isn't incredibly easy to make other kinds of mistakes with other tools, but Excel aggravates the situation even more).

To summarize what Excel doesn't have and is a must for any analysis:

  1. Reproducibility. A data analysis needs to be reproducible.
  2. Version control. Good for collaboration and also good for reproducibility. Instead of using xls, use csv (still very complex and has lots of edge cases, but csv parsers are fairly good nowadays.)
  3. Testing. If you don't have tests, your code is broken. If your code is broken, your analysis is worse than useless.
  4. Maintainability.
  5. Accuracy. Numerical accuracy and accurate date parsing, among others, are really lacking in Excel.
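
As a minimal sketch of points 1 and 3, here is what a scripted, testable aggregation could look like in Python. The data, the column names (`region`, `amount`) and the function `total_by_region` are hypothetical and chosen only for illustration; `Decimal` is used so that the sums do not pick up binary floating-point rounding.

```python
import csv
import io
from collections import defaultdict
from decimal import Decimal

# Hypothetical input; in practice this would be a CSV file kept under version control.
RAW_CSV = """region,amount
north,10.10
south,5.25
north,0.20
"""


def total_by_region(csv_text):
    """Sum the amount column per region, using Decimal to avoid float rounding."""
    totals = defaultdict(Decimal)  # Decimal() starts each region at 0
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["region"]] += Decimal(row["amount"])
    return dict(totals)


def test_total_by_region():
    # The same input always gives the same output, so this check can run on every change.
    expected = {"north": Decimal("10.30"), "south": Decimal("5.25")}
    assert total_by_region(RAW_CSV) == expected


test_total_by_region()
```

Because both the input and the steps are plain text, the whole analysis can be rerun, diffed and reviewed, which is hard to do with formulas scattered across spreadsheet cells.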

More resources:

European Spreadsheet Risks Interest Group - Horror Stories

You shouldn’t use a spreadsheet for important work (I mean it)

Microsoft's Excel Might Be The Most Dangerous Software On The Planet

Destroy Your Data Using Excel With This One Weird Trick!

Excel spreadsheets are hard to get right

--------------------------

Excel allows only very small data and doesn't have anything that is sufficiently useful and flexible for machine learning or even just plotting. All I would do in Excel is stare at a subset of the data for a first glance over the values, to make sure I don't miss anything visible to the eye.

So, if his favourite tool is Excel, this might suggest he rarely deals with machine learning, statistics, larger data sizes or any advanced plotting. Someone like this I wouldn't call a Data Scientist. Of course titles don't matter and it depends a lot on your requirements.

In any case, don't make a judgement based on statements of experience or a CV. I've seen CVs and known the people behind them.

Don't assume. Test him! You should be good enough to set up a test. It has been shown that interviews alone are close to useless to determine skills (they only show personality). Set up a very simple supervised learning test and let him use any tool he wants.
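
One way such a test could be scored without caring which tool the candidate used is to compare the labels they hand back against held-out labels. This is only a sketch under assumptions: the labels, the example data and the helper names `accuracy` and `majority_baseline` are all hypothetical.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the held-out labels."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("need at least one label")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def majority_baseline(train_labels, n):
    """Predict the most frequent training label n times (a floor to beat)."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n


# Hypothetical data: labels kept back by the interviewer and labels returned
# by the candidate, produced with whatever tool they prefer.
TRAIN_LABELS = ["ham", "ham", "spam"]
HELD_OUT = ["spam", "ham", "ham", "spam", "ham"]
CANDIDATE = ["spam", "ham", "spam", "spam", "ham"]

candidate_score = accuracy(HELD_OUT, CANDIDATE)
baseline_score = accuracy(HELD_OUT, majority_baseline(TRAIN_LABELS, len(HELD_OUT)))
print(candidate_score > baseline_score)
```

Comparing against a trivial baseline keeps the exercise about the result rather than about the tool.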

And if you want to screen people at an interview first, then ask him about very basic but important insights into statistics or machine learning. Something that every single one of your current employees knows.

--------------------------
So, does that mean specializing in only one thing is bad? No. Plenty of my friends specialize in just one main language and kill it. I know plenty of data guys who only know R and kill it. I also know plenty of people who just use Excel to analyze data because that's the only thing most non-data scientists can open and use (especially in B2B companies). The question you really need to answer is whether this one thing is the ONE thing you need for this position. And most importantly, can they learn new things?
--------------------------
In his book Data Smart, John Foreman solves common data science problems (clustering, naive Bayes, ensemble methods, ...) using Excel. Indeed, it's always good to have some knowledge of Python or R, but I guess Excel can still get most of the job done!

----------------
I think most people are answering without a good knowledge of Excel. Excel (since 2010) has an in-memory columnar [multi-table] database called Power Pivot (which accepts input from CSV files, databases, etc.), allowing it to store millions of rows (the data doesn't have to be loaded onto a spreadsheet). It also has an ETL tool called Power Query, which lets you read data from a variety of sources (including Hadoop). And it has visualisation tools (Power View and Power Map). A lot of data science is aggregation and top-n analysis, at which Power Pivot excels. Add to this the interactive nature of these tools (any user can easily drag and drop a dimension on which to break up the results) and I hope you can see the benefits. So yes, you can't do machine learning, but I would question how much machine learning data scientists do day to day: e.g. when I want to analyse the prediction errors made by a machine learning program, I find it easiest to slice and dice the errors with Excel.
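
For comparison, the same kind of aggregation and top-n analysis of prediction errors can be sketched in a few lines of Python. The prediction log, the `segment` dimension and the helper names below are hypothetical, not taken from any real project.

```python
from collections import defaultdict

# Hypothetical prediction log: (segment, actual label, predicted label).
PREDICTIONS = [
    ("mobile", 1, 0),
    ("mobile", 1, 1),
    ("desktop", 0, 0),
    ("desktop", 1, 1),
    ("tablet", 0, 1),
    ("tablet", 1, 0),
    ("mobile", 0, 1),
]


def error_rate_by_segment(rows):
    """Group rows by segment and return the share of wrong predictions in each."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [errors, total]
    for segment, actual, predicted in rows:
        counts[segment][0] += int(actual != predicted)
        counts[segment][1] += 1
    return {seg: errors / total for seg, (errors, total) in counts.items()}


def top_n(rates, n):
    """Return the n segments with the highest error rate, worst first."""
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)[:n]


rates = error_rate_by_segment(PREDICTIONS)
worst_two = top_n(rates, 2)
print(worst_two)
```

Swapping the grouping key for another column plays the role of dragging a different dimension into a pivot table.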