Learning Hub‎ > ‎

Learning programming SQL

Doing Statistics with SQL

http://www.kdnuggets.com/ This post covers how to perform some basic in-database statistical analysis using SQL.

By Jean-Nicholas Hould, JeanNicholasHould.com

I recently wrote an article on why you should learn SQL for data analysis. I received lots of feedback from this post. People wanted to see some concrete examples of using SQL for data analysis. I decided to apply some of my own advice and load up one of these awesome datasets to do some basic data exploration queries on it.

For this post, I am using an open data set. The million songs data sets is a«freely-available collection of audio features and metadata for a million contemporary popular music tracks».

If you want to follow along, here are the steps:

  1. Download the Million Song Data Set. I used the 10K songs subset.
  2. Download a SQLite Client
  3. Open subset_track_metadata.db file with SQLite and start exploring

Why Descriptive Statistics?

 
When I start the analysis of a new data set, I run some basic queries to get a sense of how the data is organized, how the values are distributed. I am trying to understand what I’m dealing with. For numerical data, I will often run some descriptive statistics on the data set: I will measure the central tendency (meanmedianmode) and measure the level of variability of the data. Those measurements are generally a good start to data exploration. They will often lead to new questions that fuel my analysis.

Central Tendency

 
Mean

The mean is the number you obtain when you sum up a given set of numbers and then divide this sum by the total number in the set. The mean is very sensible to outliers. It can be drastically affected by values that are much higher or lower compared to the rest of the data set.

SELECT CAST(AVG(songs.year) as int) as avg_year FROM songs

-- | avg_year |
-- |----------|
-- | 934      |


  • CASTRun-time data type conversion between compatible data types. In this case, I’m converting a float to an integer for rounding purpose.
  • AVGAggregation function that returns the mean of the input expression value.
  • as avg_yearTemporarily renames a column heading - This is only for readbility purpose, to make the code more human-friendly. I will use this aliasing throughout the post.

Median

The median is the number separating the higher half of a ordered data set from the lower half. The median is sometimes a better measure of a mid point because each data point is weighted equally.

SELECT songs.year as median_year
FROM songs 
ORDER BY songs.year 
LIMIT 1 
OFFSET (SELECT COUNT(*) FROM songs) / 2

-- | median_year |
-- |-------------|
-- | 0           |


  • ORDER BYSorts the resulting data set by one or more column. The ordering can be ascending ASC (default) or descending DESC.
  • COUNTAggregation function that returns the # of rows that matches a criteria.
  • LIMITSpecifies the maximum number of rows that can be returned from the resulting data set.
  • OFFSETSkip X rows before beginning to return rows. In this specific example, we are skipping 5000 rows (which is equal to the total row countCOUNT(*) divided by 2).

Mode

The mode is the value that appears most often in a set of data.

SELECT 
    songs.year,
    COUNT(*) as count
FROM songs
GROUP BY songs.year
ORDER BY COUNT(*) DESC
LIMIT 1

-- | year | count |
-- |------|-------|
-- | 0    | 5320  |


  • GROUP BYUsed with aggregation functions such as COUNTAVG, etc. Groups the resulting data set by one or more column.

Variability

 
Min/Max Values

Minimum/Maximum value in a data set.

SELECT 
    MIN(songs.year) as min_year,
    MAX(songs.year) as max_year
FROM
    songs

-- | min_year | max_year |
-- |----------|----------|
-- | 0        | 2010     |


  • MINAggregation function that returns the smallest value in a data set.
  • MAXAggregation function that returns the largest value in a data set.

Distribution of songs per year

Count of songs released in each years

SELECT 
    songs.year,
    COUNT(*) songs_count
FROM songs
GROUP BY songs.year
ORDER BY songs.year ASC

-- | year | song_count |
-- |------|------------|
-- | 0    | 5320       |
-- | 1926 | 2          |
-- | 1927 | 3          |
-- | ...  | ...        |
-- | 2009 | 250        |
-- | 2010 | 64         |


Next Steps

 
The SQL queries in this post are fairly simple. They are not the result of technical gymmstastics. They are just simple measurements helping us understand the data set and that’s what’s great about them.

In this specific data set, we noticed that for more than half of the data set, the year of the song is equal to 0. This means we are either looking at a data set of very old songs or that we are dealing with missing values. The latter is more realistic. If we filter out songs from year 0, our data makes more sense. The songs are ranging from 1926 to 2010 and the median is the year 2001.

With the data partially cleaned up, we can start exploring other columns of our data set and asking ourselves more questions: How many unique artists composed songs per year? How has that evolved through time? Are songs shorter nowadays than before? What’s great about those simple measurements is that they can be reused as our queries and filters get more and more complex. By applying the simple descriptive statistics measurements, we can have a good grasps of the data and we can keep exploring deeper and deeper.

Bio: Jean-Nicholas Hould is a Data Scientist from Montreal, Canada. Author at JeanNicholasHould.com.

7 Steps to Mastering SQL for Data Science

Follow these 7 steps to go from SQL data science newbie to seasoned practitioner quickly. No nonsense, just the necessities.

By , KDnuggets.

Structured Query Language (forever and always referred to henceforth as SQL, and usually pronounced "sequel") is the language for managing and querying data in relational database management systems (RDBMS).

So intertwined are the terms SQL and RDBMS that they are often conflated, sometimes by the uninitiated, but often simply out of convenience, and the term SQL is used adversarially to distignuish relational systems from non-relational database systems, which are often categorized by the umbrella term "NoSQL." Yet, SQL skills aren't wasted on non-RDBMS systems; the top data processing frameworks all have some implementation of SQL that sits atop their architecture, including Apache Big Data behemoths Spark andHadoop.


Together with R and Python, SQL forms the triumvirate of data science programming languages (liberal usage of the term "programming languages"); in fact, the most recent KDnuggets data science and analyticssoftware poll results indicate that these are also the 3 most-used tools by respondents, when all software is considered.

Clearly, SQL is important in data science. As such, this post aims to take a reader from SQL newbie to competent practitioner in a short time, using freely-available online resources. Lots of such resources exist on the internet, but mapping out a path from start to finish, using items which complement each other, is not always as straightforward as it may seem. Hopefully this post can be of assistance in this manner.

Step 1: Relational Database Basics

First off, since SQL is used to manage and query data in relational databases, having an understanding of what, exactly, a relational database is would be a good start.

Jennifer Widom is a renowned database researcher and professor at Stanford. She also put together one of the very first MOOCs on Coursera. Videos from this course are a good place to start:

A good introductory read on the relational database model is presented by UMass Boston's Dan Simovici here:

Extra credit: For a better theoretical understanding of the foundations upon which relational database systems are built, have a look at Widom's videos on relational algebra:

Step 2: SQL Overview

Now that we're all familiar with the relational paradigm, we shift focus to SQL.

First, get a high-level overview, again from Widom:

Read the first few pages of Simovici's SQL notes, or until it stops making intuitive sense:

For a great extended introductory overview of basic SQL commands, watch the following video from Khan Academy:

Once finished with the video, the following is a good treatment of SQL commands, with explanations, and can also serve as a useful reference moving forward, so keep it handy:

Finally, in preparation of the of next step, get an SQL environment up and running. You may not want to enter every SQL statement you encounter, but having an SQL interpreter up and running just makes sense. I suggest installing SQLite locally; it is a simple, but capable, SQL installation:



Step 3: Selecting, Inserting, Updating

While SQL can be used to perform a number of varied data management tasks, querying databases with SELECT, inserting records with INSERT, and updating existing records with UPDATE are some of the most heavily-used commands, and are good places to get started in practical SQL. Read and go through the following preliminary exercises:

Select

Insert

Update

This tutorial covers SQL basics and beginner to intermediate queries, and is a solid review of what we have seen to this point, and then some. It is presented from a MySQL point of view, but nothing in here is exclusive to that platform's SQL implementation.

  • MySQL Tutorial 1: Overview, Tables, Queries (from arachnoid.com)
  • Step 4: Creating, Dropping, Deleting

    Our second set of commands include those used to CREATE and DROP tables, as well as to DELETE records. With an understanding of this growing collection of commands, suddenly much of what could be referred to asregular data management and query is attainable (with practice, of course).

    Create

    Drop

    Delete

    Step 5: Views and Joins

    On to some slightly more advanced SQL topics. First, we have a look at views, which can be thoughts of as virtual tables populated by the results of queries, useful for a number of different scenarios including application development, data security, and eased data sharing.

    First, get the details on what views are:

    For beginner SQL practitioners in the data science realm, I would call views a "nice to have." Since the focus is likely to be more on data exploration, I would say that the next topic, joins, is a "must have."

    Read about what joins are, and why they are important (and get some examples):

    Joins come in different flavors, and likely one of the more complex topics you will cover while learning SQL is getting them straight. That's really more of a testament to the ease of SQL than the actual difficulty of learning about joins.

    Watch a video explaining inner joins, then check out a video on outer and cross joins:


    Visual representation of SQL joins

    See this visual representation of SQL joins:

    Finally, this tutorial reviews both joins and views:

    Step 6: SQL for Data Science

    OK, so you've made some headway with learning SQL. You can query some data, create and manage some tables, make a view if you had to, and even use joins in some more complex querying. But why are you learning all this again? For data science, right? Let's take a bit of a break from the technical to get an idea of this very topic.

    Here are a couple of discussions regarding what SQL can be used for in data science:

    Step 7: SQL Integration with Python, R

    Often we will find that SQL is embedded in, or called from within, software written in other programming languages as part of a larger system. For example, in web development you may find PHP or Ruby or some other language making calls to a database via SQL to input, modify, or retrieve application-related data. In data science, you may see SQL being called as part of some application written in, perhaps, Python or R. To that end, having an understanding of how these languages play with SQL is not a bad idea.

    Python:

    To gain an understanding of how Python and SQL can work together, readSebastian Raschka's fantastic and detailed post on using SQLite in Python:

    R:

    Here are a pair of resources for achieving R and SQL integration, which approach the topic from different directions:

    Further

    If you feel like a relentless regimen of reading about an SQL topic and following up with exercises, I recommend going through the following (freely-available) book:


Comments