7 handy SQL features for data scientists

No matter where you work odds are your company uses some sort of relational database. And nearly all of these use SQL for data retrieval. So it goes without saying that ultimatley if you want to play with data, you need to learn a little bit of SQL. You might argue that SQL does a lot of the same things that you can do in R, Python, or (heaven forbid) Excel. This is true to a certain extent, but by writing the SQL code yourself there a few advantages:

Performing operations in the database is way more effecient
Easily reproducible scripts
Closer to the data

Demand for data scientists is booming, but even highly quantitative analysts and PhDs may have limited experience with databases. In making the transition from academia to industry, a little SQL can go a long way. Sometimes just knowing what features are out there is all it takes.

This is a post about awesome things you can do with SQL.

Why Postgres?

There are lots of database flavors: MySQL, PostgreSQL, Oracle, Microsoft SQL Server. Even with all of the varients, the SQL syntax remians relatively constant across all of them. We’re using PostgreSQL b/c it’s widely used, has handy analytical features, and it’s popular in both data warehousing and production.

Generating queries from a query

I often find the need to use data in Postgres to fetch related data found in another system such as a CRM, MongoDB, or a website. Doing basic string concatenation makes it really easy to generate these queries in mass. I’ve found it to be really useful for spidering and web scraping.

1 2 3 4 5 6 7 8 9 10 11

–MongoDB or other databases

select

‘db.foo.findOne({_id: ObjectId(“‘ || _id || ‘”)})’

from foo;

–db.foo.find({_id: ObjectId(“5066fa4abd8ad53408a4869c”)})

–web based (great for spidering)

select

‘http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=’ || name as url

from

restaurants;

view raw generate_queries.sql This Gist brought to you by GitHub.

You can use psycopg2 to query Postgres from Python and then use requests to fetch data from Yelp.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29

import psycopg2

import requests

conn = psycopg2.connect(“{YOUR CONNECTION”})

cur = conn.cursor()

q = “””

select

‘http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=’ || name as url

from

restaurants

limit 3;”””

cur.execute(q)

urls = cur.fetchall()

print urls

#[“http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=ray’s pizza”,

# ‘http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=shake shack’,

# ‘http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=rubirosa’]

# find the number of times pizza is mentioned in each yelp search results

for url in urls:

html = requests.get(url).text

print url, html.count(“pizza”)

#http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=ray’s pizza 212

#http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=shake shack 0

#http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=rubirosa 63

view raw spider_with_sql.py This Gist brought to you by GitHub.

Basic date operations

Dates are always tricky. Despite best efforts, dates never seem to be in the format that you need them in.

iso_8601 Fear not, Postgres has fantasic date functions to deal with this stuff.

date_trunc, to_char, and to_timestamp can take you a pretty long way in terms for formatting and type conversion. For extracting certain values from a date (say the day of the week) you can use date_part.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

select

date_trunc(‘year’, ‘2013-04-05’::date) as year

, date_trunc(‘month’, ‘2013-04-05’::date) as month

, date_trunc(‘day’, ‘2013-04-05’::date) as day;

— year | month | day

————————–+————————+————————

— 2013-01-01 00:00:00-05 | 2013-04-01 00:00:00-04 | 2013-04-05 00:00:00-04

select

to_char(‘2013-04-05’::date, ‘YYYY’) as year

, to_char(‘2013-04-05’::date, ‘MM-YYYY’) as month

, to_char(‘2013-04-05’::date, ‘YYYYMMDD’) as day;

— year | month | day

——–+———+———-

— 2013 | 04-2013 | 20130405

select

to_timestamp(1365560673);

— to_timestamp

————————

— 2013-04-09 22:24:33-04

select date_part(‘day’, ‘2013-04-05’::date);

— date_part

————-

— 5

select date_part(‘second’, to_timestamp(1365560673));

–date_part

————-

— 33

view raw basic_dates.sql This Gist brought to you by GitHub.

Text mining

When I’m doing text mining I’m often tempted to immediately reach for a scripting language. While a language like Python, Ruby, or R definitely should make a showing in any text mining project, I advocate going as far as you can with SQL. Postgres has great built-in string functions that’ll run much faster than any scripting language.

My favorite Postgres string function is regexp_split_to_table which takes a piece of text, splits it by a pattern, and returns tokens as rows.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

–use SQL for parsing text

select

speaker

, regexp_split_to_table(lower(line), ‘\s+’) as word

from

script_ferris_bueller

where

speaker = ‘FERRIS’

LIMIT 10;

–word counts by character

select

speaker

, regexp_split_to_table(lower(line), ‘\s+’) as word

, count(1) as wc

from

script_ferris_bueller

group by

speaker

, word

order by

3 desc

LIMIT 10;

— speaker | word | wc

———–+——+—–

— FERRIS | a | 136

— FERRIS | i | 120

— FERRIS | to | 119

— FERRIS | the | 112

— FERRIS | you | 104

— FERRIS | and | 101

— FERRIS | of | 68

— FERRIS | i’m | 52

— FERRIS | in | 49

— CAMERON | i | 49

–(10 rows)

view raw string_split_example.sql This Gist brought to you by GitHub.

See #5 below for loading the Ferris Bueller dataset.

More string functions

Median aggregate function

Median is sort of the lost aggregate function in Postgres. It’s one of those that everyone assumes comes standard but it actually doesn’t for reasons that are beyond the scope of this post. For many DBAs and DW Engineers, it’s the first custom function they write for a new Postgres instance. Here you’ll find the page from the Postgres wiki with code snippets for a median function. In the gist below I’m using the pure SQL implementation.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41

–median (http://wiki.postgresql.org/wiki/Aggregate_Median)

CREATE OR REPLACE FUNCTION _final_median(anyarray)

RETURNS float8 AS

WITH q AS

(

SELECT val

FROM unnest($1) val

WHERE VAL IS NOT NULL

ORDER BY 1

cnt AS

(

SELECT COUNT(*) AS c FROM q

)

SELECT AVG(val)::float8

FROM

(

SELECT val FROM q

LIMIT 2 – MOD((SELECT c FROM cnt), 2)

OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) – 1,0)

) q2;

LANGUAGE sql IMMUTABLE;

CREATE AGGREGATE median(anyelement) (

SFUNC=array_append,

STYPE=anyarray,

FINALFUNC=_final_median,

INITCOND='{}’

);

–median and avg scores of NCAA football games

SELECT

median(score)

,avg(score)

FROM

game_results;

— median | avg

———-+———————

— 27 | 28.6536160957041871

view raw pg_median.sql This Gist brought to you by GitHub.

`\COPY` to load data into your database

Sooner or later you’re going to want to load some of that great data you’ve scraped from Yelp or some Census CSV file into your database. Once you have your flat file ready, cd to the directory and connect to your Postgres instance.

We’re going to load some dialogue from Ferris Bueller’s Day Off into our database. You can get the file here. Open up Postgres in the terminal:

$ psql -d {your database} -U {your username}

Now execute the \COPY command, specifying the table name, column names (optional), filename, and if you’re handling a CSV, then include the CSV argument. This file also has a header row so add an additional argument HEADER.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36

–create a table to store the data

create table script_ferris_bueller

(

tag_type varchar,

speaker varchar,

line text,

line_begin int,

line_end int,

movie_name varchar

);

–upload the data

\COPY script_ferris_bueller FROM ‘ferris_buellers_day_off.txt’ CSV HEADER

–more explicit but useful if your columns aren’t in order

\COPY script_ferris_bueller(tag_type,speaker,line,line_begin,line_end,movie_name)

FROM ‘ferris_buellers_day_off.txt’ CSV HEADER

–turn on extended display

–take a peak at the data

select * from script_ferris_bueller order by random() limit 1;

—[ RECORD 1 ]———————————–

–tag_type | dialogue

–speaker | FERRIS

–line | This is optomism. It’s a common

— | trait with my age group. Adults

— | think it’s cute, it’s like a charming

— | quick that infests youth. But it’s

— | a cool thing and I think, deep down,

— | crusty old shits wish they had some.

— | They wish they had her, too.

— | (points to Sloane)

— | Sorry. She’s taken.

–line_begin | 2188

–line_end | 2197

–movie_name | ferris_bueller’s_day_off

view raw postgres_copy.sql This Gist brought to you by GitHub.

And there you have it, you can now query Ferris Bueller’s Day Off!

Sometimes you’ll get just one row that has some strange comma or hyphon that messes up your \COPY statement. To skirt around this, create a staging table where every column is varchar. Once the data is in your database, create a query that casts columns to the appropriate data type and deal w/ isolated issues at that point.

Generating sequences

generate_series is a great Postgres built in function for generating sequences of numbers. You can use it for creating ranges of dates and times, handling time series, funnels, etc. It also provides an easy to way enumerate over tables and prevents you from having to write for loops in your SQL code (not fun).

MySQL version

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45

–generate_series

select

from

generate_series(0,100,10);

— generate_series

—————–

— 0

— 10

— 20

— 30

— 40

— 50

— 60

— 70

— 80

— 90

— 100

–(11 rows)

–using our median function

select

median(val)

from

generate_series(0,100,5) as val;

— median

——–

— 50

–(1 row)

select

current_date + step.i as date_series

from

generate_series(0,14,2) as step(i);

— date_series

————-

— 2013-04-09

— 2013-04-11

— 2013-04-13

— 2013-04-15

— 2013-04-17

— 2013-04-19

— 2013-04-21

— 2013-04-23

–(8 rows)

view raw generate_series_example.sql This Gist brought to you by GitHub.

Assorted things you should know

Some of these are pretty standard, but I couldn’t chance leaving any of these out. I use these features every day.

order by random() is extremely helpful when you’re just trying to get a feel for a table or database. It does pretty much what it sounds like it does–it returns things in random order! Double colon :: is shorthand for casting columns to different data types. It works well with dates (and does a pretty good job of guessing what format you’re using), and does just as good with floats, ints, and other numerics. Select into syntax is by far the easiest way I’ve found for creating tables on the fly. It’s ruined me for pre-defining tables.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27

–returns rows in random order

select * from sometable order by random();

–use ‘::’ for casting

select ‘2013-01-01’::date;

— date

————–

— 2013-01-01

select 1 / 10 as non_casted, 1::float8 / 10 as casted;

— non_casted | casted

————–+——–

— 0 | 0.1

–creating a new table on the fly

select

into

newtable

from

oldtable;

–or with a temp table

select

into

temp new_temptable

from

oldtable;

Joseph Forbes (691)

Information Technology Consultant. For SMB, SOHO, and Online business. From Computers to Telecommunications this guy has been into it since hippies made it hip. Drone Pilot and Tech Aficionado I get to travel the State of Texas to help businesses succeed.

7 handy SQL features for data scientists

Why Postgres?

Generating queries from a query

Basic date operations

Text mining

More string functions

Median aggregate function

`\COPY` to load data into your database

Generating sequences

Assorted things you should know

Joseph Forbes (691)

User Menu

Recent Posts

Archived Posts

7 handy SQL features for data scientists

Why Postgres?

Generating queries from a query

Basic date operations

Text mining

More string functions

Median aggregate function

\COPY to load data into your database

Generating sequences

Assorted things you should know

Joseph Forbes (691)

User Menu

Recent Posts

Archived Posts

`\COPY` to load data into your database