No matter where you work odds are your company uses some sort of relational database. And nearly all of these use SQL for data retrieval. So it goes without saying that ultimatley if you want to play with data, you need to learn a little bit of SQL. You might argue that SQL does a lot of the same things that you can do in R, Python, or (heaven forbid) Excel. This is true to a certain extent, but by writing the SQL code yourself there a few advantages:
- Performing operations in the database is way more effecient
- Easily reproducible scripts
- Closer to the data
Demand for data scientists is booming, but even highly quantitative analysts and PhDs may have limited experience with databases. In making the transition from academia to industry, a little SQL can go a long way. Sometimes just knowing what features are out there is all it takes.
This is a post about awesome things you can do with SQL.
Why Postgres?
There are lots of database flavors: MySQL, PostgreSQL, Oracle, Microsoft SQL Server. Even with all of the varients, the SQL syntax remians relatively constant across all of them. We’re using PostgreSQL b/c it’s widely used, has handy analytical features, and it’s popular in both data warehousing and production.
-
Generating queries from a query
I often find the need to use data in Postgres to fetch related data found in another system such as a CRM, MongoDB, or a website. Doing basic string concatenation makes it really easy to generate these queries in mass. I’ve found it to be really useful for spidering and web scraping.
1 2 3 4 5 6 7 8 9 10 11 –MongoDB or other databasesselect‘db.foo.findOne({_id: ObjectId(“‘ || _id || ‘”)})’from foo;–db.foo.find({_id: ObjectId(“5066fa4abd8ad53408a4869c”)})–web based (great for spidering)select‘http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=’ || name as urlfromrestaurants;You can use
psycopg2
to query Postgres from Python and then userequests
to fetch data from Yelp.1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 import psycopg2import requestsconn = psycopg2.connect(“{YOUR CONNECTION”})cur = conn.cursor()q = “””select‘http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=’ || name as urlfromrestaurantslimit 3;”””cur.execute(q)urls = cur.fetchall()print urls#[“http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=ray’s pizza”,# ‘http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=shake shack’,# ‘http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=rubirosa’]# find the number of times pizza is mentioned in each yelp search resultsfor url in urls:html = requests.get(url).textprint url, html.count(“pizza”)#http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=ray’s pizza 212#http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=shake shack 0#http://www.yelp.com/search?find_loc=Manhattan%2C+NY&ns=1&find_desc=rubirosa 63 -
Basic date operations
Dates are always tricky. Despite best efforts, dates never seem to be in the format that you need them in.
Fear not, Postgres has fantasic date functions to deal with this stuff.
date_trunc
,to_char
, andto_timestamp
can take you a pretty long way in terms for formatting and type conversion. For extracting certain values from a date (say the day of the week) you can usedate_part
.1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 selectdate_trunc(‘year’, ‘2013-04-05’::date) as year, date_trunc(‘month’, ‘2013-04-05’::date) as month, date_trunc(‘day’, ‘2013-04-05’::date) as day;— year | month | day————————–+————————+————————— 2013-01-01 00:00:00-05 | 2013-04-01 00:00:00-04 | 2013-04-05 00:00:00-04selectto_char(‘2013-04-05’::date, ‘YYYY’) as year, to_char(‘2013-04-05’::date, ‘MM-YYYY’) as month, to_char(‘2013-04-05’::date, ‘YYYYMMDD’) as day;— year | month | day——–+———+———-— 2013 | 04-2013 | 20130405selectto_timestamp(1365560673);— to_timestamp————————— 2013-04-09 22:24:33-04select date_part(‘day’, ‘2013-04-05’::date);— date_part————-— 5select date_part(‘second’, to_timestamp(1365560673));–date_part————-— 33 -
Text mining
When I’m doing text mining I’m often tempted to immediately reach for a scripting language. While a language like Python, Ruby, or R definitely should make a showing in any text mining project, I advocate going as far as you can with SQL. Postgres has great built-in string functions that’ll run much faster than any scripting language.
My favorite Postgres string function is
regexp_split_to_table
which takes a piece of text, splits it by a pattern, and returns tokens as rows.1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 –use SQL for parsing textselectspeaker, regexp_split_to_table(lower(line), ‘\s+’) as wordfromscript_ferris_buellerwherespeaker = ‘FERRIS’LIMIT 10;–word counts by characterselectspeaker, regexp_split_to_table(lower(line), ‘\s+’) as word, count(1) as wcfromscript_ferris_buellergroup byspeaker, wordorder by3 descLIMIT 10;— speaker | word | wc———–+——+—–— FERRIS | a | 136— FERRIS | i | 120— FERRIS | to | 119— FERRIS | the | 112— FERRIS | you | 104— FERRIS | and | 101— FERRIS | of | 68— FERRIS | i’m | 52— FERRIS | in | 49— CAMERON | i | 49–(10 rows)See #5 below for loading the Ferris Bueller dataset.
More string functions
-
Median aggregate function
Median is sort of the lost aggregate function in Postgres. It’s one of those that everyone assumes comes standard but it actually doesn’t for reasons that are beyond the scope of this post. For many DBAs and DW Engineers, it’s the first custom function they write for a new Postgres instance. Here you’ll find the page from the Postgres wiki with code snippets for a median function. In the gist below I’m using the pure SQL implementation.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 –median (http://wiki.postgresql.org/wiki/Aggregate_Median)CREATE OR REPLACE FUNCTION _final_median(anyarray)RETURNS float8 AS$$WITH q AS(SELECT valFROM unnest($1) valWHERE VAL IS NOT NULLORDER BY 1),cnt AS(SELECT COUNT(*) AS c FROM q)SELECT AVG(val)::float8FROM(SELECT val FROM qLIMIT 2 – MOD((SELECT c FROM cnt), 2)OFFSET GREATEST(CEIL((SELECT c FROM cnt) / 2.0) – 1,0)) q2;$$LANGUAGE sql IMMUTABLE;CREATE AGGREGATE median(anyelement) (SFUNC=array_append,STYPE=anyarray,FINALFUNC=_final_median,INITCOND='{}’);–median and avg scores of NCAA football gamesSELECTmedian(score),avg(score)FROMgame_results;— median | avg———-+———————— 27 | 28.6536160957041871 -
\COPY
to load data into your databaseSooner or later you’re going to want to load some of that great data you’ve scraped from Yelp or some Census CSV file into your database. Once you have your flat file ready,
cd
to the directory and connect to your Postgres instance.We’re going to load some dialogue from Ferris Bueller’s Day Off into our database. You can get the file here. Open up Postgres in the terminal:
$ psql -d {your database} -U {your username}
Now execute the
\COPY
command, specifying the table name, column names (optional), filename, and if you’re handling a CSV, then include theCSV
argument. This file also has a header row so add an additional argumentHEADER
.1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 –create a table to store the datacreate table script_ferris_bueller(tag_type varchar,speaker varchar,line text,line_begin int,line_end int,movie_name varchar);–upload the data\COPY script_ferris_bueller FROM ‘ferris_buellers_day_off.txt’ CSV HEADER–more explicit but useful if your columns aren’t in order\COPY script_ferris_bueller(tag_type,speaker,line,line_begin,line_end,movie_name)FROM ‘ferris_buellers_day_off.txt’ CSV HEADER–turn on extended display\x–take a peak at the dataselect * from script_ferris_bueller order by random() limit 1;—[ RECORD 1 ]———————————––tag_type | dialogue–speaker | FERRIS–line | This is optomism. It’s a common— | trait with my age group. Adults— | think it’s cute, it’s like a charming— | quick that infests youth. But it’s— | a cool thing and I think, deep down,— | crusty old shits wish they had some.— | They wish they had her, too.— | (points to Sloane)— | Sorry. She’s taken.–line_begin | 2188–line_end | 2197–movie_name | ferris_bueller’s_day_offAnd there you have it, you can now query Ferris Bueller’s Day Off!
Sometimes you’ll get just one row that has some strange comma or hyphon that messes up your
\COPY
statement. To skirt around this, create a staging table where every column is varchar. Once the data is in your database, create a query that casts columns to the appropriate data type and deal w/ isolated issues at that point. -
Generating sequences
generate_series
is a great Postgres built in function for generating sequences of numbers. You can use it for creating ranges of dates and times, handling time series, funnels, etc. It also provides an easy to way enumerate over tables and prevents you from having to write for loops in your SQL code (not fun).1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 –generate_seriesselect*fromgenerate_series(0,100,10);— generate_series—————–— 0— 10— 20— 30— 40— 50— 60— 70— 80— 90— 100–(11 rows)–using our median functionselectmedian(val)fromgenerate_series(0,100,5) as val;— median——–— 50–(1 row)selectcurrent_date + step.i as date_seriesfromgenerate_series(0,14,2) as step(i);— date_series————-— 2013-04-09— 2013-04-11— 2013-04-13— 2013-04-15— 2013-04-17— 2013-04-19— 2013-04-21— 2013-04-23–(8 rows) -
Assorted things you should know
Some of these are pretty standard, but I couldn’t chance leaving any of these out. I use these features every day.
order by random()
is extremely helpful when you’re just trying to get a feel for a table or database. It does pretty much what it sounds like it does–it returns things in random order! Double colon::
is shorthand for casting columns to different data types. It works well with dates (and does a pretty good job of guessing what format you’re using), and does just as good with floats, ints, and other numerics. Select into syntax is by far the easiest way I’ve found for creating tables on the fly. It’s ruined me for pre-defining tables.1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 –returns rows in random orderselect * from sometable order by random();–use ‘::’ for castingselect ‘2013-01-01’::date;— date————–— 2013-01-01select 1 / 10 as non_casted, 1::float8 / 10 as casted;— non_casted | casted————–+——–— 0 | 0.1–creating a new table on the flyselect*intonewtablefromoldtable;–or with a temp tableselect*intotemp new_temptablefromoldtable;