• Comparison operators

    • LIKE is case sensitive
    • ILIKE is not
    • BETWEEN includes the range bounds
    • IS NULL is for missing values. column=NULL will not work.
    • For non null values - use IS NOT NULL. Don’t use NOT IS NULL
  • Aggregate function

    • COUNT
      • COUNT(col) gives of count of col where values are not null.
      • This is the main difference between COUNT(*) and COUNT(col)
    • SUM, AVG and COUNT can only be used for numeric values. (duh!)
    • AVG ignores null values completely. Does not take in the denominator
    • MIN and MAX can be used for numeric, date, chars
  • GROUP BY

    • LIMIT executes after the GROUP BY
  • HAVING statement has to be written again.

  • CASE WHEN

    • CASE WHEN is equivalent to IF/ELSE

      • CASE WHEN (condition) THEN val
      • WHEN (condition) THEN val
      • ELSE val END as ‘colname’
      • CASE must include the following components: WHEN, THEN, and END. ELSE is an optional component.
    • CASE WHEN is best used with aggregate statements.

      • CASE WHEN, COUNT(*) can also be written with subquery and group by. The difference is data storage in cache.
    • CASE WHEN can be used to pivot tables

      SELECT COUNT(CASE WHEN year = ‘FR’ THEN 1 ELSE NULL END) AS fr_count, COUNT(CASE WHEN year = ‘SO’ THEN 1 ELSE NULL END) AS so_count

      FROM benn.college_football_players

  • DISTINCT

    • Unique pairs - DISTINCT col1, col2
    • COUNT(DISTINCT month) not DISTINCT COUNT(month)
  • ALIAS table name - FROM benn.college_football_players players

  • WHERE clause inside join

    • In this case, filtering happens before join happens

    • LEFT JOIN tutorial.crunchbase_acquisitions acquisitions

      ON companies.permalink = acquisitions.company_permalink

      AND acquisitions.company_permalink != ‘/company/1000memories’

      both conditions have to be satisfied for join to take place.

      So basically, where clause will be applied to only table (most useful when the where is applied on the column that is used for join)

  • UNION

    • UNION is rowbind. removes duplicates
    • UNION ALL does not remove duplicates
  • CAST(colname as integer) will change the column data type

  • Date

  • String functions

    • LEFT(col, 10) - first 10 chars
    • RIGHT(col, 10)- last 10 chars
    • TRIM(both ‘()’ from col) - remove characters from column
      • 3 arguments
          1. where to remove both, leading, trailing
        • 2.what to remove (should be put in quotes)
        • 3.which col to remove
    • POSITION, STRPOS - Case sensitive
      • for find the position number of a character in a string
      • POSITION(‘A’ on colname) or STRPOS(colname, ‘A’)
    • SUBSTR
      • for finding the substring
      • SUBSTR(colname, 4, 2) - substr(colname, starting pos, num chars)
    • CONCAT
      • concatenate two or 3 strings
      • CONCAT(col1, ‘hello’, LEFT(col2, 4))
    • UPPER, LOWER
      • turning strings into cases
    • COALESCE to replace null values with a string
      • COALESCE(colname, ’no desc’)
  • Window functions

    • A window function performs a calculation across a set of table rows that are somehow related to the current row. This is comparable to the type of calculation that can be done with an aggregate function. But unlike regular aggregate functions, use of a window function does not cause rows to become grouped into a single output row — the rows retain their separate identities. Behind the scenes, the window function is able to access more than just the current row of the query result.
    • FORMAT () OVER ()
      • (aggregate function) OVER (partition by col or order by col)
    • Functions
      • All aggregate functions will work
      • ROW_NUMBER()
      • RANK() and DENSE_RANK()
        • row number will give different numbers irrespective of values being same
        • rank will give same numbers(rank) for same values but the next rank would be skipped
        • dense rank is same as rank but the next rank will not be skipped
      • NTILE(num)
        • this will divide the partition column into number of buckets specified by num and give where a given row falls into
        • NTILE(4) - quartile
        • NTILE(5) - quintile
        • NTILE(100) - percentile
      • LAG() and LEAD()
        • LAG is the one before
        • LEAD will give the value of the row after
            
LAG and LEAD window functions example
- LAG(colname, 2) will skip 2 rows instead of 1 - WINDOW statement - SELECT start_terminal, duration_seconds, NTILE(4) OVER ntile_window AS quartile, NTILE(5) OVER ntile_window AS quintile, NTILE(100) OVER ntile_window AS percentile FROM tutorial.dc_bikeshare_q1_2012 WINDOW ntile_window AS (PARTITION BY start_terminal ORDER BY duration_seconds
  • Performance tuning
    • Reducing table size
      • use filters using WHERE
      • Use LIMITS in joining or subqueries if you are just testing out the functionality of the query and are not actually looking at the results
    • JOINS
      • Aggregates before join if possible
    • EXPLAIN to get an understanding of how the query will execute