JOINs, and some VIEWs, in MySQL

This is the third in a series of posts on MySQL, starting with Installing MySQL on Ubuntu 14.04 and Employees database for MySQL, setup and simple queries. The second post also covered simple queries using SELECT, LIMIT, ORDER BY, etc. If you are starting out, you should start with these posts. In this post I will go over multi-table queries, using JOINs (with some help from VIEWs). As you might expect, I’ll use the employees sample database, setup in the previous post. If that sounds like fun, or at least useful, follow along.

Let’s get started... first I use our user account to logon:

$ mysql -u username -p

then, I switch to the employees database – if you don’t have these setup look at my previous posts on these topics (links above):

mysql> USE employees;

Let’s take a look at the tables again to remember what’s in the database:

mysql> SHOW TABLES;
+---------------------+
| Tables_in_employees |
+---------------------+
| departments         |
| dept_emp            |
| dept_manager        |
| employees           |
| salaries            |
| titles              |
+---------------------+
6 rows in set (0.00 sec)

I’ll focus on employee salaries, so let’s look at these tables, as well as the number of entries:

mysql> DESCRIBE employees;
+------------+---------------+------+-----+---------+-------+
| Field      | Type          | Null | Key | Default | Extra |
+------------+---------------+------+-----+---------+-------+
| emp_no     | int(11)       | NO   | PRI | NULL    |       |
| birth_date | date          | NO   |     | NULL    |       |
| first_name | varchar(14)   | NO   |     | NULL    |       |
| last_name  | varchar(16)   | NO   |     | NULL    |       |
| gender     | enum('M','F') | NO   |     | NULL    |       |
| hire_date  | date          | NO   |     | NULL    |       |
+------------+---------------+------+-----+---------+-------+
6 rows in set (0.00 sec)

mysql> SELECT COUNT(emp_no) AS NumEmployees FROM employees;
+--------------+
| NumEmployees |
+--------------+
|       300024 |
+--------------+
1 row in set (0.07 sec)

mysql> DESCRIBE salaries;
+-----------+---------+------+-----+---------+-------+
| Field     | Type    | Null | Key | Default | Extra |
+-----------+---------+------+-----+---------+-------+
| emp_no    | int(11) | NO   | PRI | NULL    |       |
| salary    | int(11) | NO   |     | NULL    |       |
| from_date | date    | NO   | PRI | NULL    |       |
| to_date   | date    | NO   |     | NULL    |       |
+-----------+---------+------+-----+---------+-------+
4 rows in set (0.01 sec)

mysql> SELECT COUNT(emp_no) AS NumSalaries FROM salaries;
+-------------+
| NumSalaries |
+-------------+
|     2844047 |
+-------------+
1 row in set (0.66 sec)

Hmm, many more entries in the salaries table. This table must include a complete history with all salary levels. We can use the DISTINCT command to get a count of the number of unique employees in the salaries table:

mysql> SELECT COUNT(DISTINCT emp_no) AS NumSalaries FROM salaries;
+-------------+
| NumSalaries |
+-------------+
|      300024 |
+-------------+
1 row in set (0.67 sec)

Good, that matches with the number of employees, as expected. However, when I do later queries I don’t want to deal with multiple salaries for each employee, instead let’s use the maximum value available. To do this I will create a VIEW – a virtual table made from a query that we can use for later queries.

A detour on VIEWs

A VIEW can be created with a query as follows (I’ll post just the code, without the mysql prompt, so that the statement can be copied into a running MySQL session– use CNTRL-SHFT-V to paste to the terminal):

CREATE VIEW salaries_max AS
SELECT
  emp_no, MAX(salary) AS salary
FROM
  salaries
GROUP BY
  emp_no;

Now, if I look at the tables in the database I see the new VIEW salaries_max created above:

mysql> SHOW TABLES;
+---------------------+
| Tables_in_employees |
+---------------------+
| departments         |
| dept_emp            |
| dept_manager        |
| employees           |
| salaries            |
| salaries_max        |
| titles              |
+---------------------+
7 rows in set (0.00 sec)

Back to JOINS

Let’s try to get the names and genders of the top-10 paid employees. To do this I need to pull from two tables. This requires a JOIN. In this case, the query is:

SELECT
  msa.salary, em.emp_no, em.first_name,
  em.last_name, em.gender
FROM
  salaries_max AS msa
  JOIN
  employees AS em
ON
  msa.emp_no = em.emp_no
ORDER BY
  msa.salary DESC
LIMIT 10;

resulting in:

+--------+--------+------------+-----------+--------+
| salary | emp_no | first_name | last_name | gender |
+--------+--------+------------+-----------+--------+
| 158220 |  43624 | Tokuyasu   | Pesch     | M      |
| 156286 | 254466 | Honesty    | Mukaidono | M      |
| 155709 |  47978 | Xiahua     | Whitcomb  | M      |
| 155513 | 253939 | Sanjai     | Luders    | M      |
| 155377 | 109334 | Tsutomu    | Alameldin | M      |
| 154459 |  80823 | Willard    | Baca      | M      |
| 154376 | 493158 | Lidong     | Meriste   | M      |
| 153715 | 205000 | Charmane   | Griswold  | M      |
| 152710 | 266526 | Weijing    | Chenoweth | F      |
| 152687 | 237542 | Weicheng   | Hatcliff  | F      |
+--------+--------+------------+-----------+--------+
10 rows in set (0.92 sec)

Let’s point out the essential features of this query:

  • In the SELECT portion of the query I must specify the source tables, here I use msa and em– these are aliases defined later in the query.
  • The FROM portion of the query uses AS to alias (give short names) to the source tables and specify the JOIN – in this case I use an INNER JOIN. I’ll go over the different type below.
  • The ON statement specifies the field used to JOIN the tables. In this case I use the unique employee number emp_no.
  • The ORDER BY and LIMIT serve the same purpose as we’ve seen with single-table queries in previous posts– to sort and limit the number of records returned.

So, that’s the basics of a JOIN. Note: If I had not constructed our new VIEW, I would get the following (notice that I use the salaries table instead of salaries_max view):

SELECT
  sa.salary, em.emp_no, em.first_name,
  em.last_name, em.gender
FROM
  salaries AS sa
  JOIN
  employees AS em
ON
  sa.emp_no = em.emp_no
ORDER BY
  sa.salary DESC
LIMIT 10;

resulting in:

+--------+--------+------------+-----------+--------+
| salary | emp_no | first_name | last_name | gender |
+--------+--------+------------+-----------+--------+
| 158220 |  43624 | Tokuyasu   | Pesch     | M      |
| 157821 |  43624 | Tokuyasu   | Pesch     | M      |
| 156286 | 254466 | Honesty    | Mukaidono | M      |
| 155709 |  47978 | Xiahua     | Whitcomb  | M      |
| 155513 | 253939 | Sanjai     | Luders    | M      |
| 155377 | 109334 | Tsutomu    | Alameldin | M      |
| 155190 | 109334 | Tsutomu    | Alameldin | M      |
| 154888 | 109334 | Tsutomu    | Alameldin | M      |
| 154885 | 109334 | Tsutomu    | Alameldin | M      |
| 154459 |  80823 | Willard    | Baca      | M      |
+--------+--------+------------+-----------+--------+
10 rows in set (2.05 sec)

which has the undesirable multiple entries for some employees.

JOINs: INNER, LEFT and RIGHT

Now that we have some sense of the spirit of the JOIN from the above example let’s consider the types of JOINs available in MySQL. A useful set of reference pages are available at w3schools: SQL JOIN, SQL LEFT JOIN and SQL RIGHT JOIN– take a look at those for more examples. Also, as I mentioned in previous posts, I found the new book Jump Start MySQL very clear and helpful.

First I will create two VIEWs, based on the employees database, that are small and will help show the difference between the types of JOINs:

CREATE VIEW small_salaries AS
SELECT
  *
FROM
  salaries_max
WHERE
  emp_no IN (254466, 47978, 253939);

and

CREATE VIEW small_employees AS
SELECT
  emp_no, first_name, gender
FROM
  employees
WHERE
  emp_no IN (254466, 47978, 237542);

After creating these VIEWs we can see what’s in them with a simple SELECT

mysql> SELECT * FROM small_salaries;
+--------+--------+
| emp_no | salary |
+--------+--------+
|  47978 | 155709 |
| 253939 | 155513 |
| 254466 | 156286 |
+--------+--------+
3 rows in set (0.86 sec)

mysql> SELECT * FROM small_employees;
+--------+------------+--------+
| emp_no | first_name | gender |
+--------+------------+--------+
|  47978 | Xiahua     | M      |
| 237542 | Weicheng   | F      |
| 254466 | Honesty    | M      |
+--------+------------+--------+
3 rows in set (0.00 sec)

The import difference between these VIEWs is that small_employees does not have emp_no 253939 and small_salaries does not have emp_no 237542. This difference is by design and will allow us to see the result of different types of JOINs.

To help in visually understanding the JOINs I will use Venn diagrams that show the emp_no‘s in the two VIEWs:

Venn diagram for two VIEWs

From this figure we can see that emp_no‘s 254466, 47978 and 253939 are all in small_salaries– this is indicated by the fact that all these numbers are inside the blue circle (right). Similarly, we can see that emp_no‘s 254466, 47978 and 237542 are in small_employees– indicated by the fact that these number are inside the green circle (left). Because emp_no‘s 254466 and 47978 are inside both circles, we know that these numbers appear in both VIEWs. However, 237542 and 253939 appear in only one VIEW and this is visually reflected by the fact that the numbers appear in only one of the circles, green or blue, not both. In the examples below, I’ll highlight regions of the Venn diagram to indicate the emp_no‘s returned by each of the JOIN types.

JOIN or INNER JOIN

The JOIN is an INNER JOIN and returns rows that are in both TABLEs or VIEWs. So, using this command:

SELECT
  sem.*, ssa.*
FROM
  small_employees AS sem
JOIN
  small_salaries AS ssa
ON
  sem.emp_no = ssa.emp_no;

or, this one:

SELECT
  sem.*, ssa.*
FROM
  small_employees AS sem
INNER JOIN
  small_salaries AS ssa
ON
  sem.emp_no = ssa.emp_no;

will produce:

+--------+------------+--------+--------+--------+
| emp_no | first_name | gender | emp_no | salary |
+--------+------------+--------+--------+--------+
|  47978 | Xiahua     | M      |  47978 | 155709 |
| 254466 | Honesty    | M      | 254466 | 156286 |
+--------+------------+--------+--------+--------+
2 rows in set (1.04 sec)

In this example, only 47978 and 254466 are in both VIEWs so we get these rows. Notice that the first three columns are information from small_employees and the last two columns are from small_salaries– this ordering comes from the use of SELECT sem.*, ssa.* in both queries. I note this here because this ordering is important for understanding the other JOINs.

Finally, we visualize the INNER JOIN using a Venn diagram, as introduced above. Here the overlap of the circles reflects the INNER JOIN and shows emp_no‘s that are returned:

The (INNER) JOIN

LEFT OUTER JOIN

Next, we try the LEFT OUTER JOIN (or LEFT JOIN). Both:

SELECT
  sem.*, ssa.*
FROM
  small_employees AS sem
LEFT OUTER JOIN
  small_salaries AS ssa
ON
  sem.emp_no = ssa.emp_no;

and:

SELECT
  sem.*, ssa.*
FROM
  small_employees AS sem
LEFT JOIN
  small_salaries AS ssa
ON
  sem.emp_no = ssa.emp_no;

produce:

+--------+------------+--------+--------+--------+
| emp_no | first_name | gender | emp_no | salary |
+--------+------------+--------+--------+--------+
|  47978 | Xiahua     | M      |  47978 | 155709 |
| 237542 | Weicheng   | F      |   NULL |   NULL |
| 254466 | Honesty    | M      | 254466 | 156286 |
+--------+------------+--------+--------+--------+
3 rows in set (1.05 sec)

In the LEFT JOIN all rows in the first– this makes it the left– TABLE or VIEW are returned. In this example, small_employees is the first (left) VIEW, as defined in the FROM section of the query. So, all of its rows are returned. However, the emp_no 237542 does not appear in the second (right) VIEW small_salaries so NULLs appear in the second row. The NULLs appear in the fourth and fifth columns, corresponding to the small_salaries VIEW – this is why the column ordering is important to remember.

Using the Venn diagram, we can show the results of the LEFT JOIN as:

The LEFT (OUTER) JOIN

where the highlighted emp_no‘s are returned.

RIGHT OUTER JOIN

Finally, we try the RIGHT OUTER JOIN (or RIGHT JOIN). Again, both:

SELECT
  sem.*, ssa.*
FROM
  small_employees AS sem
RIGHT OUTER JOIN
  small_salaries AS ssa
ON
  sem.emp_no = ssa.emp_no;

and:

SELECT
  sem.*, ssa.*
FROM
  small_employees AS sem
RIGHT JOIN
  small_salaries AS ssa
ON
  sem.emp_no = ssa.emp_no;

produce:

+--------+------------+--------+--------+--------+
| emp_no | first_name | gender | emp_no | salary |
+--------+------------+--------+--------+--------+
|  47978 | Xiahua     | M      |  47978 | 155709 |
|   NULL | NULL       | NULL   | 253939 | 155513 |
| 254466 | Honesty    | M      | 254466 | 156286 |
+--------+------------+--------+--------+--------+
3 rows in set (0.87 sec)

In this case the RIGHT JOIN returns all rows in the right (second) TABLE or VIEW. For our example the second (right) VIEW is small_salaries so the NULLs appear in the first three columns, corresponding to the small_employees VIEW. The Venn diagram nicely visualizes the RIGHT JOIN as:

The RIGHT (OUTER) JOIN

Summing Up

So, that’s it. Hopefully JOINs of all types make more sense and you found the Venn diagrams a useful tool for visualizing the results of the different JOIN types. As always, corrections, comments and questions are welcome below.

Tags

api [1]   arduino [1]   audio [2]   audio features [1]   babel [1]   Bayesian [7]   Beta [1]   blog setup [1]   bootstrap [1]   bottleneck [1]   c++ [1]   caret [1]   cmpy [1]   conditional probability [6]   coursera [1]   coursera intro to data science [3]   css [1]   cython [1]   d3 [2]   decision trees [2]   diy [1]   dropbox [1]   dsp [1]   e1071 [1]   essentia [1]   garmin [1]   geojson [1]   ggplot2 [1]   gis [2]   git [1]   gnuplot [1]   graphs [1]   html5 [1]   igraph [1]   ipython [1]   javascript [7]   joint probability [6]   json [1]   LaTeX [2]   LDA [1]   Lea [2]   machine learning [3]   marginal probability [6]   matplotlib [1]   meteor [2]   mir [1]   MongoDB [3]   music [2]   my python setup [5]   my ubuntu setup [10]   mysql [3]   networks [1]   networkx [1]   nodejs [5]   npm [3]   numexpr [1]   numpy [1]   octave [1]   Open Oakland [2]   openpyxl [1]   pandas [3]   patsy [1]   pip [2]   pweave [1]   pygraphviz [1]   pymc [1]   PySoundFile [2]   python [15]   Python [1]   python 2.7 [5]   python 3.4 [2]   pyyaml [1]   qgis [1]   R [1]   randomForest [1]   restview [1]   resume [1]   rpart [1]   running [1]   scikit-learn [3]   scipy [1]   screen [1]   server setup [1]   shapefile [1]   social networks [1]   Socrata [1]   sound [2]   sphinx [1]   sql [4]   sqlite3 [1]   ssh [1]   ssh keys [1]   statsmodels [1]   supervised learning [2]   sympy [1]   tableau [1]   tinkerer [2]   topic models [1]   tree [1]   ubuntu 14.04 [13]   Ubuntu 14.04 [3]   ubuntu 16.04 [4]   vim [2]   virtualbox [1]   virtualenv [4]   virtualenvwrapper [3]   VPS [1]   vundle [1]   webpack [1]   yaml [1]