Window functions are an advanced feature offered by MySQL to improve the execution performance of queries. These functions act on a group of rows related to the targeted row called window frame. Unlike a GROUP BY
clause, Window functions do not collapse the rows to a single row — preserving the details of each row instead. This new approach to querying data is invaluable in data analytics and business intelligence.
Aggregate functions are used to return a single scalar value from a set of rows. Some prominent aggregate functions available in MySQL are SUM
, MIN
, MAX
, AVG
, and COUNT
. We can use these functions combined with the GROUP BY
clause to get an aggregated value.
In contrast, window functions return a corresponding value for each of the targeted rows. These targeted rows, or the set of rows on which the window function operates, is called a window frame. Window functions use the OVER
clause to define the window frame. A window function can include an aggregate function as a part of its SQL statement by using the OVER
clause instead of GROUP BY
.
The following are the specialized window functions MySQL offers:
Please refer to the official MySQL documentation for in-depth information regarding each of the above functions.
Now let’s see exactly how to utilize some of the Window functions mentioned above.
I will be using the latest MySQL server instance with Arctype as the SQL client. Following is the structure of our sample database:
We can use the following SQL script to create the table structure with the Arctype client:
CREATE TABLE departments (
dep_id INT (10) AUTO_INCREMENT PRIMARY KEY,
dep_name VARCHAR (30) NOT NULL,
dep_desc VARCHAR (150) NULL
);
CREATE TABLE employees (
emp_id INT (10) AUTO_INCREMENT PRIMARY KEY,
first_name VARCHAR (20) NOT NULL,
last_name VARCHAR (25) NOT NULL,
email VARCHAR (100) NOT NULL,
phone VARCHAR (20) DEFAULT NULL,
salary DECIMAL (8, 2) NOT NULL,
dep_id INT (10) NOT NULL,
FOREIGN KEY (dep_id) REFERENCES
departments (dep_id)
ON DELETE CASCADE
ON UPDATE CASCADE
);
CREATE TABLE evaluations (
eval_id INT (10) AUTO_INCREMENT PRIMARY KEY,
emp_id INT (10) NOT NULL,
eval_date DATETIME NOT NULL,
eval_name VARCHAR (30) NOT NULL,
notes TEXT DEFAULT NULL,
marks DECIMAL (4,2) NOT NULL,
FOREIGN KEY (emp_id) REFERENCES employees (emp_id)
);
CREATE TABLE overtime (
otime_id INT (10) AUTO_INCREMENT PRIMARY KEY,
emp_id INT (10) NOT NULL,
otime_date DATETIME NOT NULL,
no_of_hours DECIMAL (4,2) NOT NULL,
FOREIGN KEY (emp_id) REFERENCES employees (emp_id)
);
After creating the tables, we can insert some sample data into each table using proper relationships. Now, let’s get back into Window functions.
ROW_NUMBER()
In our sample database, the employee table is arranged according to the emp_id
. However, if we need to get a separate sequential number assigned to each row, then we can use the ROW_NUMBER()
window function.
In the following example, we are using the ROW_NUMBER()
function while ordering each row by salary amount.
We will get the following result if we query just using the GROUP BY
clause.
SELECT * FROM employees ORDER BY salary DESC;
We can see that a sequential number has been assigned to each row after associating an individual row number using the ROW_NUMBER()
function:
SELECT
ROW_NUMBER() OVER( ORDER BY salary DESC) `row_num`,
first_name,
last_name,
salary
FROM
employees;
RESULT:
Another usage of the ROW_NUMBER
function is for pagination. For example, suppose we need to display the employee details in a paginated format, with each page consisting of just five records. This can be achieved through the ROW_NUMBER
function and WHERE
clause to point to the desired recordset:
WITH page_result AS (
SELECT
ROW_NUMBER() OVER(
ORDER BY salary DESC
) `row_num`,
first_name,
last_name,
salary
FROM
employees
)
SELECT * FROM page_result WHERE `row_num` BETWEEN 6 AND 10
RESULT:
Using the PARTITION BY
clause enables us to partition employees based on the department. The following query can be used to get the salary scale of employees partitioned by each department.
SELECT
dep_name,
ROW_NUMBER() OVER (
PARTITION BY dep_name
ORDER BY salary DESC
) `row_num`,
first_name,
last_name,
salary,
email
FROM
employees AS emp
INNER JOIN departments AS dep
ON dep.dep_id = emp.dep_id
RESULT:
We can further extend this query to get the highest-paid employee of each department by extracting the row where row_num
is equal to one. (As we have partitioned employees by each department, the ROW_NUMBER
starts a new sequence for each partition.)
SELECT
ROW_NUMBER() OVER (
ORDER BY dep_name DESC
) `row_num`,
dep_name,
first_name,
last_name,
salary,
email
FROM
(
SELECT
dep_name,
ROW_NUMBER() OVER (
PARTITION BY dep_name
ORDER BY salary DESC
) `row_num`,
first_name,
last_name,
salary,
email
FROM
employees AS emp
INNER JOIN departments AS dep
ON dep.dep_id = emp.dep_id
) AS highest_paid
WHERE
`row_num` = 1
RESULT:
LAG()
The LAG
function enables users to access preceding rows using a specified offset. This kind of function is useful when we need to compare the values of the preceding rows with the current row. In our data set, we have a table named evaluations which include yearly employee evaluations. Using LAG
, we can identify the performance of each employee and determine if they have improved or not.
First, let us write a query against the ‘evaluations
‘ table to identify the basic output of the LAG
function. In that query, we will partition employees by emp_id
(employee id) and order that partition by the eval_date
(evaluation date).
SELECT
emp_id,
DATE(eval_date) AS `date`,
eval_name,
marks,
LAG(marks) OVER (
PARTITION BY emp_id ORDER BY eval_date
) AS previous
FROM
evaluations;
RESULT:
From the above result set, we can see that the LAG function returns the corresponding previous value for the ‘marks
‘ column. Then we need to further refine this data set to get a numerical percentage to identify the year-over-year employee performance.
WITH emp_evaluations AS (
SELECT
emp_id,
YEAR(eval_date) AS `year`,
eval_name,
marks,
LAG(marks,1,0) OVER (
PARTITION BY emp_id
ORDER BY eval_date
) AS previous
FROM
evaluations
)
SELECT
emp_id,
`year`,
eval_name,
marks,
previous,
IF (previous = 0, '0%',
CONCAT(ROUND((marks - previous)*100/previous, 2), '%')
) AS difference
FROM
emp_evaluations;
In the above query, we have defined a common table expression (CTE) to obtain the results of the initial LAG
query called emp_evaluations
. There are a couple of differences from the original query.
One is that here, we are extracting only the year value from the eval_date DATETIME
field, and the other is that we have defined an offset and a default value (1
as the offset and 0
as the default value) in the LAG
function. This default value will be populated when there are no previous rows, such as the beginning of each partition.
Then we query the emp_evaluations result set to calculate the difference between the ‘marks
‘ and the ‘previous
‘ column for each row.
Here we have defined an IF
condition to identify empty previous values (previous = 0
) and show them as no difference (0%
) or otherwise calculate the difference. Without this IF
condition, the first row of each partition will be shown as a null
value. This query will provide the following formatted output as a result.
DENSE_RANK()
The DENSE_RANK
function can be used to assign ranks to rows in partitions without any gaps. If the targeted column has the same value in multiple rows, DENSE_RANK
will assign the same rank for each of those rows.
In the previous section, we identified the year-over-year performance of employees. Now let’s assume that we are offering a bonus to the most improved employee in each department. In that case, we can use DENSE_RANK
to assign a rank to the performance difference of employees.
First, let us modify the query in the LAG
function section to create a view from the resulting data set. As we simply need to query (SELECT
) the data here, a MySQL view would be an ideal solution. We have modified the SELECT
statement in emp_evaluations
to include the relevant department, first and last names by joining the evaluations
, employees
, and departments
tables.
CREATE VIEW emp_eval_view AS
WITH emp_evaluations AS (
SELECT
eval.emp_id AS `empid`,
YEAR(eval.eval_date) AS `eval_year`,
eval.eval_name AS `evaluation`,
eval.marks AS `mark`,
LAG(eval.marks,1,0) OVER (
PARTITION BY eval.emp_id
ORDER BY eval.eval_date
) AS `previous`,
dep.dep_name AS `department`,
emp.first_name AS `first_name`,
emp.last_name AS `last_name`
FROM
evaluations AS eval
INNER JOIN employees AS emp ON emp.emp_id = eval.emp_id
INNER JOIN departments AS dep ON dep.dep_id = emp.dep_id
)
SELECT
empid,
first_name,
last_name,
department,
`eval_year`,
evaluation,
mark,
previous,
IF (previous = 0, '0%',
CONCAT(ROUND((mark - previous)*100/previous, 2), '%')
) AS difference
FROM
emp_evaluations;
RESULT:
Then using this view (emp_eval_view
) we use the DENSE_RANK
function to assign a rank to each row partitioned by the department and ordered by the difference in a descending manner. Additionally, we only select records related to the specified year (`eval_year` = 2020
).
SELECT
empid,
first_name,
last_name,
department,
`eval_year`,
evaluation,
difference AS 'improvement',
DENSE_RANK() OVER (
PARTITION BY Department
ORDER BY Difference DESC
) AS performance_rank
FROM
emp_eval_view
WHERE
`eval_year` = 2020
RESULT:
Finally, we can filter the above result set to identify the highest performing individual in each department by using the WHERE
clause to get the first ranking record (performance_rank = 1
), as shown below.
SELECT *
FROM (
SELECT
empid,
first_name,
last_name,
department,
`eval_year`,
evaluation,
difference AS 'improvement',
DENSE_RANK() OVER (
PARTITION BY Department
ORDER BY Difference DESC
) AS performance_rank
FROM
emp_eval_view
WHERE
`eval_year` = 2020
) AS yearly_performance_data
WHERE
performance_rank = 1
RESULT:
As we can see from the above result set, a business can use this DENSE_RANK
function to identify top-performing or underperforming employees and departments. These kinds of metrics are crucial for business intelligence processes, and all the credit goes to MySQL Windows functions.
FIRST_VALUE()
and LAST_VALUE()
to Get First and Last Values from a PartitionThe FIRST_VALUE
function enables users to get the first value from an ordered partition while LAST_VALUE
gets the opposite, the last value of a result set. These functions can be used for our data set to identify the employees who did the least and most overtime in each department.
FIRST_VALUE()
We can use the FIRST_VALUE
function to get the employees who did the least overtime in each respective department.
In the following SQL statement, we have defined a common table expression to calculate overtime done by each employee for each month using the SUM
aggregate function. Then using the FIRST_VALUE
window function, we are getting the concatenated details (first and last names with the overtime value) of the employee who did the least overtime in a specific department. This partitioning is done via the PARTITION BY
statement.
WITH overtime_details AS (
SELECT
MONTHNAME(otime.otime_date) AS `month`,
dep.dep_name AS `dep_name`,
emp.emp_id AS `emp_id`,
emp.first_name AS `first_name`,
emp.last_name AS `last_name`,
SUM(otime.no_of_hours) AS `overtime`
FROM
overtime AS otime
INNER JOIN employees AS emp ON emp.emp_id = otime.emp_id
INNER JOIN departments AS dep ON dep.dep_id = emp.dep_id
GROUP BY `month`, emp.emp_id
ORDER BY `month`, emp.emp_id ASC
)
SELECT
dep_name,
emp_id,
first_name,
last_name,
`month`,
overtime,
FIRST_VALUE (CONCAT(first_name,' ',last_name,' - ',overtime)) OVER (
PARTITION BY dep_name
ORDER BY overtime
) least_overtime
FROM
overtime_details;
This will provide a result set similar to the following, indicating the employee who did the least over time.
LAST_VALUE()
We can use the LAST_VALUE
window function to get the employee who did the most amount of overtime in each department. The syntax and the logic are identical to the FIRST_VALUE
SQL statement yet with the addition of a ‘frame clause’ to define a subset of the current partition where the LAST_VALUE
function needs to be applied.
We are using the:
RANGE BETWEEN
UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING
as the frame clause. This essentially informs the database engine that the frame starts at the first row and ends at the last row of the result set. (In our query, this applies to each partition)
WITH overtime_details AS (
SELECT
MONTHNAME(otime.otime_date) AS `month`,
dep.dep_name AS `dep_name`,
emp.emp_id AS `emp_id`,
emp.first_name AS `first_name`,
emp.last_name AS `last_name`,
SUM(otime.no_of_hours) AS `overtime`
FROM
overtime AS otime
INNER JOIN employees AS emp ON emp.emp_id = otime.emp_id
INNER JOIN departments AS dep ON dep.dep_id = emp.dep_id
GROUP BY `month`, emp.emp_id
ORDER BY `month`, emp.emp_id ASC
)
SELECT
dep_name,
emp_id,
first_name,
last_name,
`month`,
overtime,
LAST_VALUE (CONCAT(first_name,' ',last_name,' - ',overtime)) OVER (
PARTITION BY dep_name
ORDER BY overtime
RANGE BETWEEN
UNBOUNDED PRECEDING AND
UNBOUNDED FOLLOWING
) most_overtime
FROM
overtime_details;
This would provide us with the details of the employees who did the most overtime in each department.
Window functions in MySQL are a welcome addition to an already excellent database. In this article, we mainly covered how to use window functions with some practical examples. The next step is to dig even further into MySQL window functions and mix them with all the other available MySQL functionality to meet any business requirement.
Source: https://dzone.com/articles/use-mysql-window-functions-to-get-more-out-of-your
Department of Information Technologies: https://www.ibu.edu.ba/department-of-information-technologies/