The CUME_DIST function
Definition
The CUME_DIST function in SQL Server 2012 calculates the cumulative distribution of a value within a set of values. It determines the relative position of a value in the data set based on its sorted order.
If the data set is sorted in ascending order, CUME_DIST calculates the ratio of rows with values less than or equal to the current value to the total number of rows. If the data set is sorted in descending order, it calculates the ratio of rows with values greater than or equal to the current value to the total number of rows.
Usage example
Run the following code to create a sample data set.
-- Declare a table variable to store data
DECLARE @analytic TABLE(
name varchar(35) , -- Name
dept varchar(35), -- Department
salary money -- Salary
)
-- Insert data into the table variable
INSERT INTO @analytic
VALUES
--bd department
('andy01','bd',15000),
('andy02','bd',12000),
('andy03','bd',12000),
('andy04','bd',10000),
('andy05','bd',8000),
--ca department
('andy06','ca',20000),
('andy07','ca',18000),
('andy08','ca',18000),
('andy09','ca',15000),
('andy10','ca',12000),
('andy11','ca',12000),
('andy12','ca',10000),
('andy13','ca',8000),
('andy14','ca',8000),
('andy15','ca',8000)
-- Use the CUME_DIST function to calculate the cumulative distribution
SELECT
dept, name, salary,
CUME_DIST() OVER(PARTITION BY dept ORDER BY salary) AS cume_dist_
FROM @analytic
ORDER BY dept, salary DESC
In the bd department, the CUME_DIST() values for ascending salaries are: 8000→0.2, 10000→0.4, 12000→0.8 (for two rows), and 15000→1.0. In the ca department, the values are: 8000→0.3 (for three rows), 10000→0.4, 12000→0.6 (for two rows), 15000→0.7, 18000→0.9 (for two rows), and 20000→1.0. Rows with the same salary have the same cumulative distribution value.
Example analysis
In the preceding example, the CUME_DIST() window function calculates the cumulative distribution of each employee's salary within their department. For instance, for employee andy02 in the bd department, with a salary of 12000, there are four rows with a salary less than or equal to 12000. Because the department has five rows, the CUME_DIST() value is 4/5 = 0.8. The same logic applies to all other rows.
The LAST_VALUE function
Definition
The LAST_VALUE function in SQL Server 2012 returns the last value in an ordered set.
Usage example
Run the following code to create a sample data set.
DECLARE
@analytic TABLE(
name varchar(35) , -- Name
dept varchar(35), -- Department
salary money , -- Salary
hiredate date -- Hire Date
)
INSERT INTO @analytic
VALUES
--bd department
('andy01','bd',15000,'2002-01-09'),
('andy02','bd',12000,'2003-01-09'),
('andy03','bd',12000,'2003-02-09'),
('andy04','bd',10000,'2005-05-09'),
('andy05','bd',8000,'2003-06-09'),
--ca department
('andy06','ca',20000,'2003-01-09'),
('andy07','ca',18000,'2005-02-09'),
('andy08','ca',18000,'2005-03-09'),
('andy09','ca',15000,'2004-01-09'),
('andy10','ca',12000,'2003-06-09'),
('andy11','ca',12000,'2002-09-09'),
('andy12','ca',10000,'2003-07-09'),
('andy13','ca',8000,'2003-08-09'),
('andy14','ca',8000,'2003-11-09'),
('andy15','ca',8000,'2003-01-09')
SELECT
dept,name ,salary,hiredate,
LAST_VALUE(hiredate) OVER(PARTITION BY dept ORDER BY salary) AS last_value_
FROM @analytic
In the query result, records for each department are sorted by salary in ascending order. The last_value_ column shows the hiredate value from the last row in the current window frame. For example, in the bd department, the two records with a salary of 12000 (andy02 and andy03) both have a last_value_ of 2003-02-09. The record with the highest salary, andy01 (15000), has a last_value_ of 2002-01-09. The ca department follows the same logic, where records with the same salary share the hiredate of the last record in that salary group.
Example analysis
The query returns each employee's department (dept), name, salary, and hire date (hiredate), along with the last_value_. The results may seem unexpected.
This behavior occurs because by default, LAST_VALUE operates on a window frame defined as RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. To get the last value from the entire partition, you must explicitly specify a frame that covers all rows, such as ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING.
The FIRST_VALUE function
Definition
The FIRST_VALUE function in SQL Server 2012 returns the first value in an ordered set.
Based on their names, you might assume FIRST_VALUE is the opposite of LAST_VALUE. However, this is not the case.
Usage example
Run the following code to create a sample data set.
DECLARE
@analytic TABLE(
name varchar(35) ,
dept varchar(35),
salary money ,
hiredate date
)
INSERT INTO @analytic
VALUES
--bd
('andy01','bd',15000,'2002-01-09'),
('andy02','bd',12000,'2003-01-09'),
('andy03','bd',12000,'2003-02-09'),
('andy04','bd',10000,'2005-05-09'),
('andy05','bd',8000,'2003-06-09'),
--ca
('andy06','ca',20000,'2003-01-09'),
('andy07','ca',18000,'2005-02-09'),
('andy08','ca',18000,'2005-03-09'),
('andy09','ca',15000,'2004-01-09'),
('andy10','ca',12000,'2003-06-09'),
('andy11','ca',12000,'2002-09-09'),
('andy12','ca',10000,'2003-07-09'),
('andy13','ca',8000,'2003-08-09'),
('andy14','ca',8000,'2003-11-09'),
('andy15','ca',8000,'2003-01-09')
SELECT
dept,name ,salary,hiredate,
FIRST_VALUE(name) OVER(PARTITION BY dept ORDER BY salary) AS first_value_
FROM @analytic
The query returns 15 rows. For the 5 rows where dept is bd, the first_value_ column is always andy05. For the 10 rows where dept is ca, the first_value_ column is always andy13. This is because andy05 and andy13 are the name values from the first row of each partition when sorted by salary in ascending order.
Example analysis
In this example, the OVER clause partitions the data by dept then sorts each partition by salary. The FIRST_VALUE function then retrieves a value from the first row of this sorted partition. To see this more clearly, you can change the argument to FIRST_VALUE(hiredate) to view the hire date of the first employee in each department's sorted list.
Unlike LAST_VALUE, the behavior of FIRST_VALUE is intuitive because the default window frame (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) always includes the first row of the partition. Therefore, FIRST_VALUE and LAST_VALUE are not simple opposites; their behavior differs significantly due to their interaction with the default window frame.
The LEAD function
Definition
In SQL Server 2012, the LEAD(col, n, default) function accesses data from a subsequent row in the same result set without requiring a self-join. This allows you to compare a value in the current row with a value in a subsequent row.
-
col: The column or expression from which to retrieve the value. -
n: The offset, or number of rows to look ahead from the current row. The default is 1 (the next row). -
default: The value to return if the offset is outside the partition. If omitted,NULLis returned.
Usage example
Run the following code to create a sample data set.
DECLARE
@analytic TABLE(
name varchar(35) ,
dept varchar(35),
salary money ,
hiredate date
)
INSERT INTO @analytic
VALUES
--bd department
('andy01','bd',15000,'2002-01-09'),
('andy02','bd',12000,'2003-01-09'),
('andy03','bd',12000,'2003-02-09'),
('andy04','bd',10000,'2005-05-09'),
('andy05','bd',8000,'2003-06-09'),
--ca department
('andy06','ca',20000,'2003-01-09'),
('andy07','ca',18000,'2005-02-09'),
('andy08','ca',18000,'2005-03-09'),
('andy09','ca',15000,'2004-01-09'),
('andy10','ca',12000,'2003-06-09'),
('andy11','ca',12000,'2002-09-09'),
('andy12','ca',10000,'2003-07-09'),
('andy13','ca',8000,'2003-08-09'),
('andy14','ca',8000,'2003-11-09'),
('andy15','ca',8000,'2003-01-09')
SELECT
dept,name,hiredate,salary,
LEAD(salary,1,0) OVER(PARTITION BY dept ORDER BY salary) AS lead_, -- Use the LEAD function to get the next salary value within each department, sorted by salary. Return 0 if no next value exists.
(LEAD(salary,1,0) OVER(PARTITION BY dept ORDER BY salary)-salary) AS diff_salary -- Calculate the difference between the current salary and the next salary.
FROM @analytic
The query result is partitioned by dept and sorted by salary in ascending order. For the bd department: andy05(salary=8000, lead_=10000, diff_salary=2000), andy04(10000, 12000, 2000), andy02(12000, 12000, 0), andy03(12000, 15000, 3000), andy01(15000, 0, -15000). For the ca department: andy13(8000, 8000, 0), andy14(8000, 8000, 0), andy15(8000, 10000, 2000), andy12(10000, 12000, 2000), andy10(12000, 12000, 0), andy11(12000, 15000, 3000), andy09(15000, 18000, 3000), andy07(18000, 18000, 0), andy08(18000, 20000, 2000), andy06(20000, 0, -20000).
Example analysis
This example demonstrates how to use the LEAD function to analyze salary progression within each department. By partitioning the data by dept and ordering by salary, you can compare a row's salary with that of the subsequent row. Calculating the difference is a powerful way to identify salary gaps between employees.