Analytic functions are a set of special functions provided by Oracle to make our life easier. In today' World, Business, on a day to day basis, come up,very commonly, with requests of data, the retrieval of which, using commonplace sql becomes a brainstroming exercise and usually results in a poorly performing sql query. Some of these queries are
TOP-N queries, pivoting resultsets, moving averages, runnig totals, percentages within group etc.
With the help of analytic functions, we can achieve these tasks without much difficulty or complex designs and yet produce an optimally performing query.
Let us start with a simple example of using an analytic function and see how it scores over conventional sql.
The following is a query which calculates the department wise running total of salaries.
Code :
break on deptno skip 2
SELECT ename, deptno, sal,
SUM (sal) OVER (ORDER BY deptno, ename) running_total,
SUM (sal) OVER (PARTITION BY deptno ORDER BY ename) department_total,
ROW_NUMBER () OVER (PARTITION BY deptno ORDER BY ename) seq
FROM emp
ORDER BY deptno, ename;
ENAME DEPTNO SAL RUNNING_TOTAL DEPARTMENT_TOTAL SEQ
---------- ---------- ---------- ------------- ---------------- ----------
CLARK 10 2450 2450 2450 1
KING 5000 7450 7450 2
MILLER 1300 8750 8750 3
ADAMS 20 1100 9850 1100 1
FORD 3000 12850 4100 2
JONES 2975 15825 7075 3
SCOTT 3000 18825 10075 4
SMITH 800 19625 10875 5
ALLEN 30 1600 21225 1600 1
BLAKE 2850 24075 4450 2
JAMES 950 25025 5400 3
MARTIN 1250 26275 6650 4
TURNER 1500 27775 8150 5
WARD 1250 29025 9400 6In the above code, we were able to compute a RUNNING_TOTAL for the entire query. This was done using the entire ordered resultset, via SUM(SAL) OVER (ORDER BY DEPTNO, ENAME). We were also able to compute a running total within each department, a total that would be reset at the beginning of the next department. The PARTITION BY DEPTNO in that SUM(SAL) caused this to happen ‐ a partitioning clause was specified in the query in order to break the data up into groups.
The ROW_NUMBER() function is used to sequentially number the rows returned in each group, according to our ordering criteria (a SEQ column was added to in order to display this position). So, we see that SCOTT is the fourth row in department 20, when ordered by ENAME. This ROW_NUMBER() feature has many uses elsewhere, for example to transpose or pivot resultsets.
The architecture of the function may be a bit complex at the first look, but it is really easy once you understand it and just give the flavour of how efficient these queries are we create a conventional query that will give the same output and run both of them against a large table with 2000 rows and see the statistics.
Code :
CREATE TABLE t AS
SELECT object_name ename, MOD (object_id, 50) deptno, object_id sal
FROM all_objects
WHERE ROWNUM <= 2000
Table created.
CREATE INDEX t_idx ON t(deptno,ename);
set autotrace traceonly
SELECT ename, deptno, sal,
SUM (sal) OVER (ORDER BY deptno, ename) running_total,
SUM (sal) OVER (PARTITION BY deptno ORDER BY ename) department_total,
ROW_NUMBER () OVER (PARTITION BY deptno ORDER BY ename) seq
FROM t emp
ORDER BY deptno, ename;
2000 rows selected.
Elapsed: 00:00:00.11
Execution Plan
----------------------------------------------------------
Plan hash value: 2335850315
---------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
---------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2000 | 46000 | 4 (0)| 00:00:01 |
| 1 | WINDOW SORT | | 2000 | 46000 | 4 (0)| 00:00:01 |
| 2 | TABLE ACCESS FULL| T | 2000 | 46000 | 4 (0)| 00:00:01 |
---------------------------------------------------------------------------
Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
11 consistent gets
0 physical reads
0 redo size
94487 bytes sent via SQL*Net to client
1859 bytes received via SQL*Net from client
135 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
2000 rows processed
SELECT ename, deptno, sal,
(SELECT SUM (sal)
FROM t e2
WHERE e2.deptno < emp.deptno
OR (e2.deptno = emp.deptno AND e2.ename <= emp.ename))
running_total,
(SELECT SUM (sal)
FROM t e3
WHERE e3.deptno = emp.deptno
AND e3.ename <= emp.ename) department_total,
(SELECT COUNT (ename)
FROM t e3
WHERE e3.deptno = emp.deptno AND e3.ename <= emp.ename) seq
FROM t emp
ORDER BY deptno, ename;
2000 rows selected.
Elapsed: 00:00:01.03
Execution Plan
----------------------------------------------------------
Plan hash value: 92139442
--------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 2000 | 46000 | 5 (20)| 00:00:01 |
| 1 | SORT AGGREGATE | | 1 | 23 | | |
|* 2 | TABLE ACCESS FULL | T | 102 | 2346 | 4 (0)| 00:00:01 |
| 3 | SORT AGGREGATE | | 1 | 23 | | |
| 4 | TABLE ACCESS BY INDEX ROWID| T | 2 | 46 | 3 (0)| 00:00:01 |
|* 5 | INDEX RANGE SCAN | T_IDX | 1 | | 2 (0)| 00:00:01 |
| 6 | SORT AGGREGATE | | 1 | 20 | | |
|* 7 | INDEX RANGE SCAN | T_IDX | 2 | 40 | 2 (0)| 00:00:01 |
| 8 | SORT ORDER BY | | 2000 | 46000 | 5 (20)| 00:00:01 |
| 9 | TABLE ACCESS FULL | T | 2000 | 46000 | 4 (0)| 00:00:01 |
--------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
2 - filter("E2"."DEPTNO"<:B1 OR "E2"."DEPTNO"=:B2 AND "E2"."ENAME"<=:B3)
5 - access("E3"."DEPTNO"=:B1 AND "E3"."ENAME"<=:B2)
7 - access("E3"."DEPTNO"=:B1 AND "E3"."ENAME"<=:B2)
Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
52811 consistent gets
0 physical reads
0 redo size
94487 bytes sent via SQL*Net to client
1859 bytes received via SQL*Net from client
135 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
2000 rows processedJust see the difference ! The analytical function took fewer resources and less amount of time.
So let's dwelve into the anatomy of these functions and let us see how they work.
Types of Analytic Functions
There are basically 26 types of analytic functions. They fall into three categories.
Ranking Functions :These are typically used for TOP N type of queries.An example of this type of function is the row_number() function we have just used.
Windowing Functions :These are used to calculate different type of aggregates.These act on a window or group of data. sum() is one such example we have used.The others are count(),avg(),min(),max() etc.
Reporting functions:These are very similer to the Windowing functions except for the fact that these functions act on the whole group or partition of data as compared to the windowing functions acting on the window of the data. For example, in our query we used the windowing function :
sum(sal) over(partition by deptno order by ename)
If we remove the order by clause from the above we get a reporting function.
sum(sal) over(partition by deptno).
In the absence of the ORDER BY, the function is applied to every row in the group. With an ORDER BY clause, it is applied to a window.
Architecture of Analytic Function.
Syntax
FUNCTION_NAME(<argument>,<argument>,…)
OVER
(<Partition‐Clause> <Order‐by‐Clause> <Windowing Clause>)
Over is the key word that identifies the function as analytic.Without this keyword there would no difference between the 'analytic' sum() and the 'aggregate' sum().
The clause that follows the over keyword, defines the data set that the function will act upon.
Partition-Clause is an optional partitioning clause.It is use to define the resultset into groups of data the function will act upon. If this clause is ommitted then the function will applied to the entire resultset.
Order by-clause is a clause mandatory for some analytic funtions and optional for others.Functions like Lag and Lead mandates the use of order by clause as they need an ordered set data to access the next and the previous rows.This clause is mandatory when using a windowing function as explained above.
Windowing-clause is again an optional clause and is a bit confusing at first.This gives as a way to define a sliding or anchored window of data, on which the function will operate, within a group.There are different keywords associated with this clause such as 'RANGE BETWEEN' 'UNBOUNDED PRECEDING' 'CURRENT ROW' etc.
We will dwelve more into this clause on my next thread.
List of Analytic Functions
AVG
Returns a running average
CORR
Returns the coefficient of correlation of a set of number pairs
COUNT
Returns a running count of all records or by partition
Return a frequency distribution
COVAR_POP
Returns the population covariance of a set of number pairs
COVAR_SAMP
Returns the sample covariance of a set of number pairs
CUME_DIST
Returns the cumulative distribution of a value in a group of values
DENSE_RANK
Ranks items in a group leaving no gaps in ranking sequence when there are ties
FIRST
Returns the row ranked first using DENSE_RANK
FIRST_VALUE
Returns the first value in an ordered set of values. If the first value in the set is null, then the function returns NULL unless you specify IGNORE NULLS
IGNORE NULLS
LAG
LAG provides access to more than one row of a table at the same time without a self-join. Given a series of rows returned from a query and a position of the cursor, LAG provides access to a row at a given physical offset prior to that position
LAST
Returns the row ranked last using DENSE_RANK
LAST_VALUE
Returns the last value in an ordered set of values. If the last value in the set is null, then the function returns NULL unless you specify IGNORE NULLS. This setting is useful for data densification. If you specify IGNORE NULLS, then LAST_VALUE returns the first non-null value in the set, or NULL if all values are null.
LEAD
LEAD provides access to a row at a given physical offset beyond that position
MAX
Returns the maximum value by
partition
MIN
Returns the minimum value by partition
NTILE
Divides an ordered data set into a number of buckets indicated by expr and assigns the appropriate bucket number to each row. The buckets are numbered 1 through expr. The expr value must resolve to a positive constant for each partition.
OVER PARTITION BY
Partitions resultset
PERCENT_RANK
For a row r, PERCENT_RANK calculates the rank of r minus 1, divided by 1 less than the number of rows being evaluated (the entire query result set or a partition).
PERCENTILE_CONT
Inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification. Nulls are ignored in the calculation.
PERCENTILE_DISC
An inverse distribution function that assumes a discrete distribution model. It takes a percentile value and a sort specification and returns an element from the set. Nulls are ignored in the calculation.
RANK
Calculates the rank of a value in a group of values
RATIO_TO_REPORT
Computes the ratio of a value to the sum of a set of values. If expr evaluates to null, then the ratio-to-report value also evaluates to null.
REGR_ (Linear Regression) Functions
more on this on upcoming threads
STDDEV
Returns the sample standard deviation of an expression
STDDEV_POP
Computes the population standard deviation and returns the square root of the population variance
STDDEV_SAMP
Computes the cumulative sample standard deviation and returns the square root of the sample variance.
SUM
Computes the cumulative sample running sum
VAR_POP
Returns the population variance of a set of numbers
VAR_SAMP
Returns the sample variance of a set of numbers
VARIANCE
Returns the variance of an expression
....To be continued in:
Analytic Functions Part II, The Windowing Clause
Analytic Functions III : Conclusion
Cheers


