__Analytic Functions : An introduction__

Analytic functions are a set of special functions provided by Oracle to make our life easier. In today's world, businesses routinely come up with requests for data whose retrieval, using commonplace SQL, becomes a brainstorming exercise and usually results in a poorly performing query. Some of these requests are

**TOP-N queries, pivoting resultsets, moving averages, running totals, percentages within a group, etc.**

With the help of analytic functions, we can achieve these tasks without much difficulty or complex designs and yet produce an optimally performing query.

Let us start with a simple example of using an analytic function and see how it scores over conventional sql.

The following is a query which calculates the department wise running total of salaries.

Code :

```
break on deptno skip 2

SELECT ename, deptno, sal,
       SUM (sal) OVER (ORDER BY deptno, ename) running_total,
       SUM (sal) OVER (PARTITION BY deptno ORDER BY ename) department_total,
       ROW_NUMBER () OVER (PARTITION BY deptno ORDER BY ename) seq
  FROM emp
 ORDER BY deptno, ename;

ENAME          DEPTNO        SAL RUNNING_TOTAL DEPARTMENT_TOTAL        SEQ
---------- ---------- ---------- ------------- ---------------- ----------
CLARK              10       2450          2450             2450          1
KING                        5000          7450             7450          2
MILLER                      1300          8750             8750          3

ADAMS              20       1100          9850             1100          1
FORD                        3000         12850             4100          2
JONES                       2975         15825             7075          3
SCOTT                       3000         18825            10075          4
SMITH                        800         19625            10875          5

ALLEN              30       1600         21225             1600          1
BLAKE                       2850         24075             4450          2
JAMES                        950         25025             5400          3
MARTIN                      1250         26275             6650          4
TURNER                      1500         27775             8150          5
WARD                        1250         29025             9400          6
```

In the above code, we were able to compute a RUNNING_TOTAL for the entire query. This was done using the entire ordered resultset, via SUM(SAL) OVER (ORDER BY DEPTNO, ENAME). We were also able to compute a running total within each department, a total that is reset at the beginning of the next department. The PARTITION BY DEPTNO in that SUM(SAL) caused this to happen: a partitioning clause was specified in the query in order to break the data up into groups.

The ROW_NUMBER() function is used to sequentially number the rows returned in each group, according to our ordering criteria (a SEQ column was added in order to display this position). So, we see that SCOTT is the fourth row in department 20, when ordered by ENAME. This ROW_NUMBER() feature has many uses elsewhere, for example to transpose or pivot resultsets.
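The syntax above is ANSI-standard, so the same query can be sketched outside Oracle. The following Python snippet replays it against an in-memory SQLite database (SQLite 3.25+ supports the same window-function syntax); the miniature EMP table is made up for illustration.

```python
import sqlite3

# Tiny stand-in for Oracle's EMP table; SQLite understands the same
# SUM() OVER (...) and ROW_NUMBER() OVER (...) syntax used above.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, deptno INTEGER, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?,?,?)",
                [("CLARK", 10, 2450), ("KING", 10, 5000),
                 ("ADAMS", 20, 1100), ("FORD", 20, 3000)])

rows = con.execute("""
    SELECT ename, deptno, sal,
           SUM(sal) OVER (ORDER BY deptno, ename)                 AS running_total,
           SUM(sal) OVER (PARTITION BY deptno ORDER BY ename)     AS department_total,
           ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY ename) AS seq
      FROM emp
     ORDER BY deptno, ename
""").fetchall()

for r in rows:
    print(r)
```

Note how DEPARTMENT_TOTAL and SEQ restart at ADAMS, the first row of department 20, while RUNNING_TOTAL keeps accumulating across the whole resultset.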

The syntax of the function may look a bit complex at first, but it is really easy once you understand it. To give a flavour of how efficient these queries are, we will create a conventional query that gives the same output, run both of them against a larger table with 2000 rows, and compare the statistics.

Code :

```
CREATE TABLE t
AS
SELECT object_name ename,
       MOD (object_id, 50) deptno,
       object_id sal
  FROM all_objects
 WHERE ROWNUM <= 2000;

Table created.

CREATE INDEX t_idx ON t(deptno, ename);

set autotrace traceonly

SELECT ename, deptno, sal,
       SUM (sal) OVER (ORDER BY deptno, ename) running_total,
       SUM (sal) OVER (PARTITION BY deptno ORDER BY ename) department_total,
       ROW_NUMBER () OVER (PARTITION BY deptno ORDER BY ename) seq
  FROM t emp
 ORDER BY deptno, ename;

2000 rows selected.

Elapsed: 00:00:00.11

Execution Plan
----------------------------------------------------------
Plan hash value: 2335850315

---------------------------------------------------------------------------
| Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |  2000 | 46000 |     4   (0)| 00:00:01 |
|   1 |  WINDOW SORT       |      |  2000 | 46000 |     4   (0)| 00:00:01 |
|   2 |   TABLE ACCESS FULL| T    |  2000 | 46000 |     4   (0)| 00:00:01 |
---------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
         11  consistent gets
          0  physical reads
          0  redo size
      94487  bytes sent via SQL*Net to client
       1859  bytes received via SQL*Net from client
        135  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
       2000  rows processed

SELECT ename, deptno, sal,
       (SELECT SUM (sal)
          FROM t e2
         WHERE e2.deptno < emp.deptno
            OR (e2.deptno = emp.deptno AND e2.ename <= emp.ename)) running_total,
       (SELECT SUM (sal)
          FROM t e3
         WHERE e3.deptno = emp.deptno AND e3.ename <= emp.ename) department_total,
       (SELECT COUNT (ename)
          FROM t e3
         WHERE e3.deptno = emp.deptno AND e3.ename <= emp.ename) seq
  FROM t emp
 ORDER BY deptno, ename;

2000 rows selected.

Elapsed: 00:00:01.03

Execution Plan
----------------------------------------------------------
Plan hash value: 92139442

--------------------------------------------------------------------------------------
| Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |       |  2000 | 46000 |     5  (20)| 00:00:01 |
|   1 |  SORT AGGREGATE              |       |     1 |    23 |            |          |
|*  2 |   TABLE ACCESS FULL          | T     |   102 |  2346 |     4   (0)| 00:00:01 |
|   3 |  SORT AGGREGATE              |       |     1 |    23 |            |          |
|   4 |   TABLE ACCESS BY INDEX ROWID| T     |     2 |    46 |     3   (0)| 00:00:01 |
|*  5 |    INDEX RANGE SCAN          | T_IDX |     1 |       |     2   (0)| 00:00:01 |
|   6 |  SORT AGGREGATE              |       |     1 |    20 |            |          |
|*  7 |   INDEX RANGE SCAN           | T_IDX |     2 |    40 |     2   (0)| 00:00:01 |
|   8 |  SORT ORDER BY               |       |  2000 | 46000 |     5  (20)| 00:00:01 |
|   9 |   TABLE ACCESS FULL          | T     |  2000 | 46000 |     4   (0)| 00:00:01 |
--------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("E2"."DEPTNO"<:B1 OR "E2"."DEPTNO"=:B2 AND "E2"."ENAME"<=:B3)
   5 - access("E3"."DEPTNO"=:B1 AND "E3"."ENAME"<=:B2)
   7 - access("E3"."DEPTNO"=:B1 AND "E3"."ENAME"<=:B2)

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
      52811  consistent gets
          0  physical reads
          0  redo size
      94487  bytes sent via SQL*Net to client
       1859  bytes received via SQL*Net from client
        135  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
       2000  rows processed
```

**Just see the difference! The analytic function took far fewer resources (11 consistent gets versus 52,811) and a fraction of the time (0.11 seconds versus 1.03).**

So let's delve into the anatomy of these functions and see how they work.

**Types of Analytic Functions**

There are around 26 analytic functions, and they fall into three categories.

Ranking functions : These are typically used for TOP-N types of queries. An example of this type is the ROW_NUMBER() function we have just used.
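To show the TOP-N pattern in action, here is a sketch in Python against SQLite (same ANSI window syntax as Oracle), using made-up data: rank the rows inside each partition, then filter on the rank in an outer query to keep only the top earner per department.

```python
import sqlite3

# TOP-N per group: ROW_NUMBER() ranks rows inside each department by
# salary, and the outer query keeps only rank 1.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, deptno INTEGER, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?,?,?)",
                [("CLARK", 10, 2450), ("KING", 10, 5000), ("MILLER", 10, 1300),
                 ("SMITH", 20, 800), ("FORD", 20, 3000)])

top_paid = con.execute("""
    SELECT deptno, ename, sal
      FROM (SELECT deptno, ename, sal,
                   ROW_NUMBER() OVER (PARTITION BY deptno
                                      ORDER BY sal DESC) AS rn
              FROM emp)
     WHERE rn = 1
     ORDER BY deptno
""").fetchall()

print(top_paid)   # [(10, 'KING', 5000), (20, 'FORD', 3000)]
```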

Windowing functions : These are used to calculate different types of aggregates. They act on a window, or group, of data. SUM() is one such example we have used; others are COUNT(), AVG(), MIN(), MAX(), etc.

Reporting functions : These are very similar to the windowing functions, except that they act on the whole group or partition of data, whereas the windowing functions act on a window of the data. For example, in our query we used the windowing function:

sum(sal) over(partition by deptno order by ename)

If we remove the order by clause from the above we get a reporting function.

sum(sal) over(partition by deptno)

In the absence of the ORDER BY, the function is applied to every row in the group. With an ORDER BY clause, it is applied to a window.
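The windowing-versus-reporting distinction can be seen side by side. The sketch below (Python over SQLite, which shares the ANSI syntax; the data is invented) computes both forms of SUM() over the same partition: with ORDER BY each row sees a running total, without it every row sees the full group total.

```python
import sqlite3

# Same partition, two behaviours: ORDER BY makes SUM() a windowed running
# total; dropping it makes SUM() a reporting function (whole-group total).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, deptno INTEGER, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?,?,?)",
                [("ADAMS", 20, 1100), ("FORD", 20, 3000), ("JONES", 20, 2975)])

rows = con.execute("""
    SELECT ename,
           SUM(sal) OVER (PARTITION BY deptno ORDER BY ename) AS windowed,
           SUM(sal) OVER (PARTITION BY deptno)                AS reporting
      FROM emp
     ORDER BY ename
""").fetchall()

for r in rows:
    print(r)
```

Every row shows 7075 in the reporting column, while the windowed column climbs row by row toward that same total.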

**Architecture of Analytic Function**.

Syntax

FUNCTION_NAME(<argument>,<argument>,…)

OVER

(<Partition‐Clause> <Order‐by‐Clause> <Windowing Clause>)

OVER is the keyword that identifies the function as analytic. Without this keyword there would be no difference between the 'analytic' SUM() and the 'aggregate' SUM().

The clause that follows the OVER keyword defines the data set that the function will act upon.

Partition-Clause is an optional partitioning clause. It is used to break the resultset into the groups of data the function will act upon. If this clause is omitted, the function is applied to the entire resultset.

Order-by-Clause is mandatory for some analytic functions and optional for others. Functions like LAG and LEAD mandate the use of an order-by clause, as they need an ordered set of data to access the next and previous rows. This clause is also mandatory when using a windowing clause, as explained above.

Windowing-Clause is again an optional clause, and a bit confusing at first. It gives us a way to define a sliding or anchored window of data, within a group, on which the function will operate. There are different keywords associated with this clause, such as RANGE BETWEEN, UNBOUNDED PRECEDING, CURRENT ROW, etc.
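As a first taste before the next thread covers it in full, here is a sliding window in action: a 3-row moving average, sketched in Python over SQLite (same ANSI frame syntax as Oracle) with invented sample data.

```python
import sqlite3

# ROWS BETWEEN 2 PRECEDING AND CURRENT ROW defines a sliding window of at
# most three rows: the current row and the two before it in the ordering.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE s (day INTEGER, amt INTEGER)")
con.executemany("INSERT INTO s VALUES (?,?)",
                [(1, 10), (2, 20), (3, 30), (4, 40)])

mov = con.execute("""
    SELECT day,
           AVG(amt) OVER (ORDER BY day
                          ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) AS mov_avg
      FROM s
     ORDER BY day
""").fetchall()

print(mov)
```

The first two rows average over a shorter window (one and two rows) because fewer preceding rows exist; from day 3 onward the window holds exactly three rows.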

We will delve more into this clause in the next thread.

**List of Analytic Functions**

AVG

Returns a running average

CORR

Returns the coefficient of correlation of a set of number pairs

COUNT

Returns a running count of all records, overall or by partition; can also be used to return a frequency distribution

COVAR_POP

Returns the population covariance of a set of number pairs

COVAR_SAMP

Returns the sample covariance of a set of number pairs

CUME_DIST

Returns the cumulative distribution of a value in a group of values

DENSE_RANK

Ranks items in a group leaving no gaps in ranking sequence when there are ties

FIRST

Returns the row ranked first using DENSE_RANK

FIRST_VALUE

Returns the first value in an ordered set of values. If the first value in the set is null, then the function returns NULL unless you specify IGNORE NULLS

LAG

LAG provides access to more than one row of a table at the same time without a self-join. Given a series of rows returned from a query and a position of the cursor, LAG provides access to a row at a given physical offset prior to that position

LAST

Returns the row ranked last using DENSE_RANK

LAST_VALUE

Returns the last value in an ordered set of values. If the last value in the set is null, then the function returns NULL unless you specify IGNORE NULLS. This setting is useful for data densification. If you specify IGNORE NULLS, then LAST_VALUE returns the first non-null value in the set, or NULL if all values are null.

LEAD

LEAD is the counterpart of LAG: it provides access to a row at a given physical offset beyond the current row's position
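LAG and LEAD are easiest to see side by side. A sketch in Python over SQLite (same ANSI syntax as Oracle, invented data) shows each salary alongside the previous and next salary in the ordered set, with no self-join:

```python
import sqlite3

# LAG(sal) looks one row back and LEAD(sal) one row ahead in the ordering;
# rows with no neighbour in that direction get NULL (None in Python).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?,?)",
                [("ADAMS", 1100), ("FORD", 3000), ("JONES", 2975)])

rows = con.execute("""
    SELECT ename, sal,
           LAG(sal)  OVER (ORDER BY ename) AS prev_sal,
           LEAD(sal) OVER (ORDER BY ename) AS next_sal
      FROM emp
     ORDER BY ename
""").fetchall()

for r in rows:
    print(r)
```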

MAX

Returns the maximum value by partition

MIN

Returns the minimum value by partition

NTILE

Divides an ordered data set into a number of buckets indicated by expr and assigns the appropriate bucket number to each row. The buckets are numbered 1 through expr. The expr value must resolve to a positive constant for each partition.
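A quick sketch of NTILE in Python over SQLite (same ANSI syntax as Oracle, invented salaries): six rows dealt into three buckets by salary.

```python
import sqlite3

# NTILE(3) assigns each ordered row a bucket number from 1 to 3,
# keeping the bucket sizes as equal as possible (two rows each here).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?,?)",
                [("A", 800), ("B", 1100), ("C", 1300),
                 ("D", 2450), ("E", 2975), ("F", 5000)])

buckets = con.execute("""
    SELECT ename, NTILE(3) OVER (ORDER BY sal) AS tercile
      FROM emp
     ORDER BY sal
""").fetchall()

print(buckets)
```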

OVER PARTITION BY

Partitions resultset

PERCENT_RANK

For a row r, PERCENT_RANK calculates the rank of r minus 1, divided by 1 less than the number of rows being evaluated (the entire query result set or a partition).

PERCENTILE_CONT

Inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification. Nulls are ignored in the calculation.

PERCENTILE_DISC

An inverse distribution function that assumes a discrete distribution model. It takes a percentile value and a sort specification and returns an element from the set. Nulls are ignored in the calculation.

RANK

Calculates the rank of a value in a group of values
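The difference between RANK and DENSE_RANK shows up on ties. This sketch (Python over SQLite, same ANSI syntax as Oracle, invented data) ranks salaries where two rows tie at 3000: RANK leaves a gap after the tie, DENSE_RANK does not.

```python
import sqlite3

# FORD and SCOTT tie at 3000: both get rank 2, but the next row gets
# RANK 4 (gap) versus DENSE_RANK 3 (no gap).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE emp (ename TEXT, sal INTEGER)")
con.executemany("INSERT INTO emp VALUES (?,?)",
                [("FORD", 3000), ("SCOTT", 3000),
                 ("JONES", 2975), ("KING", 5000)])

rows = con.execute("""
    SELECT ename, sal,
           RANK()       OVER (ORDER BY sal DESC) AS rnk,
           DENSE_RANK() OVER (ORDER BY sal DESC) AS dense_rnk
      FROM emp
     ORDER BY sal DESC, ename
""").fetchall()

for r in rows:
    print(r)
```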

RATIO_TO_REPORT

Computes the ratio of a value to the sum of a set of values. If expr evaluates to null, then the ratio-to-report value also evaluates to null.

REGR_ (Linear Regression) Functions

More on these in upcoming threads

STDDEV

Returns the sample standard deviation of an expression

STDDEV_POP

Computes the population standard deviation and returns the square root of the population variance

STDDEV_SAMP

Computes the cumulative sample standard deviation and returns the square root of the sample variance.

SUM

Computes a cumulative or running sum

VAR_POP

Returns the population variance of a set of numbers

VAR_SAMP

Returns the sample variance of a set of numbers

VARIANCE

Returns the variance of an expression

....To be continued in:

Analytic Functions Part II, The Windowing Clause

Analytic Functions III : Conclusion

Cheers

simply_dba