
Analytic Functions I : An introduction.

Discussion in 'SQL PL/SQL' started by simply_dba, Oct 17, 2008.

  1. simply_dba (Forum Advisor)
    Analytic Functions : An introduction

    Analytic functions are a set of special functions provided by Oracle to make our life easier. In today's world, businesses very commonly come up with requests for data whose retrieval using commonplace SQL becomes a brainstorming exercise and usually results in a poorly performing query. Some of these requests are:

    TOP-N queries, pivoting resultsets, moving averages, running totals, percentages within a group, and so on.

    With the help of analytic functions, we can achieve these tasks without much difficulty or complex designs and yet produce an optimally performing query.

    Let us start with a simple example of using an analytic function and see how it scores over conventional SQL.

    The following query calculates a running total of salaries, both for the whole table and within each department.

    Code (Text):
    break on deptno skip 2

    SELECT   ename, deptno, sal,
             SUM (sal) OVER (ORDER BY deptno, ename) running_total,
             SUM (sal) OVER (PARTITION BY deptno ORDER BY ename) department_total,
             ROW_NUMBER () OVER (PARTITION BY deptno ORDER BY ename) seq
        FROM emp
    ORDER BY deptno, ename;


    ENAME          DEPTNO        SAL RUNNING_TOTAL DEPARTMENT_TOTAL        SEQ
    ---------- ---------- ---------- ------------- ---------------- ----------
    CLARK              10       2450          2450             2450          1
    KING                        5000          7450             7450          2
    MILLER                      1300          8750             8750          3


    ADAMS              20       1100          9850             1100          1
    FORD                        3000         12850             4100          2
    JONES                       2975         15825             7075          3
    SCOTT                       3000         18825            10075          4
    SMITH                        800         19625            10875          5


    ALLEN              30       1600         21225             1600          1
    BLAKE                       2850         24075             4450          2
    JAMES                        950         25025             5400          3
    MARTIN                      1250         26275             6650          4
    TURNER                      1500         27775             8150          5
    WARD                        1250         29025             9400          6
     
    In the above code, we were able to compute a RUNNING_TOTAL for the entire query. This was done using the entire ordered resultset, via SUM(SAL) OVER (ORDER BY DEPTNO, ENAME). We were also able to compute a running total within each department, a total that would be reset at the beginning of the next department. The PARTITION BY DEPTNO in that SUM(SAL) caused this to happen: a partitioning clause was specified in the query in order to break the data up into groups.

    The ROW_NUMBER() function is used to sequentially number the rows returned in each group, according to our ordering criteria (a SEQ column was added in order to display this position). So, we see that SCOTT is the fourth row in department 20, when ordered by ENAME. This ROW_NUMBER() feature has many uses elsewhere, for example to transpose or pivot resultsets, as the sketch below illustrates.
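
    As a quick sketch of that pivoting idea, something like the following should spread the first three employees of each department (ordered by ENAME) across a single row per department. It runs against the same EMP demo table used above; the FIRST_EMP/SECOND_EMP/THIRD_EMP aliases are just illustrative.

    Code (Text):
    SELECT   deptno,
             MAX (DECODE (seq, 1, ename)) first_emp,
             MAX (DECODE (seq, 2, ename)) second_emp,
             MAX (DECODE (seq, 3, ename)) third_emp
        FROM (SELECT deptno, ename,
                     -- number the rows within each department, ordered by name
                     ROW_NUMBER () OVER (PARTITION BY deptno ORDER BY ename) seq
                FROM emp)
       WHERE seq <= 3
    GROUP BY deptno
    ORDER BY deptno;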

    The architecture of the function may look a bit complex at first, but it is really easy once you understand it. To give a flavour of how efficient these queries are, let us write a conventional query that produces the same output, run both against a table with 2000 rows, and compare the statistics.

    Code (Text):

    CREATE TABLE t AS
    SELECT object_name ename, MOD (object_id, 50) deptno, object_id sal
      FROM all_objects
     WHERE ROWNUM <= 2000;

    Table created.

    CREATE INDEX t_idx ON t(deptno,ename);

    set autotrace traceonly

    SELECT   ename, deptno, sal,
             SUM (sal) OVER (ORDER BY deptno, ename) running_total,
             SUM (sal) OVER (PARTITION BY deptno ORDER BY ename) department_total,
             ROW_NUMBER () OVER (PARTITION BY deptno ORDER BY ename) seq
        FROM t emp
    ORDER BY deptno, ename;

    2000 rows selected.

    Elapsed: 00:00:00.11

    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 2335850315

    ---------------------------------------------------------------------------
    | Id  | Operation          | Name | Rows  | Bytes | Cost (%CPU)| Time     |
    ---------------------------------------------------------------------------
    |   0 | SELECT STATEMENT   |      |  2000 | 46000 |     4   (0)| 00:00:01 |
    |   1 |  WINDOW SORT       |      |  2000 | 46000 |     4   (0)| 00:00:01 |
    |   2 |   TABLE ACCESS FULL| T    |  2000 | 46000 |     4   (0)| 00:00:01 |
    ---------------------------------------------------------------------------


    Statistics
    ----------------------------------------------------------
              0  recursive calls
              0  db block gets
             11  consistent gets
              0  physical reads
              0  redo size
          94487  bytes sent via SQL*Net to client
           1859  bytes received via SQL*Net from client
            135  SQL*Net roundtrips to/from client
              1  sorts (memory)
              0  sorts (disk)
           2000  rows processed



    SELECT   ename, deptno, sal,
             (SELECT SUM (sal)
                FROM t e2
               WHERE e2.deptno < emp.deptno
                  OR (e2.deptno = emp.deptno AND e2.ename <= emp.ename))
                                                                    running_total,
             (SELECT SUM (sal)
                FROM t e3
               WHERE e3.deptno = emp.deptno
                 AND e3.ename <= emp.ename) department_total,
             (SELECT COUNT (ename)
                FROM t e3
               WHERE e3.deptno = emp.deptno AND e3.ename <= emp.ename) seq
        FROM t emp
    ORDER BY deptno, ename;

    2000 rows selected.

    Elapsed: 00:00:01.03

    Execution Plan
    ----------------------------------------------------------
    Plan hash value: 92139442

    --------------------------------------------------------------------------------------
    | Id  | Operation                    | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
    --------------------------------------------------------------------------------------
    |   0 | SELECT STATEMENT             |       |  2000 | 46000 |     5  (20)| 00:00:01 |
    |   1 |  SORT AGGREGATE              |       |     1 |    23 |            |          |
    |*  2 |   TABLE ACCESS FULL          | T     |   102 |  2346 |     4   (0)| 00:00:01 |
    |   3 |  SORT AGGREGATE              |       |     1 |    23 |            |          |
    |   4 |   TABLE ACCESS BY INDEX ROWID| T     |     2 |    46 |     3   (0)| 00:00:01 |
    |*  5 |    INDEX RANGE SCAN          | T_IDX |     1 |       |     2   (0)| 00:00:01 |
    |   6 |  SORT AGGREGATE              |       |     1 |    20 |            |          |
    |*  7 |   INDEX RANGE SCAN           | T_IDX |     2 |    40 |     2   (0)| 00:00:01 |
    |   8 |  SORT ORDER BY               |       |  2000 | 46000 |     5  (20)| 00:00:01 |
    |   9 |   TABLE ACCESS FULL          | T     |  2000 | 46000 |     4   (0)| 00:00:01 |
    --------------------------------------------------------------------------------------

    Predicate Information (identified by operation id):
    ---------------------------------------------------

       2 - filter("E2"."DEPTNO"<:B1 OR "E2"."DEPTNO"=:B2 AND "E2"."ENAME"<=:B3)
       5 - access("E3"."DEPTNO"=:B1 AND "E3"."ENAME"<=:B2)
       7 - access("E3"."DEPTNO"=:B1 AND "E3"."ENAME"<=:B2)


    Statistics
    ----------------------------------------------------------
              1  recursive calls
              0  db block gets
          52811  consistent gets
              0  physical reads
              0  redo size
          94487  bytes sent via SQL*Net to client
           1859  bytes received via SQL*Net from client
            135  SQL*Net roundtrips to/from client
              1  sorts (memory)
              0  sorts (disk)
           2000  rows processed
    Just see the difference! The analytic version needed only 11 consistent gets and ran in 0.11 seconds, while the conventional query needed 52,811 consistent gets and took 1.03 seconds to produce the same result.

    So let's delve into the anatomy of these functions and see how they work.

    Types of Analytic Functions

    There are basically 26 analytic functions, and they fall into three broad categories.

    Ranking Functions: These are typically used for TOP-N types of queries. An example of this type is the row_number() function we have just used.

    Windowing Functions: These are used to calculate different types of aggregates, and they act on a window or group of data. sum() is one such example we have used; the others are count(), avg(), min(), max() and so on.

    Reporting Functions: These are very similar to the windowing functions, except that they act on the whole group or partition of data rather than on a window within it. For example, in our query we used the windowing function
    sum(sal) over(partition by deptno order by ename)
    If we remove the ORDER BY clause from the above, we get a reporting function:
    sum(sal) over(partition by deptno)
    In the absence of the ORDER BY, the function is applied to every row in the group. With an ORDER BY clause, it is applied to a window.
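
    To make the three categories concrete, a small sketch against the same EMP table could use one function of each kind in a single query (the column aliases are only illustrative):

    Code (Text):
    SELECT   deptno, ename, sal,
             -- ranking: position of the row within its department, highest salary first
             ROW_NUMBER () OVER (PARTITION BY deptno ORDER BY sal DESC) sal_rank,
             -- windowing: running total within the department (has an ORDER BY)
             SUM (sal) OVER (PARTITION BY deptno ORDER BY ename) running_dept_total,
             -- reporting: total for the whole department (no ORDER BY)
             SUM (sal) OVER (PARTITION BY deptno) dept_total
        FROM emp
    ORDER BY deptno, ename;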

    Architecture of an Analytic Function

    Syntax

    FUNCTION_NAME(<argument>, <argument>, ...)
    OVER
    (<Partition-Clause> <Order-by-Clause> <Windowing-Clause>)

    OVER is the keyword that identifies the function as analytic. Without this keyword there would be no difference between the 'analytic' sum() and the 'aggregate' sum().
    The clause that follows the OVER keyword defines the data set that the function will act upon.
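
    As a minimal sketch of that difference, again using EMP: the 'aggregate' SUM collapses the detail rows, while the 'analytic' SUM keeps every detail row and simply repeats the group total on each of them.

    Code (Text):
    -- 'Aggregate' SUM: one row per department, the detail rows are collapsed
    SELECT   deptno, SUM (sal) dept_total
        FROM emp
    GROUP BY deptno;

    -- 'Analytic' SUM: every detail row is kept, the department total is repeated on each row
    SELECT deptno, ename, sal,
           SUM (sal) OVER (PARTITION BY deptno) dept_total
      FROM emp;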

    Partition-Clause is an optional partitioning clause. It is used to divide the resultset into groups of data on which the function will act. If this clause is omitted, the function is applied to the entire resultset.

    Order-by-Clause is mandatory for some analytic functions and optional for others. Functions like LAG and LEAD mandate the use of an ORDER BY clause because they need an ordered set of data to access the next and previous rows. This clause is also mandatory when using a windowing function, as explained above.
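
    For example, a rough sketch of LAG and LEAD against EMP might look like this; the third argument (0) is just a default for rows that have no previous or next row within their partition.

    Code (Text):
    SELECT   deptno, ename, sal,
             -- salary of the previous row in the ordered set (0 if there is none)
             LAG  (sal, 1, 0) OVER (PARTITION BY deptno ORDER BY sal) prev_sal,
             -- salary of the next row in the ordered set (0 if there is none)
             LEAD (sal, 1, 0) OVER (PARTITION BY deptno ORDER BY sal) next_sal
        FROM emp
    ORDER BY deptno, sal;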

    Windowing-Clause is again an optional clause and is a bit confusing at first. It gives us a way to define a sliding or anchored window of data, within a group, on which the function will operate. There are different keywords associated with this clause, such as RANGE BETWEEN, UNBOUNDED PRECEDING and CURRENT ROW.
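
    As a small taste of what this clause looks like, the following sketch computes a moving average of salary over a window of up to three rows (the current row and the two preceding it) within each department:

    Code (Text):
    SELECT   deptno, ename, sal,
             AVG (sal) OVER (PARTITION BY deptno
                             ORDER BY ename
                             ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) moving_avg_3
        FROM emp
    ORDER BY deptno, ename;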

    We will delve more into this clause in my next thread.

    List of Analytic Functions
    AVG
    Returns a running average

    CORR
    Returns the coefficient of correlation of a set of number pairs

    COUNT
    Returns a running count of all records or a count by partition; can also be used to return a frequency distribution

    COVAR_POP
    Returns the population covariance of a set of number pairs

    COVAR_SAMP
    Returns the sample covariance of a set of number pairs

    CUME_DIST
    Returns the cumulative distribution of a value in a group of values

    DENSE_RANK
    Ranks items in a group leaving no gaps in ranking sequence when there are ties

    FIRST
    Returns the row ranked first using DENSE_RANK

    FIRST_VALUE
    Returns the first value in an ordered set of values. If the first value in the set is null, then the function returns NULL unless you specify IGNORE NULLS

    LAG
    LAG provides access to more than one row of a table at the same time without a self-join. Given a series of rows returned from a query and a position of the cursor, LAG provides access to a row at a given physical offset prior to that position

    LAST
    Returns the row ranked last using DENSE_RANK

    LAST_VALUE
    Returns the last value in an ordered set of values. If the last value in the set is null, then the function returns NULL unless you specify IGNORE NULLS. This setting is useful for data densification. If you specify IGNORE NULLS, then LAST_VALUE returns the last non-null value in the set, or NULL if all values are null.

    LEAD
    LEAD provides access to a row at a given physical offset beyond the current position

    MAX
    Returns the maximum value by partition

    MIN
    Returns the minimum value by partition

    NTILE
    Divides an ordered data set into a number of buckets indicated by expr and assigns the appropriate bucket number to each row. The buckets are numbered 1 through expr. The expr value must resolve to a positive constant for each partition.

    OVER PARTITION BY
    Partitions the resultset

    PERCENT_RANK
    For a row r, PERCENT_RANK calculates the rank of r minus 1, divided by 1 less than the number of rows being evaluated (the entire query result set or a partition).

    PERCENTILE_CONT
    Inverse distribution function that assumes a continuous distribution model. It takes a percentile value and a sort specification, and returns an interpolated value that would fall into that percentile value with respect to the sort specification. Nulls are ignored in the calculation.

    PERCENTILE_DISC
    An inverse distribution function that assumes a discrete distribution model. It takes a percentile value and a sort specification and returns an element from the set. Nulls are ignored in the calculation.

    RANK
    Calculates the rank of a value in a group of values

    RATIO_TO_REPORT
    Computes the ratio of a value to the sum of a set of values. If expr evaluates to null, then the ratio-to-report value also evaluates to null.

    REGR_ (Linear Regression) Functions
    More on these in upcoming threads

    STDDEV
    Returns the sample standard deviation of an expression

    STDDEV_POP
    Computes the population standard deviation and returns the square root of the population variance

    STDDEV_SAMP
    Computes the cumulative sample standard deviation and returns the square root of the sample variance.

    SUM
    Computes a cumulative or running sum

    VAR_POP
    Returns the population variance of a set of numbers

    VAR_SAMP
    Returns the sample variance of a set of numbers

    VARIANCE
    Returns the variance of an expression
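
    To tie a few of the functions listed above together, here is one last sketch on EMP: RANK and DENSE_RANK rank the salaries within each department, NTILE(4) splits each department into salary quartiles, and RATIO_TO_REPORT shows each salary as a fraction of the department total (the column aliases are only illustrative).

    Code (Text):
    SELECT   deptno, ename, sal,
             RANK ()               OVER (PARTITION BY deptno ORDER BY sal DESC) sal_rank,
             DENSE_RANK ()         OVER (PARTITION BY deptno ORDER BY sal DESC) sal_dense_rank,
             NTILE (4)             OVER (PARTITION BY deptno ORDER BY sal DESC) sal_quartile,
             RATIO_TO_REPORT (sal) OVER (PARTITION BY deptno)                   sal_ratio
        FROM emp
    ORDER BY deptno, sal DESC;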

    ....To be continued in:

    Analytic Functions Part II, The Windowing Clause

    Analytic Functions III : Conclusion
    Cheers
    :D