Monday, January 12, 2009

How to improve the performance of a Correlated query ?

How to improve the performance of a Correlated query ?

In our regular development life, we come across many queries that has correlated sub queries.

Eg : select ename, sal, deptno
from emp e
where sal > (select avg (sal)
from emp
where deptno = e.deptno
group by deptno);

(also can be written as) :

select *
from (select ename, sal, deptno, avg (sal) over (partition by deptno) avg_sal
from emp)
where sal > avg_sal;

Sometimes, either the query becomes too nested or non-performant. Following blog explains the different approaches that can be used to achieve the same result, with improved performance.

As usual, let's start with a test case.

create table T(
ITEM_NO VARCHAR2(20),
QTY NUMBER,
BIN VARCHAR2(10),
EFFECTIVE_DATE DATE
);

insert into t values ('AC006',10,'DC001',to_date('02/01/2002','mm/dd/yyyy'));
insert into t values ('AC006',20,'DC002',to_date('02/01/2002','mm/dd/yyyy'));
insert into t values ('AC006',100,'DC001',to_date('05/01/2002','mm/dd/yyyy'));
insert into t values ('AC006',50,'DC002' to_date('05/02/2002','mm/dd/yyyy'));
insert into t values ('AC006',30,'DC003',to_date('05/03/2002','mm/dd/yyyy'));
insert into t values ('AC006',20,'DC008',to_date('05/04/2002','mm/dd/yyyy'));
insert into t values ('AC007',10,'DC001',to_date('02/01/2002','mm/dd/yyyy'));
insert into t values ('AC007',20,'DC002',to_date('02/01/2002','mm/dd/yyyy'));
insert into t values ('AC007',77,'DC001',to_date('05/01/2002','mm/dd/yyyy'));
insert into t values ('AC007',32,'DC002',to_date('05/02/2002','mm/dd/yyyy'));
insert into t values ('AC007',52,'DC003',to_date('05/03/2002','mm/dd/yyyy'));
insert into t values ('AC007',33,'DC008', to_date('05/04/2002','mm/dd/yyyy'));

commit;

select * from t;


Our requirement is simple : Get the total qty of an item in an inventory, where the qty should be considered of that item,
which has the highest effective date in a given bin.

-- We can achive this in 3 different ways :

-- #1 : Using co-related sub-query.

-- Table is referred twice .... so less performant ....
select item_no, sum(qty)
from t
where effective_date = (select max(effective_date)
from t t2
where t2.item_no = t.item_no
and t2.bin = t.bin)
group by item_no;


test@VEDALA> set autot trace exp stat;
test@VEDALA> select item_no, sum(qty)
2 from t
3 where effective_date = (select max(effective_date)
4 from t t2
5 where t2.item_no = t.item_no
6 and t2.bin = t.bin)
7 group by item_no;


Execution Plan
----------------------------------------------------------
Plan hash value: 1922650475

-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 41 | 22 (5)| 00:00:01 |
| 1 | HASH GROUP BY | | 1 | 41 | 22 (5)| 00:00:01 |
|* 2 | FILTER | | | | | |
| 3 | TABLE ACCESS FULL | T | 12 | 492 | 3 (0)| 00:00:01 |
| 4 | SORT AGGREGATE | | 1 | 28 | | |
|* 5 | TABLE ACCESS FULL| T | 1 | 28 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - filter("EFFECTIVE_DATE"= (SELECT MAX("EFFECTIVE_DATE") FROM "T"
"T2" WHERE "T2"."ITEM_NO"=:B1 AND "T2"."BIN"=:B2))
5 - filter("T2"."ITEM_NO"=:B1 AND "T2"."BIN"=:B2)

Note
-----
- dynamic sampling used for this statement


Statistics
----------------------------------------------------------
25 recursive calls
0 db block gets
136 consistent gets
0 physical reads
620 redo size
520 bytes sent via SQL*Net to client
381 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
3 sorts (memory)
0 sorts (disk)
2 rows processed


-- Approach#2 : Most famous Tom's technique, use max() to avoid co-related sub-query.

select substr(data, 14, 5) item_no, sum(to_number(substr(data, 19))) qty
from (select max(to_char(effective_date, 'YYYYMMDD') || rpad(bin, 5) ||rpad(item_no, 5)||to_char(qty)) data
from t
group by item_no, bin)
group by substr(data, 14, 5);

test@VEDALA> select substr(data, 14, 5) item_no, sum(to_number(substr(data, 19))) qty
2 from (select max(to_char(effective_date, 'YYYYMMDD') || rpad(bin, 5) ||rpad(item_no, 5)||to_char(qty)) data
3 from t
4 group by item_no, bin)
5 group by substr(data, 14, 5);


Execution Plan
----------------------------------------------------------
Plan hash value: 1926056118

-----------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
-----------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 12 | 372 | 5 (40)| 00:00:01 |
| 1 | HASH GROUP BY | | 12 | 372 | 5 (40)| 00:00:01 |
| 2 | VIEW | | 12 | 372 | 4 (25)| 00:00:01 |
| 3 | HASH GROUP BY | | 12 | 492 | 4 (25)| 00:00:01 |
| 4 | TABLE ACCESS FULL| T | 12 | 492 | 3 (0)| 00:00:01 |
-----------------------------------------------------------------------------

Note
-----
- dynamic sampling used for this statement


Statistics
----------------------------------------------------------
4 recursive calls
0 db block gets
17 consistent gets
0 physical reads
172 redo size
515 bytes sent via SQL*Net to client
381 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
2 rows processed

-- Approach#3 : Analytics - easy to read and performs well .. compare the plans

select item_no, sum(qty)
from (select distinct item_no,
bin,
first_value(qty) over(partition by item_no, bin order by effective_date desc) qty
from t)
group by item_no;

test@VEDALA> select item_no, sum(qty)
2 from (select distinct item_no,
3 bin,
4 first_value(qty) over(partition by item_no, bin order by effective_date desc) qty
5 from t)
6 group by item_no;

Execution Plan
----------------------------------------------------------
Plan hash value: 1317671949

------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 12 | 300 | 5 (40)| 00:00:01 |
| 1 | HASH GROUP BY | | 12 | 300 | 5 (40)| 00:00:01 |
| 2 | VIEW | | 12 | 300 | 5 (40)| 00:00:01 |
| 3 | HASH UNIQUE | | 12 | 492 | 5 (40)| 00:00:01 |
| 4 | WINDOW SORT | | 12 | 492 | 5 (40)| 00:00:01 |
| 5 | TABLE ACCESS FULL| T | 12 | 492 | 3 (0)| 00:00:01 |
------------------------------------------------------------------------------

Note
-----
- dynamic sampling used for this statement


Statistics
----------------------------------------------------------
4 recursive calls
0 db block gets
17 consistent gets
0 physical reads
172 redo size
520 bytes sent via SQL*Net to client
381 bytes received via SQL*Net from client
2 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
2 rows processed

Correlated query was costing 22 where as approach #2 or #3 costs only #5.

source : AskTom.

Long live "Tom Kyte".

Good Luck,
r-a-v-i