Extended Statistics
Chinar Aliyev
As you know, Oracle 11g introduced extended statistics to improve selectivity estimation for correlated columns.
But when and how does the query optimizer (QO) use these statistics? What are its restrictions? Let's see, step by step.
Correlated Columns
We will use the CUSTOMERS table in the SH schema. Let's run the following query:
SQL> SELECT *
  2  FROM customers a
  3  where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790;
Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  1115 |   197K|   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |  1115 |   197K|   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)

SQL>
Even when we use histograms, the QO cannot estimate the correct cardinality here, because of column correlation. To
solve this problem, the RDBMS should gather statistics for the combination (concatenation) of these columns.
For example, one of the most important statistics is NDV (number of distinct values); to find and store this statistic in the data
dictionary, the RDBMS should analyze both columns together (the number of distinct row groups).
But NDV alone is not enough to estimate the correct cardinality, because the data in the
concatenated correlated columns can be skewed. Even for a simple predicate like where col1=a1 and col2=a2 and ...
coln=an (call it p1), the QO should be able to estimate the selectivity of a column group that contains skewed
data. Therefore, in Oracle these column groups (of correlated columns) are mapped to an equivalent
virtual column, and the QO uses the statistics of this virtual column to estimate the selectivity of predicate p1.
So what happens when extended statistics are created? To create them you can use CREATE_EXTENDED_STATS
or GATHER_TABLE_STATS with the METHOD_OPT option of the DBMS_STATS package. In our example
cust_state_province and country_id are correlated, so we can create a column group for
these columns as:
SQL> begin
  2    DBMS_STATS.GATHER_TABLE_STATS (
  3      'SH',
  4      'CUSTOMERS',
  5      estimate_percent => null,
  6      METHOD_OPT => 'FOR COLUMNS (CUST_STATE_PROVINCE, COUNTRY_ID) size 1');
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL>
If we enable SQL trace while creating the column group, we can find the statement below in the trace file.
Alter table "SH"."CUSTOMERS" add (SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ as
(sys_op_combined_hash (CUST_STATE_PROVINCE, COUNTRY_ID)) virtual BY USER for
statistics)
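The idea of mapping a column-group value to a single virtual-column value can be sketched in Python. This is only an illustration of the concept: sys_op_combined_hash is an internal Oracle function, and the SHA-256-based stand-in below is an assumption, not Oracle's actual algorithm.

```python
import hashlib

def combined_hash(*values):
    # Fold all column values into one deterministic hash, so each distinct
    # (col1, ..., coln) combination maps to exactly one virtual-column value.
    h = hashlib.sha256()
    for v in values:
        h.update(repr(v).encode())
        h.update(b"\x00")  # separator: ("ab", "c") must differ from ("a", "bc")
    return int.from_bytes(h.digest()[:8], "big")

# Equal inputs always yield the same virtual-column value;
# different inputs yield (with overwhelming probability) different values.
same = combined_hash("CA", 52790) == combined_hash("CA", 52790)
diff = combined_hash("CA", 52790) != combined_hash("CA", 52791)
```

This is exactly the property the optimizer needs: statistics gathered on the hashed virtual column describe the combinations of the underlying columns.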
It means that when creating extended statistics, Oracle first adds a virtual column and then gathers statistics for
this column. Why use a hash function? So far we have only seen p1 (known as point correlation). Every combination of col1,
col2, ..., coln values must correspond to exactly one unique value when the virtual column is created. So
there must be a relationship Y = F(col1, col2, ..., coln) between the correlated columns and the virtual column;
if this holds, the QO can estimate the selectivity of the column group. For every input combination, the
function F must generate a unique value. Simple concatenation of the columns would be enough in most
cases, but a hash function guarantees unique values, and this is the best option. Now we have
one column group without a histogram, while each individual column has a histogram. Let's see what happens in
this case for query (Q1).
As you can see, the QO detected the column group, but it does not use the virtual column statistics, because in this case the QO
does not estimate selectivity from them (in the trace file, the "Partial" entry for this column group (CG) is NULL). It uses the traditional
method to estimate the selectivity.
Both individual columns have FREQUENCY histograms:

CUST_STATE_PROVINCE endpoint actual values (excerpt): Brittany, Buenos Aires, CA, CO, CT
COUNTRY_ID endpoint values (excerpt): 52786, 52787, 52788, 52789, 52790, 52791
Sel(cust_state_province) = (11321-7980)/55500 = 3341/55500 = 0.06019
Sel(country_id) = (55412-36892)/55500 = 18520/55500 = 0.33369
Sel(cust_state_province and country_id) = 0.06019*0.33369 = 0.02008
Card = num_rows*sel = 55500*0.02008 = 1114.706 ~ 1115
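This arithmetic can be checked with a few lines of Python (a sketch; 55500 is num_rows of CUSTOMERS and the endpoint numbers are the ones from the frequency histograms quoted above):

```python
num_rows = 55500

# Frequency-histogram selectivity: (cumulative frequency of the value minus
# cumulative frequency of the previous value) / num_rows.
sel_state   = (11321 - 7980) / num_rows    # CUST_STATE_PROVINCE = 'CA'
sel_country = (55412 - 36892) / num_rows   # COUNTRY_ID = 52790

# Without column-group statistics the optimizer assumes independence:
sel_and = sel_state * sel_country
card = num_rows * sel_and
print(round(card))  # -> 1115, the estimate in the plan above
```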
Now let's gather histogram statistics for this column group and see what happens.
SQL> BEGIN
  2    DBMS_STATS.gather_table_stats
  3      ('SH',
  4       'CUSTOMERS',
  5       estimate_percent => NULL,
  6       method_opt => 'FOR COLUMNS (CUST_STATE_PROVINCE,COUNTRY_ID) size skewonly'
  7      );
  8  END;
  9  /
SQL> select column_name,num_distinct,histogram from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name='SYS_STU#S#WF25Z#QAHIHE#MOFFMM_';

COLUMN_NAME                    NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_          145 FREQUENCY

SQL>
Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  3341 |   629K|   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |  3341 |   629K|   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)

SQL>
As you can see, in this case the QO estimates the cardinality correctly, which the trace file also confirms.
How does the QO estimate this selectivity? It is a frequency histogram; if we enable SQL trace while the
histogram is being created, we can see the statement Oracle uses to build it.
Now consider the case where statistics are gathered for column group CG1 but not for CG2, so the selectivity
of CG2 has to be estimated. In this case the execution plan for the following query (Q2) was:
SQL> SELECT *
  2  FROM customers a
  3  where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790 and cust_city_id=51919;   (Q2)
CUST_CITY_ID frequency histogram endpoint values (excerpt): 51916, 51917, 51919, 51924, 51930, 51934, 51971
sel(CUST_CITY_ID) = (161-158)/num_buckets = 3/254 = 0.0118110236
sel(cg2) = sel(cg1)*sel(CUST_CITY_ID) = 7.11023622e-4
card = sel(cg2)*num_rows = 39.46
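The same arithmetic in Python (a sketch; 3341/55500 is the column-group selectivity from the frequency histogram above, and 254 is the implied num_buckets, since 3/254 = 0.0118110236):

```python
num_rows = 55500
sel_cg   = 3341 / num_rows    # sel of the (CUST_STATE_PROVINCE, COUNTRY_ID) group
sel_city = (161 - 158) / 254  # sel(CUST_CITY_ID) from its frequency histogram

# The three-column selectivity is the group selectivity times the
# selectivity of the remaining column:
card = num_rows * sel_cg * sel_city
print(round(card, 2))  # -> 39.46, i.e. ~39 rows in the plan
```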
Actually this method is clear: even without CG2, the QO would estimate the cardinality as 39, because our predicate is
CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790 and cust_city_id=51919, and the optimizer detects that there is a
column group with sufficient statistics. Therefore the predicate can be rewritten as SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ =
MOD (sys_op_combined_hash ('CA', 52790), 9999999999) and cust_city_id=51919, so the selectivity will be
sel(SYS_STU#S#WF25Z#QAHIHE#MOFFMM_)*sel(cust_city_id).
CG1 = ("CUST_STATE_PROVINCE","COUNTRY_ID","CUST_CITY_ID")
CG2 = ("CUST_STATE_PROVINCE","COUNTRY_ID")
CG3 = ("CUST_STATE_PROVINCE","CUST_CITY_ID")

Sel(CG1) = Sel(CG2)*Sel(CUST_CITY_ID)   (F1)
or
Sel(CG1) = Sel(CG3)*Sel(COUNTRY_ID)     (F2)
So which formula will the QO choose, and what is the choice based on? Let's look at the execution plan and the trace file.
EXTENSION_NAME                 EXTENSION                              HISTOGRAM
------------------------------ -------------------------------------- ---------------
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ ("CUST_STATE_PROVINCE","COUNTRY_ID")   FREQUENCY
SYS_STULHUROKG217F9$OWA1IEIZLA ("CUST_STATE_PROVINCE","CUST_CITY_ID") HEIGHT BALANCED
CorStrength(col1,col2,...,coln) = NDV(col1)*NDV(col2)*...*NDV(coln) / NDV(col1,col2,...,coln)

CorStrength(cust_state_province, cust_city_id) = 145*620/620 = 145
CorStrength(cust_state_province, country_id)   = 145*19/145  = 19
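CorStrength can be computed directly from the NDVs; a small Python sketch using the values above:

```python
from math import prod

def cor_strength(individual_ndvs, group_ndv):
    # CorStrength = product of the individual columns' NDVs
    # divided by the NDV of the column group.
    return prod(individual_ndvs) / group_ndv

cs_state_city    = cor_strength([145, 620], 620)  # (cust_state_province, cust_city_id)
cs_state_country = cor_strength([145, 19], 145)   # (cust_state_province, country_id)
print(cs_state_city, cs_state_country)  # -> 145.0 19.0
```

The larger CorStrength indicates the stronger correlation, so the optimizer prefers the decomposition that goes through the (cust_state_province, cust_city_id) group, i.e. formula F2.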
COUNTRY_ID frequency histogram endpoint values (excerpt): 52789, 52790, 52791

Therefore:
sel(country_id) = (55412-36892)/55500 = 0.33369
sel(p2) = sel(CG3)*sel(country_id) = 0.00393758
card = sel(p2)*num_rows = 0.00393758*55500 = 218.536 ~ 219
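Putting the pieces together in Python (a sketch; sel(p2) = 0.00393758 is the value reported above, and the choice of formula F2 follows from CorStrength(cust_state_province, cust_city_id) = 145 being larger than CorStrength(cust_state_province, country_id) = 19):

```python
num_rows = 55500
# F2 was chosen: sel(CG1) = sel(CG3) * sel(COUNTRY_ID)
sel_p2 = 0.00393758        # sel(CG3) * sel(country_id), as reported above
card = sel_p2 * num_rows
print(round(card))  # -> 219, the estimate in the plan
```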
create table t
as
select
trunc(dbms_random.value(0,25)) n1,
trunc(dbms_random.value(0,20)) n2,
lpad(rownum,10,'0') small_vc
from
all_objects
where
rownum <= 10000
;
Table created.
SQL> update t set n2=n1 where rownum<=9955;
9955 rows updated.
SQL> commit;
Commit complete.
SQL> begin
  2    dbms_stats.gather_table_stats(
  3      user,
  4      't',
  5      cascade => true,
  6      estimate_percent => null,
  7      method_opt => 'for all columns size 1 FOR COLUMNS (n1,n2) size 1');
  8  end;
  9  /
SQL> select
  2    count(*)
  3  from
  4    t t1,
  5    t t2
  6  where
  7    t1.n1 = t2.n1
  8  and t1.n2 = t2.n2
  9  ;
Execution Plan
----------------------------------------------------------
Plan hash value: 791582492

-----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |    12 |    32  (25)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |    12 |            |          |
|*  2 |   HASH JOIN         |      |  1470K|    16M|    32  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| T    | 10000 | 60000 |    12   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| T    | 10000 | 60000 |    12   (0)| 00:00:01 |
-----------------------------------------------------------------------------
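The 1470K join estimate can be reproduced arithmetically. The group NDV of 68 below is an assumption (the real value would come from the optimizer trace, which is not reproduced in the text); with it, the standard 1/NDV join selectivity applied to the (n1, n2) column group gives the figure in the plan:

```python
rows_t1 = rows_t2 = 10000
ndv_group = 68                  # assumed NDV of the (n1, n2) column group
join_sel = 1 / ndv_group        # join selectivity via the column group
card = rows_t1 * rows_t2 * join_sel
print(round(card))  # -> 1470588, i.e. ~1470K as in the plan
```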
I will not include the full text of the trace file here, only the necessary information.

Projections

The QO can also estimate cardinality using extended statistics during a GROUP BY operation.
Without using the column group, the plan for the GROUP BY query below estimated 1949 groups:

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |  1949 | 31184 |   408   (1)| 00:00:05 |
|   1 |  HASH GROUP BY     |           |  1949 | 31184 |   408   (1)| 00:00:05 |
|   2 |   TABLE ACCESS FULL| CUSTOMERS |       |   867K|   406   (1)| 00:00:05 |
--------------------------------------------------------------------------------

After creating the column group:

SQL> begin
  2    dbms_stats.gather_table_stats(
  3      'SH',
  4      'CUSTOMERS',
  5      estimate_percent => null,
  6      method_opt => 'FOR COLUMNS (CUST_STATE_PROVINCE, COUNTRY_ID) size 1');
  7  end;
  8  /

SQL> select CUST_STATE_PROVINCE, COUNTRY_ID from customers
  2  group by CUST_STATE_PROVINCE, COUNTRY_ID;
Execution Plan
----------------------------------------------------------
Plan hash value: 1577413243

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |   145 |  2320 |   408   (1)| 00:00:05 |
|   1 |  HASH GROUP BY     |           |   145 |  2320 |   408   (1)| 00:00:05 |
|   2 |   TABLE ACCESS FULL| CUSTOMERS |       |   867K|   406   (1)| 00:00:05 |
--------------------------------------------------------------------------------

With extended statistics, the GROUP BY cardinality is estimated as 145, the NDV of the column group.
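The GROUP BY effect can be simulated: when one column functionally depends on the other (each province belongs to exactly one country, as in CUSTOMERS), the true number of groups equals the NDV of the column group, while the product of the individual NDVs badly overestimates it. A sketch (the synthetic data below is an assumption that only mimics the CUSTOMERS distribution):

```python
import random

random.seed(42)
provinces = [f"P{i}" for i in range(145)]
country_of = {p: i % 19 for i, p in enumerate(provinces)}  # province -> one country

# 55500 rows, each province always paired with its own country
rows = [(p, country_of[p]) for p in random.choices(provinces, k=55500)]

ndv_province = len({p for p, _ in rows})   # 145
ndv_country  = len({c for _, c in rows})   # 19
ndv_group    = len(set(rows))              # 145: the column-group NDV

print(ndv_province * ndv_country)  # naive product: 2755, a big overestimate
print(ndv_group)                   # 145, the estimate with extended statistics
```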
Oracle can also detect candidate column groups automatically, based on column usage. Let's create a test table:

SQL> create table t_candidate
  2  as
  3  select
  4    trunc(dbms_random.value(0,25)) p1,
  5    trunc(dbms_random.value(0,20)) p2,
  6    lpad(rownum,10,'0') padding
  7  from
  8    all_objects
  9  where
 10    rownum <= 10000
 11  ;

Table created.
SQL> begin
  2    dbms_stats.gather_table_stats(
  3      user,
  4      't_candidate',
  5      estimate_percent => null);
  6  end;
  7  /

SQL> select count(*) from t_candidate where ...;

  COUNT(*)
----------
        19

SQL>
SQL> select * from table(dbms_xplan.display_cursor);

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------
SQL_ID ...
select count(*) from t_candidate where ...

--------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |    12 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_CANDIDATE |    20 |   120 |    12   (0)| 00:00:01 |
--------------------------------------------------------------------------------
SQL> select dbms_stats.report_col_usage('SH','T_CANDIDATE') from dual;

DBMS_STATS.REPORT_COL_USAGE('SH','T_CANDIDATE')
--------------------------------------------------------------------------------
LEGEND:
.......

EQ       : Used in single table EQuality predicate
RANGE    : Used in single table RANGE predicate
LIKE     : Used in single table LIKE predicate
NULL     : Used in single table is (not) NULL predicate
EQ_JOIN  : Used in EQuality JOIN predicate
JOIN     : Used in JOIN predicate
GROUP_BY : Used in GROUP BY expression
...............................................................................

###############################################################################
COLUMN USAGE REPORT FOR SH.T_CANDIDATE
......................................

1. P1       : EQ
2. P2       : EQ
3. (P1, P2) : FILTER
###############################################################################
SQL> select dbms_stats.create_extended_stats('SH','t_candidate') from dual;

DBMS_STATS.CREATE_EXTENDED_STATS('SH','T_CANDIDATE')
--------------------------------------------------------------------------------
###############################################################################
1. (P1, P2) : SYS_STUIV1F__U9NUVZ7#MDKL81$SY created
###############################################################################
SQL> exec dbms_stats.gather_table_stats('SH','t_candidate',method_opt=>'for all columns size skewonly for columns (p1,p2) size skewonly');

PL/SQL procedure successfully completed.

SQL> select column_name,num_distinct,histogram from user_tab_col_statistics where
  2  table_name='T_CANDIDATE';

COLUMN_NAME                    NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
P1                                       25 FREQUENCY
P2                                       20 FREQUENCY
PADDING                                       NONE
SYS_STUIV1F__U9NUVZ7#MDKL81$SY          500 NONE

SQL> select count(*) from t_candidate where ...;

  COUNT(*)
----------
        19
SQL> select * from table(dbms_xplan.display_cursor);

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------
SQL_ID ...
select count(*) from t_candidate where ...

--------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |    12 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_CANDIDATE |    20 |   120 |    12   (0)| 00:00:01 |
--------------------------------------------------------------------------------
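For T_CANDIDATE the column group buys nothing, and the arithmetic shows why: p1 and p2 are independent, so NDV(p1,p2) = NDV(p1)*NDV(p2) = 500, and the group-based selectivity equals the plain product of the individual selectivities. A sketch (the row count of 10000 is an assumption consistent with the 20-row estimate in the plan):

```python
num_rows = 10000          # assumed row count of t_candidate
ndv_p1, ndv_p2 = 25, 20
ndv_group = 500           # = 25 * 20: no correlation between p1 and p2

card_with_group = num_rows * (1 / ndv_group)              # via the column group
card_without    = num_rows * (1 / ndv_p1) * (1 / ndv_p2)  # independence assumption
print(card_with_group, card_without)  # both 20.0 -- identical estimates
```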
So this method does not discover only correlated columns: the candidate columns may also include non-correlated (independent) columns.
The QO can also use column group statistics through a composite index, without the group being explicitly added to the data dictionary.
In this case the selectivity is calculated based on the DISTINCT_KEYS of the index (but I have not fully investigated that).
Another question relates to SQL profiles (SQP) and correlated data. If there is column correlation,
you can use an SQL profile when a SQL Tuning Advisor task results in a profile being accepted. An SQL profile is a collection of
internal hints (like OPT_ESTIMATE); using an offline optimization method, it estimates selectivity/cardinality accurately
and gives the online optimizer the information it needs to choose the best plan. Finally, note that Oracle's QO still cannot
use extended statistics to estimate the selectivity of correlated columns for non-equality, range, and out-of-range
predicates; such cases may require additional statistics (and gathering methods) and may be addressed in future releases.