
Interpreting Extended Statistics
Chinar Aliyev

As you know, Oracle 11g introduced extended statistics to improve selectivity estimation for correlated columns. But when and how does the query optimizer (QO) use these statistics, and what are its restrictions? Let's see, step by step.
Correlated Columns
We will use the CUSTOMERS table in the SH schema.

SQL> select count (*) from customers
  2  where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790;

  COUNT(*)
----------
      3341

SQL>

Without histograms the QO estimates the cardinality as follows:


SQL> BEGIN
  2    DBMS_STATS.gather_table_stats ('SH',
  3                                   'CUSTOMERS',
  4                                   estimate_percent => null,
  5                                   cascade          => true,
  6                                   method_opt       => 'FOR ALL COLUMNS SIZE 1'
  7                                   );
  8  END;
  9  /

SQL> SELECT *
  2    FROM customers a
  3    where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790;   (Q1)

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |    20 |  3620 |   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |    20 |  3620 |   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)

SQL>

With histograms the QO estimates the cardinality as follows:


SQL> BEGIN
  2    DBMS_STATS.gather_table_stats ('SH',
  3                                   'CUSTOMERS',
  4                                   estimate_percent => null,
  5                                   cascade          => true,
  6                                   method_opt       => 'FOR ALL COLUMNS SIZE skewonly'
  7                                   );
  8  END;
  9  /

SQL> SELECT *
  2    FROM customers a
  3    where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790;

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  1115 |   197K|   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |  1115 |   197K|   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)

SQL>

Even with histograms the QO cannot estimate the correct cardinality, because the columns are correlated. To solve this problem the RDBMS has to gather statistics for the combination (concatenation) of these columns. One of the most important such statistics is the NDV (number of distinct values); to find this statistic and store it in the data dictionary, the RDBMS has to analyze both columns together (counting the number of distinct value groups), as below.

SQL> SELECT COUNT (*) ndv
  2    FROM (SELECT cust_state_province, country_id
  3            FROM customers
  4           GROUP BY cust_state_province, country_id);

       NDV
----------
       145

SQL>

But the NDV alone is not enough to estimate the correct cardinality, because the data in the concatenated correlated columns may be skewed. Even while we are talking about a simple predicate such as col1=a1 and col2=a2 and ... and coln=an (p1), the QO has to estimate the selectivity of a column group that contains skewed data. For this reason, in Oracle these (correlated) column groups are mapped to an equivalent virtual column, and the QO then uses the statistics of that virtual column to estimate the selectivity of predicate p1. So what happens when extended statistics are created? You can create them using CREATE_EXTENDED_STATS, or GATHER_TABLE_STATS with the METHOD_OPT option, both in the DBMS_STATS package. In our example the cust_state_province and country_id columns are correlated, so we can create a column group for these columns as follows:
SQL> begin
  2    DBMS_STATS.GATHER_TABLE_STATS (
  3      'SH',
  4      'CUSTOMERS',
  5      estimate_percent => null,
  6      METHOD_OPT => 'FOR COLUMNS (CUST_STATE_PROVINCE, COUNTRY_ID) size 1');
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL>

If we enable SQL trace while the column group is being created, we can find the statement below.
Alter table "SH"."CUSTOMERS" add (SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ as
(sys_op_combined_hash (CUST_STATE_PROVINCE, COUNTRY_ID)) virtual BY USER for
statistics)

This means that when extended statistics are created, Oracle first adds a virtual column and then gathers statistics for that column. Why use a hash function? So far we have only looked at p1 (known as point correlation). Every combination of col1, col2, ..., coln values must correspond to exactly one value of the virtual column, i.e. there must be a functional relationship Y = F(col1, col2, ..., coln) between the correlated columns and the virtual column; if this holds, the QO can estimate the selectivity of the column group. For every distinct input, the function F must generate a unique value. Simple concatenation of the columns would be enough in most cases, but a hash function provides a stronger guarantee of uniqueness, which makes it the best option.
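A quick sketch of this functional mapping (sys_op_combined_hash is an internal, undocumented function; it is shown here only because Oracle itself uses it for the virtual column, as seen in the trace above):

-- Sketch: the same input combination always produces the same single hash value,
-- i.e. the virtual column really is Y = F(col1, ..., coln).
select sys_op_combined_hash ('CA', 52790) as y,
       mod (sys_op_combined_hash ('CA', 52790), 9999999999) as endpoint
  from dual;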
Now we have a column group without a histogram, while the individual columns have histograms. Let's see what happens to query (Q1) in this case.

Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  1115 |   210K|   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |  1115 |   210K|   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)

And from the trace file:

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for CUSTOMERS[A]
  Column (#11):
    NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#11): CUST_STATE_PROVINCE(
    AvgLen: 11 NDV: 145 Nulls: 0 Density: 0.000144
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  Column (#13):
    NewDensity:0.000676, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:19, NDV:19
  Column (#13): COUNTRY_ID(
    AvgLen: 5 NDV: 19 Nulls: 0 Density: 0.000676 Min: 52769 Max: 52791
    Histogram: Freq  #Bkts: 19  UncompBkts: 55500  EndPtVals: 19
  Column (#24): SYS_STU#S#WF25Z#QAHIHE#MOFFMM_(
    AvgLen: 12 NDV: 145 Nulls: 0 Density: 0.006897
  ColGroup (#1, VC) SYS_STU#S#WF25Z#QAHIHE#MOFFMM_
    Col#: 11 13    CorStregth: 19.00
  ColGroup Usage:: PredCnt: 2  Matches Full:  Partial:
  Table: CUSTOMERS  Alias: A
    Card: Original: 55500.000000  Rounded: 1115  Computed: 1114.87  Non Adjusted: 1114.87
  Access Path: TableScan
    Cost: 405.71  Resp: 405.71  Degree: 0
      Cost_io: 404.00  Cost_cpu: 35392510
      Resp_io: 404.00  Resp_cpu: 35392510

As you can see, the QO detected the column group, but it did not use the virtual column statistics: in this case the QO could not estimate the selectivity from the column group (in the trace file, both Full and Partial are empty for this column group (CG)). It used the traditional method to estimate the selectivity instead.

SQL> select num_rows, blocks from user_tables where table_name='CUSTOMERS';

  NUM_ROWS     BLOCKS
---------- ----------
     55500       1486

SQL> select num_distinct, histogram from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name in ('CUST_STATE_PROVINCE','COUNTRY_ID');

NUM_DISTINCT HISTOGRAM
------------ ---------------
         145 FREQUENCY
          19 FREQUENCY

SQL>

From the histogram for column CUST_STATE_PROVINCE:

Endpoint_number   Endpoint_Actual_value
---------------   ---------------------
           7650   Brittany
           7980   Buenos Aires
          11321   CA
          12098   CO
          12255   CT

From the histogram for column COUNTRY_ID:

Endpoint_number   Endpoint_value
---------------   --------------
          29169   52786
          29244   52787
          29335   52788
          36892   52789
          55412   52790
          55500   52791

They are frequency histograms. Therefore the selectivity will be:

Sel(cust_state_province) = (11321-7980)/55500 = 3341/55500 = 0.06019
Sel(country_id) = (55412-36892)/55500 = 18520/55500 = 0.33369
Sel(cust_state_province and country_id) = 0.06019*0.33369 = 0.02008
Card = num_rows*sel = 55500*0.02008 = 1114.706 ~ 1115
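The same arithmetic can be reproduced straight from the dictionary. A minimal sketch (assuming the frequency histograms gathered above are still in place; 55500 is the table's row count):

-- Sketch: frequency histogram selectivity of CUST_STATE_PROVINCE='CA',
-- computed as (EP('CA') - EP(preceding value)) / num_rows.
select (max(case when endpoint_actual_value = 'CA' then endpoint_number end)
      - max(case when endpoint_actual_value < 'CA' then endpoint_number end))
       / 55500 as sel_ca
  from user_tab_histograms
 where table_name  = 'CUSTOMERS'
   and column_name = 'CUST_STATE_PROVINCE';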

Now let's gather histogram statistics for this column group and see what happens.
SQL> BEGIN
  2    DBMS_STATS.gather_table_stats
  3      ('SH',
  4       'CUSTOMERS',
  5       estimate_percent => NULL,
  6       method_opt => 'FOR COLUMNS (CUST_STATE_PROVINCE,COUNTRY_ID) size skewonly'
  7      );
  8  END;
  9  /

SQL> select column_name, num_distinct, histogram from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name='SYS_STU#S#WF25Z#QAHIHE#MOFFMM_';

COLUMN_NAME                    NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_          145 FREQUENCY

SQL>
Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |  3341 |   629K|   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |  3341 |   629K|   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_STATE_PROVINCE"='CA' AND "COUNTRY_ID"=52790)

SQL>

As you can see, in this case the QO estimates the cardinality correctly. From the trace file:

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for CUSTOMERS[A]
  Column (#11):
    NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#11): CUST_STATE_PROVINCE(
    AvgLen: 11 NDV: 145 Nulls: 0 Density: 0.000144
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  Column (#13):
    NewDensity:0.000676, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:19, NDV:19
  Column (#13): COUNTRY_ID(
    AvgLen: 5 NDV: 19 Nulls: 0 Density: 0.000676 Min: 52769 Max: 52791
    Histogram: Freq  #Bkts: 19  UncompBkts: 55500  EndPtVals: 19
  Column (#24):
    NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#24): SYS_STU#S#WF25Z#QAHIHE#MOFFMM_(
    AvgLen: 12 NDV: 145 Nulls: 0 Density: 0.000144 Min: 22231259 Max: 9992664766
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  ColGroup (#1, VC) SYS_STU#S#WF25Z#QAHIHE#MOFFMM_
    Col#: 11 13    CorStregth: 19.00
  ColGroup Usage:: PredCnt: 2  Matches Full: #1  Partial: Sel: 0.0602
  Table: CUSTOMERS  Alias: A
    Card: Original: 55500.000000  Rounded: 3341  Computed: 3341.00  Non Adjusted: 3341.00
  Access Path: TableScan
    Cost: 405.74  Resp: 405.74  Degree: 0
      Cost_io: 404.00  Cost_cpu: 35837710
      Resp_io: 404.00  Resp_cpu: 35837710

How does the QO estimate the selectivity? The column group has a frequency histogram; if we enable SQL trace during histogram creation, we can see that Oracle builds the histogram with the statement below.

select substrb(dump(val,16,0,32),1,120) ep, cnt
from (select /*+ no_expand_table(t) index_rs(t) no_parallel(t)
             no_parallel_index(t) dbms_stats cursor_sharing_exact use_weak_name_resl
             dynamic_sampling(0) no_monitoring no_substrb_pad */
             mod("SYS_STU#S#WF25Z#QAHIHE#MOFFMM_",9999999999) val, count(*) cnt
      from "SH"."CUST" t
      where mod("SYS_STU#S#WF25Z#QAHIHE#MOFFMM_",9999999999) is not null
      group by mod("SYS_STU#S#WF25Z#QAHIHE#MOFFMM_",9999999999))
order by val

Therefore we can use the SQL below; the relevant part of the histogram is:

SELECT endpoint_number e
  FROM user_tab_histograms
 WHERE table_name = 'CUSTOMERS'
   AND column_name = 'SYS_STU#S#WF25Z#QAHIHE#MOFFMM_'
   AND endpoint_value = MOD (sys_op_combined_hash ('CA', 52790), 9999999999);

COLUMN_NAME                    ENDPOINT_NUMBER ENDPOINT_VALUE
------------------------------ --------------- --------------
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_           20225     4701058945
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_           21244     4752431017
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_           24585     4800861232
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_           24813     4861997875

Sel  = (24585-21244)/55500 = 0.06019
Card = num_rows*sel = 3341

Two Column Groups Case 1

Now assume we have two column groups:

CG1 = (cust_state_province, country_id)
CG2 = (cust_city_id, cust_state_province, country_id)

Statistics have been gathered for CG1 but not for CG2, so the selectivity of CG2 has to be estimated. In this case the execution plan was:
SQL> SELECT *
  2    FROM customers a
  3    where CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790 and cust_city_id=51919;   (Q2)

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |    39 |  8424 |   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |    39 |  8424 |   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_CITY_ID"=51919 AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790)
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for CUSTOMERS[A]
  Column (#11):
    NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#11): CUST_STATE_PROVINCE(
    AvgLen: 11 NDV: 145 Nulls: 0 Density: 0.000144
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  Column (#13):
    NewDensity:0.000676, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:19, NDV:19
  Column (#13): COUNTRY_ID(
    AvgLen: 5 NDV: 19 Nulls: 0 Density: 0.000676 Min: 52769 Max: 52791
    Histogram: Freq  #Bkts: 19  UncompBkts: 55500  EndPtVals: 19
  Column (#10):
    NewDensity:0.001189, OldDensity:0.002179 BktCnt:254, PopBktCnt:77, PopValCnt:34, NDV:620
  Column (#10): CUST_CITY_ID(
    AvgLen: 5 NDV: 620 Nulls: 0 Density: 0.001189 Min: 51040 Max: 52531
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 212
  Column (#26): SYS_STU4RAPXUESG1VO3#Q7ZH365D7(  NO STATISTICS (using defaults)
    AvgLen: 13 NDV: 1734 Nulls: 0 Density: 0.000577
  Column (#25):
    NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#25): SYS_STU#S#WF25Z#QAHIHE#MOFFMM_(
    AvgLen: 12 NDV: 145 Nulls: 0 Density: 0.000144 Min: 22231259 Max: 9992664766
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  ColGroup (#1, VC) SYS_STU#S#WF25Z#QAHIHE#MOFFMM_
    Col#: 11 13    CorStregth: 19.00
  ColGroup Usage:: PredCnt: 3  Matches Full: #1  Partial: Sel: 0.0602
  Table: CUSTOMERS  Alias: A
    Card: Original: 55500.000000  Rounded: 39  Computed: 39.46  Non Adjusted: 39.46
  Access Path: TableScan
    Cost: 405.70  Resp: 405.70  Degree: 0
      Cost_io: 404.00  Cost_cpu: 35045008
      Resp_io: 404.00  Resp_cpu: 35045008

For column group CG1 we already know how the selectivity is calculated, and

sel(CG2) = sel(CG1) * sel(cust_city_id)

SQL> select column_name, num_distinct, histogram from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name='CUST_CITY_ID';

COLUMN_NAME                    NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
CUST_CITY_ID                            620 HEIGHT BALANCED

SQL>

From the histogram information for this column:

Endpoint_number   Endpoint_value
---------------   --------------
            157   51916
            158   51917
            161   51919
            162   51924
            163   51930
            165   51934
            166   51971

sel(CUST_CITY_ID) = (161-158)/num_buckets = 3/254 = 0.0118110236
sel(CG2) = sel(CG1) * sel(CUST_CITY_ID) = 0.0602 * 0.0118110236 = 7.1102e-4
card = sel(CG2) * num_rows = 39.46
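The endpoint numbers used above come straight from the dictionary; a sketch of the lookup:

-- Sketch: the height-balanced buckets of CUST_CITY_ID around the value 51919.
select endpoint_number, endpoint_value
  from user_tab_histograms
 where table_name  = 'CUSTOMERS'
   and column_name = 'CUST_CITY_ID'
   and endpoint_value between 51916 and 51971
 order by endpoint_number;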

Actually the method is straightforward. Even without CG2, the QO estimates the cardinality as 39: for the predicate CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790 and cust_city_id=51919 the optimizer detects a column group with sufficient statistics, so the predicate can be rewritten as SYS_STU#S#WF25Z#QAHIHE#MOFFMM_ = MOD (sys_op_combined_hash ('CA', 52790), 9999999999) and cust_city_id=51919, and the selectivity becomes sel (SYS_STU#S#WF25Z#QAHIHE#MOFFMM_) * sel (cust_city_id).
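This equivalence is easy to check, because the extension is a real (hidden) virtual column and can be referenced by name; a sketch:

-- Sketch: filter on the column group's virtual column directly;
-- this should return the same 39-row neighborhood the optimizer estimated.
select count(*)
  from customers
 where "SYS_STU#S#WF25Z#QAHIHE#MOFFMM_" =
       mod (sys_op_combined_hash ('CA', 52790), 9999999999)
   and cust_city_id = 51919;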

Two Column Groups Case 2

Now assume we have three column groups:

CG1 = ("CUST_STATE_PROVINCE","COUNTRY_ID","CUST_CITY_ID")
CG2 = ("CUST_STATE_PROVINCE","COUNTRY_ID")
CG3 = ("CUST_STATE_PROVINCE","CUST_CITY_ID")

And our predicate is CUST_STATE_PROVINCE='CA' and COUNTRY_ID=52790 and CUST_CITY_ID=51919 (P2). How will the QO estimate the selectivity in this case? sel(CG1) has to be estimated, but CG1 has no statistics, while statistics were gathered for the two column groups CG2 and CG3. Following the previous example, the selectivity of CG1 can be estimated as either

Sel(CG1) = sel(CG2) * sel(CUST_CITY_ID)   (F1)

or

Sel(CG1) = sel(CG3) * sel(COUNTRY_ID)     (F2)

So which formula will the QO choose, and based on what? Let's look at the execution plan and the trace file.

SQL> select * from customers
  2  where CUST_STATE_PROVINCE='CA' and CUST_CITY_ID=51919 and COUNTRY_ID=52790;

Execution Plan
----------------------------------------------------------
Plan hash value: 2008213504

-------------------------------------------------------------------------------
| Id  | Operation         | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |           |   219 | 43800 |   406   (1)| 00:00:05 |
|*  1 |  TABLE ACCESS FULL| CUSTOMERS |   219 | 43800 |   406   (1)| 00:00:05 |
-------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   1 - filter("CUST_CITY_ID"=51919 AND "CUST_STATE_PROVINCE"='CA' AND
              "COUNTRY_ID"=52790)

SQL>
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for CUSTOMERS[CUSTOMERS]
  Column (#13):
    NewDensity:0.000676, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:19, NDV:19
  Column (#13): COUNTRY_ID(
    AvgLen: 5 NDV: 19 Nulls: 0 Density: 0.000676 Min: 52769 Max: 52791
    Histogram: Freq  #Bkts: 19  UncompBkts: 55500  EndPtVals: 19
  Column (#11):
    NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#11): CUST_STATE_PROVINCE(
    AvgLen: 11 NDV: 145 Nulls: 0 Density: 0.000144
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  Column (#10):
    NewDensity:0.001189, OldDensity:0.002179 BktCnt:254, PopBktCnt:77, PopValCnt:34, NDV:620
  Column (#10): CUST_CITY_ID(
    AvgLen: 5 NDV: 620 Nulls: 0 Density: 0.001189 Min: 51040 Max: 52531
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 212
  Column (#26): SYS_STU14HX98$V3_$3Z$ZSWQ0O8O0(
    AvgLen: 12 NDV: 620 Nulls: 0 Density: 0.001613
  Column (#25):
    NewDensity:0.001277, OldDensity:0.002351 BktCnt:254, PopBktCnt:62, PopValCnt:28, NDV:620
  Column (#25): SYS_STULHUROKG217F9$OWA1IEIZLA(
    AvgLen: 12 NDV: 620 Nulls: 0 Density: 0.001277 Min: 29269004 Max: 9981124071
    Histogram: HtBal  #Bkts: 254  UncompBkts: 254  EndPtVals: 221
  Column (#24):
    NewDensity:0.000144, OldDensity:0.000009 BktCnt:55500, PopBktCnt:55500, PopValCnt:145, NDV:145
  Column (#24): SYS_STU#S#WF25Z#QAHIHE#MOFFMM_(
    AvgLen: 12 NDV: 145 Nulls: 0 Density: 0.000144 Min: 22231259 Max: 9992664766
    Histogram: Freq  #Bkts: 145  UncompBkts: 55500  EndPtVals: 145
  ColGroup (#1, VC) SYS_STU14HX98$V3_$3Z$ZSWQ0O8O0
    Col#: 10 11 13    CorStregth: 2755.00
  ColGroup (#2, VC) SYS_STULHUROKG217F9$OWA1IEIZLA
    Col#: 10 11    CorStregth: 145.00
  ColGroup (#3, VC) SYS_STU#S#WF25Z#QAHIHE#MOFFMM_
    Col#: 11 13    CorStregth: 19.00
  ColGroup Usage:: PredCnt: 3  Matches Full: #2  Partial: Sel: 0.0118
  Table: CUSTOMERS  Alias: CUSTOMERS
    Card: Original: 55500.000000  Rounded: 219  Computed: 218.74  Non Adjusted: 218.74
  Access Path: TableScan

Extension_name                   Extension                                 Histogram
-------------------------------  ----------------------------------------  ---------------
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_   ("CUST_STATE_PROVINCE","COUNTRY_ID")      FREQUENCY
SYS_STULHUROKG217F9$OWA1IEIZLA   ("CUST_STATE_PROVINCE","CUST_CITY_ID")    HEIGHT BALANCED

SQL> select column_name, num_distinct, histogram from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name in ('CUST_STATE_PROVINCE','COUNTRY_ID','CUST_CITY_ID',
  4  'SYS_STULHUROKG217F9$OWA1IEIZLA','SYS_STU#S#WF25Z#QAHIHE#MOFFMM_');

COLUMN_NAME                    NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
SYS_STU#S#WF25Z#QAHIHE#MOFFMM_          145 FREQUENCY
SYS_STULHUROKG217F9$OWA1IEIZLA          620 HEIGHT BALANCED
CUST_CITY_ID                            620 HEIGHT BALANCED
CUST_STATE_PROVINCE                     145 FREQUENCY
COUNTRY_ID                               19 FREQUENCY

As you can see, the QO chose the SYS_STULHUROKG217F9$OWA1IEIZLA virtual column (Matches Full: #2). Why this one? Because its column group has the greater correlation strength (145 > 19). CorStrength indicates how strongly the columns of a column group are correlated, and the QO appears to derive it from the NDVs:

CorStrength(col1,col2,...,coln) = NDV(col1)*NDV(col2)*...*NDV(coln) / NDV(col1,col2,...,coln)

CorStrength(cust_state_province, cust_city_id) = 145*620/620 = 145
CorStrength(cust_state_province, country_id)   = 145*19/145  = 19
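These CorStrength values can be reproduced from the dictionary NDVs; a sketch (the quoted extension name is the one created above):

-- Sketch: CorStrength of (CUST_STATE_PROVINCE, CUST_CITY_ID) = 145*620/620 = 145.
select (s.num_distinct * c.num_distinct) / g.num_distinct as corstrength
  from user_tab_col_statistics s,
       user_tab_col_statistics c,
       user_tab_col_statistics g
 where s.table_name = 'CUSTOMERS' and s.column_name = 'CUST_STATE_PROVINCE'
   and c.table_name = 'CUSTOMERS' and c.column_name = 'CUST_CITY_ID'
   and g.table_name = 'CUSTOMERS' and g.column_name = 'SYS_STULHUROKG217F9$OWA1IEIZLA';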

From the histogram for COUNTRY_ID:

Endpoint_number   Endpoint_value
---------------   --------------
          36892   52789
          55412   52790
          55500   52791

Therefore:

sel(country_id) = (55412-36892)/55500 = 0.33369
sel(P2) = sel(CG3) * sel(country_id) = 0.0118 * 0.33369 = 0.00393758
Card = sel(P2) * num_rows = 0.00393758 * 55500 = 218.536 ~ 219

If CorStrength(col1,col2,...,coln) = 1, it means the columns col1,col2,...,coln are not correlated.

Extended Statistics and equijoin

The QO can also detect and use extended statistics in equijoin operations.
SQL> create table t
  2  as
  3  select
  4      trunc(dbms_random.value(0,25)) n1,
  5      trunc(dbms_random.value(0,20)) n2,
  6      lpad(rownum,10,'0') small_vc
  7  from
  8      all_objects
  9  where
 10      rownum <= 10000
 11  ;

Table created.

SQL> update t set n2=n1 where rownum<=9955;

9955 rows updated.

SQL> commit;

Commit complete.
SQL> begin
  2    dbms_stats.gather_table_stats(
  3      user,
  4      't',
  5      cascade => true,
  6      estimate_percent => null,
  7      method_opt => 'for all columns size 1 FOR COLUMNS (n1,n2) size 1');
  8  end;
  9  /

PL/SQL procedure successfully completed.

SQL> select
  2      count(*)
  3  from
  4      t t1,
  5      t t2
  6  where
  7      t1.n1 = t2.n1
  8  and t1.n2 = t2.n2
  9  ;

Execution Plan
----------------------------------------------------------
Plan hash value: 791582492

-----------------------------------------------------------------------------
| Id  | Operation           | Name | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT    |      |     1 |    12 |    32  (25)| 00:00:01 |
|   1 |  SORT AGGREGATE     |      |     1 |    12 |            |          |
|*  2 |   HASH JOIN         |      |  1470K|    16M|    32  (25)| 00:00:01 |
|   3 |    TABLE ACCESS FULL| T    | 10000 | 60000 |    12   (0)| 00:00:01 |
|   4 |    TABLE ACCESS FULL| T    | 10000 | 60000 |    12   (0)| 00:00:01 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - access("T1"."N1"="T2"."N1" AND "T1"."N2"="T2"."N2")

I will not reproduce the full trace file here, only the relevant part:

SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T[T2]
  Table: T  Alias: T2
    Card: Original: 10000.000000  Rounded: 10000  Computed: 10000.00  Non Adjusted: 10000.00
  Access Path: TableScan
    Cost: 12.09  Resp: 12.09  Degree: 0
      Cost_io: 12.00  Cost_cpu: 1956372
      Resp_io: 12.00  Resp_cpu: 1956372
  Best:: AccessPath: TableScan
         Cost: 12.09  Degree: 1  Resp: 12.09  Card: 10000.00  Bytes: 0
  Column (#4): SYS_STUBZH0IHA7K$KEBJVXO5LOHAS(
    AvgLen: 12 NDV: 68 Nulls: 0 Density: 0.014706
  ColGroup (#1, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
    Col#: 1 2    CorStregth: 9.19
  Column (#4): SYS_STUBZH0IHA7K$KEBJVXO5LOHAS(
    AvgLen: 12 NDV: 68 Nulls: 0 Density: 0.014706
  ColGroup (#1, VC) SYS_STUBZH0IHA7K$KEBJVXO5LOHAS
    Col#: 1 2    CorStregth: 9.19
  Join ColGroups for T[T1] and T[T2] : (#1, #1)

Therefore the join selectivity will be 0.014706 (the density of the column group), and the final cardinality 0.014706 * 10000 * 10000 = 1470600.
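In other words, the column group density is used as the join selectivity, which here is simply 1/NDV(n1,n2) = 1/68, rounded to 0.014706 in the trace. A sketch of the arithmetic:

-- Sketch: join cardinality = sel * card(T1) * card(T2).
select round (0.014706 * 10000 * 10000) as est_join_card from dual;   -- 1470600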

Projections

The QO can also estimate cardinality using extended statistics during GROUP BY operations.

SQL> select count(*) from (
  2  select count(*) from customers
  3  group by CUST_STATE_PROVINCE,COUNTRY_ID);

  COUNT(*)
----------
       145

SQL> select column_name, num_distinct from user_tab_col_statistics
  2  where table_name='CUSTOMERS'
  3  and column_name in ('CUST_STATE_PROVINCE','COUNTRY_ID');

COLUMN_NAME                    NUM_DISTINCT
------------------------------ ------------
CUST_STATE_PROVINCE                     145
COUNTRY_ID                               19

SQL> select count(*) from customers
  2  group by CUST_STATE_PROVINCE,COUNTRY_ID;
Execution Plan
----------------------------------------------------------
Plan hash value: 1577413243

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |  1949 | 31184 |   408   (1)| 00:00:05 |
|   1 |  HASH GROUP BY     |           |  1949 | 31184 |   408   (1)| 00:00:05 |
|   2 |   TABLE ACCESS FULL| CUSTOMERS | 55500 |   867K|   406   (1)| 00:00:05 |
--------------------------------------------------------------------------------

SQL>

Without extended statistics the QO estimates the group-by cardinality as 145*19/sqrt(2) ~ 1948.1, rounded to 1949 in the plan.
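A sketch of that arithmetic (the sqrt(2) scaling per additional grouping column is the behavior observed here, not a documented formula):

-- Sketch: group-by cardinality estimate without a column group.
select 145 * 19 / sqrt(2) as est_groups from dual;   -- ~1948.1, shown as 1949 in the plan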


But with extended statistics the estimated cardinality is correct:
SQL> begin
  2    DBMS_STATS.GATHER_TABLE_STATS(
  3      'SH',
  4      'CUSTOMERS',
  5      estimate_percent => null,
  6      METHOD_OPT => 'FOR COLUMNS (cust_state_province,country_id) size 1');
  7  end;
  8  /

PL/SQL procedure successfully completed.

SQL> select count(*) from customers
  2  group by CUST_STATE_PROVINCE,COUNTRY_ID;

Execution Plan
----------------------------------------------------------
Plan hash value: 1577413243

--------------------------------------------------------------------------------
| Id  | Operation          | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |           |   145 |  2320 |   408   (1)| 00:00:05 |
|   1 |  HASH GROUP BY     |           |   145 |  2320 |   408   (1)| 00:00:05 |
|   2 |   TABLE ACCESS FULL| CUSTOMERS | 55500 |   867K|   406   (1)| 00:00:05 |
--------------------------------------------------------------------------------

Identifying candidate columns for column groups based on workload statistics

Oracle provides procedures for finding candidate columns for column groups. However, this method does not work from statistics or real data; it appears to find candidate columns simply from the workload recorded in dynamic performance views (such as v$sql and v$sql_plan). This means Oracle does not investigate the real correlation between the columns. Let's look at the example below.

SQL> create table t_candidate
  2  as
  3  select
  4      trunc(dbms_random.value(0,25)) p1,
  5      trunc(dbms_random.value(0,20)) p2,
  6      lpad(rownum,10,'0') padding
  7  from
  8      all_objects
  9  where
 10      rownum <= 10000
 11  ;

Table created.
SQL> begin
  2    dbms_stats.gather_table_stats(
  3      user,
  4      't_candidate',
  5      cascade => true,
  6      estimate_percent => null,
  7      method_opt => 'for all columns size 1');
  8  end;
  9  /

PL/SQL procedure successfully completed.


SQL> Exec DBMS_STATS.SEED_COL_USAGE(null,null,120);

PL/SQL procedure successfully completed.

SQL> select count(*) from t_candidate where p1=19 and p2=14;

  COUNT(*)
----------
        19

SQL>
SQL> select * from table(dbms_xplan.display_cursor);

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------
SQL_ID  9g4vdacy7pc62, child number 0
-------------------------------------
select count(*) from t_candidate where p1=19 and p2=14

Plan hash value: 374408457

----------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |    12 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_CANDIDATE |    20 |   120 |    12   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("P1"=19 AND "P2"=14))

19 rows selected.

SQL>

SQL> SET LONG 7000
SQL> SET LONGCHUNKSIZE 7000
SQL> SET LINESIZE 500
SQL> Select dbms_stats.report_col_usage('SH','t_candidate') from dual;

DBMS_STATS.REPORT_COL_USAGE('SH','T_CANDIDATE')
--------------------------------------------------------------------------------
LEGEND:
.......
EQ         : Used in single table EQuality predicate
RANGE      : Used in single table RANGE predicate
LIKE       : Used in single table LIKE predicate
NULL       : Used in single table is (not) NULL predicate
EQ_JOIN    : Used in EQuality JOIN predicate
NONEQ_JOIN : Used in NON EQuality JOIN predicate
FILTER     : Used in single table FILTER predicate
JOIN       : Used in JOIN predicate
GROUP_BY   : Used in GROUP BY expression
...............................................................................
###############################################################################
COLUMN USAGE REPORT FOR SH.T_CANDIDATE
......................................
1. P1                 : EQ
2. P2                 : EQ
3. (P1, P2)           : FILTER
###############################################################################
SQL> select dbms_stats.create_extended_stats('SH','t_candidate') from dual;

DBMS_STATS.CREATE_EXTENDED_STATS('SH','T_CANDIDATE')
--------------------------------------------------------------------------------
###############################################################################
EXTENSIONS FOR SH.T_CANDIDATE
.............................
1. (P1, P2)           : SYS_STUIV1F__U9NUVZ7#MDKL81$SY created
###############################################################################
SQL> exec dbms_stats.gather_table_stats('SH','t_candidate',method_opt=>'for all columns size skewonly for columns (p1,p2) size skewonly');

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, histogram from user_tab_col_statistics
  2  where table_name='T_CANDIDATE';

COLUMN_NAME                    NUM_DISTINCT HISTOGRAM
------------------------------ ------------ ---------------
P1                                       25 FREQUENCY
P2                                       20 FREQUENCY
PADDING                               10000 HEIGHT BALANCED
SYS_STUIV1F__U9NUVZ7#MDKL81$SY          500 NONE

SQL> select count(*) from t_candidate where p1=19 and p2=14;

  COUNT(*)
----------
        19
SQL> select * from table(dbms_xplan.display_cursor);

PLAN_TABLE_OUTPUT
-------------------------------------------------------------------------------
SQL_ID  9g4vdacy7pc62, child number 0
-------------------------------------
select count(*) from t_candidate where p1=19 and p2=14

Plan hash value: 374408457

----------------------------------------------------------------------------------
| Id  | Operation          | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |             |       |       |    12 (100)|          |
|   1 |  SORT AGGREGATE    |             |     1 |     6 |            |          |
|*  2 |   TABLE ACCESS FULL| T_CANDIDATE |    20 |   120 |    12   (0)| 00:00:01 |
----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter(("P1"=19 AND "P2"=14))

SQL>

So this method does not discover only correlated columns; the resulting candidates can also include non-correlated (independent) columns. Note also that the QO can use column group statistics through a composite index, without an extension being explicitly added to the data dictionary; in that case the selectivity is calculated from the DISTINCT_KEYS of the index (I have not fully investigated this). Another question concerns SQL profiles (SQP) and correlated data. If there is column correlation, you can use an SQP when a SQL Tuning Advisor task recommends accepting a profile. An SQP is a collection of internal hints (such as opt_estimate); using an offline optimization method it estimates selectivity/cardinality accurately and gives the online optimizer the information it needs to choose the best plan. Finally, note that Oracle's QO still cannot use extended statistics to estimate the selectivity of correlated columns for non-equality, range, and out-of-range predicates; such cases may require additional statistics (and gathering methods), and may be addressed in future releases.
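As a quick illustration of the composite index variant (a sketch only; the index name is made up, and the DISTINCT_KEYS behavior is the observation above, not a documented formula):

-- Sketch: a composite index also gives the QO column group information;
-- DISTINCT_KEYS plays the role of the column group NDV.
create index cust_prov_country_ix
    on customers (cust_state_province, country_id);

select index_name, distinct_keys
  from user_indexes
 where index_name = 'CUST_PROV_COUNTRY_IX';   -- expect 145, the NDV of the pair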
