As Oracle has evolved through versions 7, 8 and 8i, the commands for generating statistics have evolved to allow better collection of statistics for the cost-based optimizer, culminating in the DBMS_STATS package from 8.1.5 onwards. However, one shortcoming was that if you wanted to use the "global" routines (that is, collection at the schema or database level), you were committed to using the same granularity to collect the stats for every segment. That is, you could ESTIMATE for all segments, or COMPUTE for all segments, but not a combination of the two. This is a pain when you might want to use COMPUTE for small segments but only an ESTIMATE for larger ones (or perhaps vary the estimation percentage depending on segment size).

Most of us thus wrote our own routines to cycle through DBA_SEGMENTS to perform the required work ourselves. Oracle have tried to address this in 9i, with the new DBMS_STATS.AUTO_SAMPLE_SIZE facility. In this mechanism, Oracle will try to determine an appropriate sample size for each segment encountered.
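A hand-rolled version of that approach might look something like the following sketch, which cycles through the segments in the current schema and varies the treatment by segment size. (The 100M threshold and the 10 percent estimate are arbitrary choices for illustration, not anything Oracle-prescribed.)

REM
REM Sketch only: COMPUTE for small tables, 10% ESTIMATE for large ones.
REM The threshold and percentage are arbitrary illustrative values.
REM
begin
  for seg in (select segment_name, bytes
              from   user_segments
              where  segment_type = 'TABLE') loop
    if seg.bytes < 100000000 then
      dbms_stats.gather_table_stats(
        ownname          => user,
        tabname          => seg.segment_name,
        estimate_percent => null);   -- null means COMPUTE
    else
      dbms_stats.gather_table_stats(
        ownname          => user,
        tabname          => seg.segment_name,
        estimate_percent => 10);     -- settle for a 10% sample
    end if;
  end loop;
end;
/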

Here is an example:

REM
REM First we create (roughly) a 100M table. The reason for the name THIN is that each row is
REM very thin, the reason for which will become apparent later in this note
REM
create table THIN ( x number(2)) pctfree 0;

REM
REM Stick about 100M of data in there
REM
declare
  byt number;
begin
  loop
    insert /*+ APPEND */ into THIN
    select mod(rownum,10) from sys.source$;
    commit;
    select bytes into byt
    from   user_segments
    where  segment_name = 'THIN';
    exit when byt > 100000000;
  end loop;
end;
/

REM
REM And now use the 'auto' sample size option
REM
begin
  dbms_stats.GATHER_TABLE_STATS(
    OWNNAME=>'MCDONAC',
    TABNAME=>'THIN',
    ESTIMATE_PERCENT=>dbms_stats.auto_sample_size);
end;
/

If we check USER_TABLES after this operation, we see that Oracle chose a sample size of 0.07. In fact, if we trace the operation, we can see the following work being done:

select /*+ */ count(*) from "MCDONAC"."THIN" sample block (.001)

select /*+ */ count(*) from "MCDONAC"."THIN" sample block (.1)

followed by the actual analyze work:

select /*+ */ count(*),count("X"),count(distinct "X"),sum(vsize("X")),
       substrb(dump(min("X"),16,0,32),1,120),
       substrb(dump(max("X"),16,0,32),1,120)
from   "MCDONAC"."THIN" sample (.0715773034877667881311816762103071317022) t

So the first thing you should immediately take heed of is that, in order to determine the sample size, DBMS_STATS will do *extra* work on your table, and this will have some performance overhead.

However, let us now try a "wide" table, also populated up to around 100M:

create table wide ( x number(2), p char(1000)) pctfree 50;

REM
REM Stick about 100M of data in there. In fact, there ended up being closer to 200M in this
REM table, since it looped around just under the 100M threshold
REM
declare
  byt number;
begin
  loop
    insert /*+ APPEND */ into wide
    select mod(rownum,10),'x' from sys.source$;
    commit;
    select bytes into byt
    from   user_segments
    where  segment_name = 'WIDE';
    exit when byt > 100000000;
  end loop;
end;
/

REM
REM And now use the 'auto' sample size option
REM
begin
  dbms_stats.GATHER_TABLE_STATS(
    OWNNAME=>'MCDONAC',
    TABNAME=>'WIDE',
    ESTIMATE_PERCENT=>dbms_stats.auto_sample_size);
end;
/

In this case, even though the table is nearly twice the size, the sample percentage used was 5.5 (nearly 80 times that used for table THIN). Also, the trace file reveals even more preliminary work being done:

select /*+ */ count(*) from "MCDONAC"."WIDE" sample block (.001)

select /*+ */ count(*) from "MCDONAC"."WIDE" sample block (.1)

select /*+ */ count(*) from "MCDONAC"."WIDE" sample block (10)

and then the actual analyze work:

select /*+ */ count(*),count("X"),count(distinct "X"),sum(vsize("X")),
       substrb(dump(min("X"),16,0,32),1,120),
       substrb(dump(max("X"),16,0,32),1,120),
       count("P"),count(distinct "P"),
       substrb(dump(min(substrb("P",1,32)),16,0,32),1,120),
       substrb(dump(max(substrb("P",1,32)),16,0,32),1,120)
from   "MCDONAC"."WIDE" sample (5.50220088035214085634253701480592236895) t

Thus it would appear that Oracle attempts to determine an approximate count of the rows in the table in order to assess an appropriate sample size. Whilst this is a reasonable strategy, it probably also needs to take into account the size of the segment (or at least an approximation of it) as well.
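If you want to see what percentage the automatic sampling actually settled on for a given table, the SAMPLE_SIZE and NUM_ROWS columns in USER_TABLES give an approximation after the gather completes (SAMPLE_SIZE records the number of rows sampled, NUM_ROWS the resulting row count estimate):

REM
REM Approximate the sample percentage used, from the columns
REM recorded by the statistics gathering
REM
select table_name,
       num_rows,
       sample_size,
       round(sample_size / num_rows * 100, 2) pct_sampled
from   user_tables
where  num_rows > 0;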

So whilst this feature looks attractive, be warned that:

- it can be skewed by very thin or very wide rows
- it will perform additional work in order to determine the sample size