Extrapolation factors

For details on how to prepare the data transmission, see the further instructions and examples in the chapter "IFS Sampling design and extrapolation factors" (IFS-Integrated-Farm-Statistics, EC Extranet Wiki, europa.eu).

Variance estimation and quality rating system

Variance estimation

Variance is estimated in SAS with the built-in procedures proc surveymeans and proc surveyfreq:
  • proc surveymeans is used to estimate the coefficients of variation for totals of continuous variables and for ratios,
  • proc surveyfreq is used to estimate the standard errors for proportions and for shares.

Counts are first converted to proportions, and the standard errors are then estimated for the corresponding proportions. A count is converted to a proportion by taking the total number of holdings in the country as the denominator; this total is used as a fixed denominator for all proportions. This is important to keep in mind, because the proportion value depends on the denominator and the standard error value depends on the proportion value.

The following section presents some specific issues which should be considered when estimating variance, according to the discussion and conclusions of the 2017 Working Group.

The Eurobase tables in SAS have the following statistics calculated for each cell:

  • VALUE: the cell data (the total of the continuous variable, or the count)
  • TOTAL_WGT: the extrapolated number of holdings in the cell
  • CVSUM: the coefficient of variation for the cell data, calculated for totals of continuous variables; CVSUM=SQRT(VARSUM)/VALUE
  • VARSUM: the variance of the cell data, calculated for totals of continuous variables
  • SE: the standard error for the cell data, calculated for counts
  • VARPROP: the variance of the cell data, calculated for counts; VARPROP=SE**2
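As a purely illustrative reading of the two identities CVSUM=SQRT(VARSUM)/VALUE and VARPROP=SE**2, with made-up numbers that do not come from any Eurobase table: a cell with VALUE = 50 000 ha and VARSUM = 6 250 000 has a standard error of 2 500 ha, and a count cell is assumed to have SE = 0.004.

```latex
\begin{align*}
\mathrm{CVSUM} &= \frac{\sqrt{\mathrm{VARSUM}}}{\mathrm{VALUE}}
               = \frac{\sqrt{6\,250\,000}}{50\,000}
               = \frac{2\,500}{50\,000} = 0.05 \\
\mathrm{VARPROP} &= \mathrm{SE}^2 = 0.004^2 = 0.000016
\end{align*}
```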

Variance estimation formulae and SAS syntaxes

The disseminated variables are of different types:

  • totals of continuous variables (e.g. total number of hectares of cereals)
  • means of continuous variables (e.g. average number of hectares of cereals per holding)
  • counts (e.g. number of holdings with cereals)
  • proportions (e.g. proportion of holdings with cereals in the total number of holdings)

Eurostat disseminates totals of continuous variables, counts, ratios and shares in FSS and IFS.


TOTALS OF CONTINUOUS VARIABLES

Totals of continuous variables are those variables that have the unit of measurement (recorded in the UNIT field in the Eurobase tables of Eurostat) expressed in:

  • HA (hectares),
  • HD (heads),
  • THS_HEADS (1000 heads),
  • HIVE (hives),
  • LSU (livestock units),
  • EUR (euro),
  • PLC (places),
  • PERS (persons), or
  • AWU (annual work units).

Examples include permanent crops (HA), plants harvested green (HA), pigs (LSU), live bovine animals (HD), poultry (THS_HD), total livestock units (LSU), family labour force (AWU), non-family labour force (PERS) or laying hens (PLC).

Figure 55 – Totals of continuous variables in Eurobase

For these types of variables, Eurostat calculates the coefficients of variation.

Variance estimation in IFS

In Integrated Farm Statistics, the dataset fields make it possible to record information for practically all types of sampling designs, including those more complex than one-stage stratified random sampling.

Eurostat uses the ultimate cluster approximation to estimate variance. The formulae apply to all sampling designs, and also to censuses. In the case of a census, the weights are equal to 1 or slightly different from 1 (as a consequence of non-response adjustment and calibration).

Eurostat takes systematic sampling into account by computing "computational strata" from the values in the OSU_S1_* fields, where * stands for CORE, LAFO, RDEV, etc. This is done in the SAS syntaxes, before the formula is applied through proc surveymeans. To reduce the computation time:

  • only for the countries with field OSU_S1_* different from null, the SAS programme contains the whole set of SAS syntaxes presented below, including the computation and use of the field _CST (computational stratum) in the proc surveymeans procedure. 
  • for the rest of the countries (with field OSU_S1_* null), the SAS programme is reduced, and does not include the computation and use of the field _CST (computational stratum) in the proc surveymeans procedure. 

Also in order to reduce the computation time to a reasonable level:

  • where the number of strata is very high, Eurostat exceptionally computes the variance by regions and not by country. For 2020, this was the case for EL. This assumes that strata are nested into regions. 
  • where the data do not have the PSU_* field completed, Eurostat removes the option "Cluster" from the proc surveymeans procedure. This was the case for 2020 for all countries.
  • proc surveymeans excludes countries and strata in which all holdings have a weight equal to 1; such countries and strata eventually get a CVSUM of 0 or null. Because of this exclusion, the CVSUM reported by proc surveymeans must not be used: the SE calculated by the procedure is correct, but the VALUE (SUM), and consequently the CVSUM, is not. CVSUM must therefore be calculated outside proc surveymeans, by dividing SE by VALUE (SUM).
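The recalculation of CVSUM outside the procedure can be sketched as follows. This is a minimal illustration on made-up numbers, not the production code: the table and column names (cv_demo, cv_fixed, CELL, CVSUM_PROC, CVSUM_FIXED) are hypothetical, SE stands for the standard error returned by proc surveymeans, and VALUE stands for the full cell total computed separately from all holdings, including those in strata where every weight is 1.

```sas
/* Hypothetical cell-level inputs, for illustration only */
data cv_demo;
   input CELL $ SE VALUE CVSUM_PROC;
   datalines;
A 2500 50000 0.0625
B 0 1200 .
;
run;

/* Recompute the coefficient of variation outside the procedure */
data cv_fixed;
   set cv_demo;
   if VALUE > 0 then CVSUM_FIXED = SE / VALUE;
   else CVSUM_FIXED = .;
run;

proc print data=cv_fixed noobs;
run;
```

In this made-up case, cell A gets CVSUM_FIXED = 0.05 instead of the 0.0625 that would result from dividing by a total that omits the excluded strata, and cell B gets 0.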


ULTIMATE CLUSTER APPROXIMATION with stratification at the first stage

Variance and coefficient of variation of the total (general case, where weights may be different within a stratum as a consequence of non-response adjustment and calibration)
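The formula itself is not reproduced in this text version. As a reminder only, the stratified ultimate cluster (Taylor linearisation) estimator that the PROC SURVEYMEANS documentation gives for a total has the form below. The notation is assumed here and may differ from that of the original formula: h indexes strata, i primary sampling units, j holdings, w_hij is the weight, n_h the number of sampled primary units in stratum h, and f_h = n_h / N_h the first-stage sampling fraction derived from the TOTAL= dataset.

```latex
\begin{align*}
\hat{Y} &= \sum_{h}\sum_{i=1}^{n_h}\sum_{j} w_{hij}\, y_{hij} \\
\hat{V}(\hat{Y}) &= \sum_{h} (1 - f_h)\, \frac{n_h}{n_h - 1}
   \sum_{i=1}^{n_h} \left( y_{hi\cdot} - \bar{y}_{h\cdot\cdot} \right)^2,
   \qquad y_{hi\cdot} = \sum_{j} w_{hij}\, y_{hij},
   \quad \bar{y}_{h\cdot\cdot} = \frac{1}{n_h} \sum_{i=1}^{n_h} y_{hi\cdot} \\
\widehat{CV}(\hat{Y}) &= \frac{\sqrt{\hat{V}(\hat{Y})}}{\hat{Y}}
\end{align*}
```

For a domain d, y_hij is replaced by y_hij multiplied by the indicator that holding j belongs to d. In the SAS syntaxes, the strata h correspond to the combinations of STRA_ID_CORE and _CST.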

SAS syntaxes



Variance estimation at country level

The following is an example of SAS syntaxes that estimate the coefficients of variation for the total Utilised agricultural area (UAA) (HA) in Portugal in a table with dimensions FARMTYPE, SO_EUR (standard output size classes) and AGRAREA (utilised agricultural area size classes) and in a table with dimensions FARMTYPE and SO_EUR. Because UAA is a CORE variable, PROC SURVEYMEANS uses the Extrapolation factor of the CORE as weight.

*Determine the extrapolation factor and the coverage of the table ;

data DATA_CORE;

set IFSORA.IFS_T_MAIN_2020 ;

*Our table covers Portugal and the data in the CORE. It covers only the holdings in the main frame (HLD_FEF="0");
*We also can select 2 or more countries;

where (country="PT" and EXTPOL_FACT1_CORE is not missing and HLD_FEF="0");

*We make the following replacement, in order to allow valid computations also for censuses. The replacement is not made in the original dataset but in the intermediary table DATA_CORE;

if STRA_IDF_CORE ='_Z' then STRA_ID_CORE= 1 ;

*We replace null extrapolation factors with 1 but only to compute the "weight_core". The fields EXTPOL_FACT*_CORE remain unchanged;

*The first extrapolation factor for CORE is always completed and never null (unlike for the MODULES);

if missing(EXTPOL_FACT2_CORE) then EXTPOL_FACT2_CORE_n= 1 ; else EXTPOL_FACT2_CORE_n=EXTPOL_FACT2_CORE;

if missing(EXTPOL_FACT3_CORE) then EXTPOL_FACT3_CORE_n= 1 ; else EXTPOL_FACT3_CORE_n=EXTPOL_FACT3_CORE;

weight_core=EXTPOL_FACT1_CORE*EXTPOL_FACT2_CORE_n*EXTPOL_FACT3_CORE_n;

if OSU_S1_CORE=. then OSU_S1_CORE= 1 ;

if PSU_CORE=. then PSU_CORE=HLD_ID;

UAA=UAAS+UAAT;

run ;
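The weight computation above can be illustrated on a few made-up records. The sketch below is an equivalent, more compact formulation using the COALESCE function; the table name weight_demo and all input values are hypothetical.

```sas
/* Made-up extrapolation factors, where a dot denotes a null (missing) factor */
data weight_demo;
   input HLD_ID EXTPOL_FACT1_CORE EXTPOL_FACT2_CORE EXTPOL_FACT3_CORE;
   /* COALESCE returns the first non-missing argument, so a null factor counts as 1 */
   weight_core = EXTPOL_FACT1_CORE
               * coalesce(EXTPOL_FACT2_CORE, 1)
               * coalesce(EXTPOL_FACT3_CORE, 1);
   datalines;
1 10 . .
2 10 2 .
3 5 2 1.5
;
run;
```

The three records get weights of 10, 20 and 15, and the original EXTPOL_FACT*_CORE fields are left unchanged.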

**********************************************

PROC SURVEYMEANS;

**********************************************

Construction of computational strata, depending on OSU_S1_ and PSU_ ;

PROC SQL ;

  CREATE TABLE _CST1 AS SELECT

    COUNTRY, STRA_ID_CORE,PSU_CORE,MIN(OSU_S1_CORE) AS OSU_S1_CORE

    FROM DATA_CORE GROUP BY COUNTRY, STRA_ID_CORE,PSU_CORE;

QUIT ;

PROC SORT DATA=_CST1;

   BY COUNTRY STRA_ID_CORE OSU_S1_CORE;

RUN ;

***Within each formal stratum STRA_ID_:

The first record of each formal stratum receives _SEQ=1 then the following records of the same stratum receive _SEQ = an incrementing number by 1.

If OSU_S1 (the rank of systematic sampling) is incrementing (so a systematic sampling is used) and the record has _SEQ>2 (so the record is not the first and not the second, these first 2 records form the first computational stratum), then the formal stratum STRA_ID_ gets split i.e. a new computational stratum _CST is created within the formal stratum STRA_ID_. The process is iterative (DO - END). The number of records for each computational stratum is 2 (defined by _SEQ>2).

The split is not done before a record if that record is the last one in the formal stratum. This is done in order to avoid that the last computational stratum has less than 2 records. ;

DATA _CST;

  SET _CST1;

  RETAIN _CST 1 _SEQ;

  BY COUNTRY STRA_ID_CORE;

  P_OSU_S1_CORE=LAG(OSU_S1_CORE);

  IF FIRST.STRA_ID_CORE THEN DO;

               _SEQ= 1 ;

              P_OSU_S1_CORE= . ;

              _CST= 1 ;

  END;

  ELSE _SEQ+1 ;

  IF OSU_S1_CORE>P_OSU_S1_CORE AND _SEQ> 2 AND NOT(LAST.STRA_ID_CORE) THEN DO;

             _CST+1 ;

             _SEQ= 1 ;

END;

RUN ;
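To see how the split behaves, the DATA step above can be run on a small made-up input. The sketch below assumes one formal stratum with five PSUs whose systematic-sampling ranks OSU_S1_CORE increase from 1 to 5; the table names _CST1_DEMO and _CST_DEMO and all values are hypothetical.

```sas
/* Hypothetical input: one formal stratum, five PSUs, already sorted by rank */
data _CST1_DEMO;
   input COUNTRY $ STRA_ID_CORE PSU_CORE OSU_S1_CORE;
   datalines;
PT 1 101 1
PT 1 102 2
PT 1 103 3
PT 1 104 4
PT 1 105 5
;
run;

/* Same splitting logic as above, applied to the demo table */
data _CST_DEMO;
   set _CST1_DEMO;
   retain _CST 1 _SEQ;
   by COUNTRY STRA_ID_CORE;
   P_OSU_S1_CORE = lag(OSU_S1_CORE);
   if first.STRA_ID_CORE then do;
      _SEQ = 1;
      P_OSU_S1_CORE = .;
      _CST = 1;
   end;
   else _SEQ + 1;
   if OSU_S1_CORE > P_OSU_S1_CORE and _SEQ > 2 and not(last.STRA_ID_CORE) then do;
      _CST + 1;
      _SEQ = 1;
   end;
run;

proc print data=_CST_DEMO noobs;
   var PSU_CORE OSU_S1_CORE _CST;
run;
```

PSUs 101 and 102 fall in computational stratum 1 and PSUs 103 to 105 in computational stratum 2: the split is not made before the last record, so the last computational stratum holds three records rather than leaving a single record on its own.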

***Generalisation;

/****In case performance problems occur with the above syntax, please try using the below syntax instead of the above syntax where you can define _NCST (new computational strata) = 3 or 4 etc. instead of 2.

Please let us (the Eurostat farm structure team) know if you had to use this syntax and which value you took for _NCST

/*

%LET _NCST=3;

DATA _CST;

   SET _CST1;

   RETAIN _CST 1 _SEQ;

   BY COUNTRY STRA_ID_CORE;

  P_OSU_S1_CORE=LAG(OSU_S1_CORE);

  IF FIRST.STRA_ID_CORE THEN DO;

            _SEQ=1;

           P_OSU_S1_CORE=.;

           _CST=1;

END;

ELSE _SEQ+1;

IF OSU_S1_CORE>P_OSU_S1_CORE AND _SEQ>&_NCST. AND NOT(LAST.STRA_ID_CORE) THEN DO;

          _CST+1;

         _SEQ=1;

END;

DROP _SEQ P_OSU_S1_CORE ;

RUN;

*/

PROC SQL ;

   CREATE TABLE DATA_CORE AS SELECT A.*,B._CST

   FROM DATA_CORE A

   LEFT JOIN _CST B ON A.COUNTRY=B.COUNTRY AND A.STRA_ID_CORE=B.STRA_ID_CORE AND A.PSU_CORE=B.PSU_CORE;

QUIT ;

PROC SORT DATA=DATA_CORE;

   BY COUNTRY STRA_ID_CORE _CST;

RUN ;

proc sql ;

create table pop_STRA_ID as

select COUNTRY, STRA_ID_CORE, _CST, sum(weight_core) as _total_, count(*) as sample

from data_core group by COUNTRY, STRA_ID_CORE, _CST;

quit ;

data pop_STRA_ID;

set pop_STRA_ID;

if _total_ < sample then _total_=sample;

run ;

proc sort data=data_core;

by COUNTRY STRA_ID_CORE _CST;

run ;

ods exclude all;

proc surveymeans data=data_core total=pop_STRA_ID sum varsum cvsum clsum ;

by COUNTRY;

domain FARMTYPE*SO_EUR*AGRAREA FARMTYPE*SO_EUR;

var UAA;

strata STRA_ID_CORE _CST;

cluster PSU_CORE;

weight WEIGHT_CORE;

ods output domain=cv;

run ;

The result is the coefficients of variation stored in field CVSUM in the table cv.

Variance estimation in FSS

In the Farm Structure Surveys until 2016, the dataset fields allowed information to be recorded only for a census or for one-stage stratified random sampling.

The following table presents the formula for variance estimation for one-stage stratified random sampling. The formulae also apply in the case of a census, where the weights w_hi are equal to 1 or slightly different from 1 (as a consequence of non-response adjustment and calibration).

In order to reduce the computation time to a reasonable level:

  • where the number of strata is very high, Eurostat exceptionally computes the variance by regions and not by country.  For FSS, this was the case for a few countries. This assumes that strata are nested into regions.
  • proc surveymeans excludes countries and strata in which all holdings have a weight equal to 1; such countries and strata eventually get a CVSUM of 0 or null. Because of this exclusion, the CVSUM reported by proc surveymeans must not be used: the SE calculated by the procedure is correct, but the VALUE (SUM), and consequently the CVSUM, is not. CVSUM must therefore be calculated outside proc surveymeans, by dividing SE by VALUE (SUM).

ONE-STAGE STRATIFIED RANDOM SAMPLING

Variance and coefficient of variation of the total in domain d

(general case, where weights may be different within a stratum as consequence of non-response adjustment and calibration)


SAS syntaxes

Variance estimation at country level

The following is an example of SAS syntaxes that estimate the coefficients of variation for the total Utilised agricultural area (HA) in Portugal in a table with dimensions FARMTYPE, SO_EUR (standard output size classes) and AGRAREA (utilised agricultural area size classes) and in a table with dimensions FARMTYPE and SO_EUR.

PROC SURVEYMEANS uses the A09_Number field as weight. This is the extrapolation factor field (and not A10_Number) which should be used for Utilised agricultural area in Portugal. 

********************************;

* PROC SURVEYMEANS;

********************************;

data data_2016;

set FSSORA.FSS_T_MAIN_2016 ;

*Here we select only "PT", but we can also select 2 or more countries;

where country="PT";

* We make the following replacement, in order to allow valid computations also for censuses. The replacement is not made in the original dataset, but in the intermediary table data_2016;

if A09A_ID =. then A09A_ID =1 ;

run ;

proc sql ;

create table pop_STRA_ID as

select country, A09A_ID, sum(A09_Number) as _total_, count(*) as sample

from data_2016

group by country, A09A_ID;

quit ;


data pop_STRA_ID;

set pop_STRA_ID;

if _total_ < sample then _total_=sample;

run ;

proc sort data=data_2016;

by country A09A_ID;

run ;

proc surveymeans data=data_2016 total=pop_STRA_ID sum varsum cvsum clsum ;

by country;

domain FARMTYPE*SO_EUR*AGRAREA  FARMTYPE*SO_EUR;

var A_3_1_ha;

strata A09A_ID ;

weight A09_Number;

ods output domain=cv;

run ;

The result is the coefficients of variation stored in field CVSUM in the table cv.

Variance estimation in IFS and FSS for European aggregates 

Where:
  • the country-level term is the VARSUM at country level (REGION="COUNTRY"), for the same combination of dimension values as the EU aggregate in question. For example, if the line of the EU aggregate has dimensions FARMTYPE="FT23_SO", SO_EUR="KE15-24", AGRAREA="HA5-9", then the formula includes the 27 VARSUMs corresponding to the lines where REGION="COUNTRY", FARMTYPE="FT23_SO", SO_EUR="KE15-24", AGRAREA="HA5-9".
  • the EU-level term is the VALUE of the EU aggregate in question.
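The formula itself is not reproduced in this text version. A plausible reading of the two definitions above, assuming that the country estimates are independent so that their variances add up, is the following. The symbols c (country), VARSUM_c and VALUE_EU are notation introduced here for illustration and are not taken from the original.

```latex
\[
\mathrm{CVSUM}_{EU} \;=\; \frac{\sqrt{\sum_{c=1}^{27} \mathrm{VARSUM}_{c}}}{\mathrm{VALUE}_{EU}}
\]
```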


COUNTS

Counts are those variables that have the UNIT expressed in HLD such as holdings with livestock, number of holdings with pigs or number of holdings consuming more than 50% of the final production.

Figure 56 – Count variables in Eurobase

Counts are first converted to proportions and then the standard errors are estimated for the corresponding proportions. A count is converted to a proportion by considering the total number of holdings in the country as denominator. The total number of holdings in the country is used as fixed denominator for all proportions. This is important to keep in mind, because the proportion value depends on the denominator and the standard error value depends on the proportion value.


Example: the proportion of holdings with pigs is the number of holdings with pigs in the total number of holdings in the country.

The approach is to calculate the standard errors for the proportions (that correspond to the counts) and to flag the counts based on the standard errors of the corresponding proportions.
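As an illustration with made-up figures (not actual data), suppose a country has N = 290 000 holdings in total, of which an estimated 12 000 keep pigs, and 3 000 of those belong to one particular farm type. Both counts are converted with the same fixed denominator N. The symbols p and N are notation assumed here.

```latex
\begin{align*}
p_{\text{pigs}} &= \frac{12\,000}{290\,000} \approx 0.0414 \\
p_{\text{pigs, farm type}} &= \frac{3\,000}{290\,000} \approx 0.0103
\end{align*}
```

The standard error is then estimated for each proportion, and the corresponding count is flagged on that basis.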


Variance estimation in IFS

In Integrated Farm Statistics, the dataset fields make it possible to record information for practically all types of sampling designs, including those more complex than one-stage stratified random sampling.

Eurostat uses the ultimate cluster approximation to estimate variance. The formulae apply to all sampling designs, and also to censuses. In the case of a census, the weights are equal to 1 or slightly different from 1 (as a consequence of non-response adjustment and calibration).

Eurostat takes systematic sampling into account by computing "computational strata" from the values in the OSU_S1_* fields, where * stands for CORE, LAFO, RDEV, etc. This is done in the SAS syntaxes, before the formula is applied through proc surveyfreq. To reduce the computation time:

  • only for the countries with field OSU_S1_* different from null, the SAS programme contains the whole set of SAS syntaxes presented below, including the computation and use of the field _CST (computational stratum) in the proc surveyfreq procedure. 
  • for the rest of the countries (with field OSU_S1_* null), the SAS programme is reduced, and does not include the computation and use of the field _CST (computational stratum) in the proc surveyfreq procedure. 

Also in order to reduce the computation time to a reasonable level:

  • where the number of strata is very high, Eurostat exceptionally computes the variance by regions and not by country. For 2020, this was the case for EL. This assumes that strata are nested into regions. 
  • where the data do not have the PSU_* field completed, Eurostat removes the option "Cluster" from the proc surveyfreq procedure. This was the case for 2020 for all countries.
  • proc surveyfreq excludes countries and strata in which all holdings have a weight equal to 1; such countries and strata eventually get an SE of 0. proc surveyfreq computes STDDEV, and SE is then computed from STDDEV.

ULTIMATE CLUSTER APPROXIMATION with stratification at the first stage

Variance and standard error for proportion of domain d in the population

SAS syntaxes

Variance estimation at country level

The following is an example of SAS syntaxes for estimation of standard errors for Number of farms with utilised agricultural area (UAA) in Portugal in a table with dimensions FARMTYPE, SO_EUR (standard output size classes) and AGRAREA (utilised agricultural area size classes).

Because UAA is a CORE variable, PROC SURVEYFREQ uses the Extrapolation factor of the CORE as weight.

*******************************************************************;

* Determine the extrapolation factor and the coverage of the table;

*******************************************************************;


data DATA_CORE;

set IFSORA.IFS_T_MAIN_2020 ;


* Our table covers Portugal and the data in the CORE. It covers only the holdings in the main frame (HLD_FEF="0");

* We can also select 2 or more countries;

where (country="PT" and EXTPOL_FACT1_CORE is not missing and HLD_FEF="0");


* We make the following replacement, in order to allow valid computations also for censuses. The replacement is not made in the original dataset but in the intermediary table DATA_CORE;

if STRA_IDF_CORE ='_Z' then STRA_ID_CORE=1 ;


* We replace null extrapolation factors with 1 but only to compute the "weight_core". The fields EXTPOL_FACT*_CORE remain unchanged;

* The first extrapolation factor for CORE is always completed and never null (unlike for the MODULES);

if missing(EXTPOL_FACT2_CORE) then EXTPOL_FACT2_CORE_n=1; else EXTPOL_FACT2_CORE_n=EXTPOL_FACT2_CORE;

if missing(EXTPOL_FACT3_CORE) then EXTPOL_FACT3_CORE_n=1; else EXTPOL_FACT3_CORE_n=EXTPOL_FACT3_CORE;

weight_core=EXTPOL_FACT1_CORE*EXTPOL_FACT2_CORE_n*EXTPOL_FACT3_CORE_n;


if OSU_S1_CORE=. then OSU_S1_CORE=1;

if PSU_CORE=. then PSU_CORE=HLD_ID;


* UAA is computed as in the example for totals (assumption: it is not already present in the source table);

UAA=UAAS+UAAT;

keep country year HLD_ID STRA_ID_CORE PSU_CORE OSU_S1_CORE weight_core farmtype SO_EUR AGRAREA UAA ;


run;


*****************************************;

* PROC SURVEYFREQ;

******************************************;


* Construction of computational strata, depending on OSU_S1_ and PSU_;


PROC SQL;

   CREATE TABLE _CST1 AS SELECT COUNTRY, STRA_ID_CORE,PSU_CORE,MIN(OSU_S1_CORE) AS OSU_S1_CORE

   FROM DATA_CORE GROUP BY COUNTRY, STRA_ID_CORE,PSU_CORE;

QUIT;


PROC SORT DATA=_CST1;

   BY COUNTRY STRA_ID_CORE OSU_S1_CORE;

RUN;


*** Within each formal stratum STRA_ID_:

The first record of each formal stratum receives _SEQ=1 then the following records of the same stratum receive _SEQ = an incrementing number by 1.

If OSU_S1 (the rank of systematic sampling) is incrementing (so a systematic sampling is used) and the record has _SEQ>2 (so the record is not the first and not the second, these first 2 records form the first computational stratum),  then the formal stratum STRA_ID_ gets split i.e. a new computational stratum _CST is created within the formal stratum STRA_ID_.

The process is iterative (DO - END).

The number of records for each computational stratum is 2 (defined by _SEQ>2).

The split is not done before a record which is the last one in the formal stratum to avoid that the last computational stratum has less than 2 records.

;

DATA _CST;

   SET _CST1;

   RETAIN _CST 1 _SEQ;

   BY COUNTRY STRA_ID_CORE;

   P_OSU_S1_CORE=LAG(OSU_S1_CORE);

   IF FIRST.STRA_ID_CORE THEN DO;

      _SEQ=1;

      P_OSU_S1_CORE=.;

      _CST=1;

   END;

   ELSE _SEQ+1;

   IF OSU_S1_CORE>P_OSU_S1_CORE AND _SEQ>2 AND NOT(LAST.STRA_ID_CORE) THEN DO;

      _CST+1;

      _SEQ=1;

   END;    

*  DROP _SEQ P_OSU_S1_CORE ;

RUN;


*** Generalisation;

/**** In case performance problems occur with the above syntax, please try using the below syntax instead of the above syntax where you can define _NCST (new computational strata) = 3 or 4 etc.  instead of 2. Please let us (the farm team in E1) know if you had to use this syntax and which value you took for _NCST

/*


%LET _NCST=3;


DATA _CST;

   SET _CST1;

   RETAIN _CST 1 _SEQ;

   BY COUNTRY STRA_ID_CORE;

   P_OSU_S1_CORE=LAG(OSU_S1_CORE);

   IF FIRST.STRA_ID_CORE THEN DO;

      _SEQ=1;

      P_OSU_S1_CORE=.;

      _CST=1;

   END;

   ELSE _SEQ+1;

   IF OSU_S1_CORE>P_OSU_S1_CORE AND _SEQ>&_NCST. AND NOT(LAST.STRA_ID_CORE) THEN DO;

      _CST+1;

      _SEQ=1;

   END;    

   DROP _SEQ P_OSU_S1_CORE ;

RUN;

*/


PROC SQL;

   CREATE TABLE DATA_CORE AS SELECT A.*,B._CST

   FROM DATA_CORE A

   LEFT JOIN _CST B ON A.COUNTRY=B.COUNTRY AND A.STRA_ID_CORE=B.STRA_ID_CORE AND A.PSU_CORE=B.PSU_CORE;

QUIT;


proc sort data=data_core;

by COUNTRY HLD_ID STRA_ID_CORE _CST PSU_CORE OSU_S1_CORE weight_core farmtype SO_EUR AGRAREA UAA ;

run;


proc transpose data=data_core out=data_core_bis (rename=(col1=VALUE _name_=CROPS));

by COUNTRY HLD_ID STRA_ID_CORE _CST PSU_CORE OSU_S1_CORE weight_core farmtype SO_EUR AGRAREA UAA ;

var UAA ;

run;


* transform VALUE into binary variable ;


data data_core_bis;

set data_core_bis;

if value>0 then value=1; else value=0;

run;


data data_core_bis;

set data_core_bis ;

/*Case1: SE calculation for all breakdowns of the dimensions*/ 

dmn_CROPS=FARMTYPE||SO_EUR||AGRAREA||VALUE ; 

* We can write dmn_CROPS=REGIONS||FARMTYPE||SO_EUR||AGRAREA||VALUE if we want Standard error estimates by REGION also; 

 

/*Case2: SE calculation for FARMTYPE=TOTAL breakdown and all breakdowns of SO_EUR and AGRAREA*/ 

dmn_FARMTYPE=SO_EUR||AGRAREA||VALUE; 

 

/*Case3: SE calculation for SO_EUR=TOTAL breakdown and all breakdowns of FARMTYPE and AGRAREA*/ 

dmn_SO_EUR=FARMTYPE||AGRAREA||VALUE ; 

 

/*Case4: SE calculation for AGRAREA=TOTAL breakdown and all breakdowns of FARMTYPE and SO_EUR*/ 

dmn_AGRAREA=FARMTYPE||SO_EUR||VALUE ; 


run;
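The domain keys above are later split again with SCAN and a blank delimiter. With the || operator this relies on each character field keeping at least one trailing blank and on the numeric VALUE being converted to right-aligned text; if a value fills its variable completely, two components fuse into one word. Whether this matters depends on the field lengths in the actual dataset, which are not stated here. A minimal sketch of a more defensive alternative, using CATX with an explicit delimiter, is shown below; the table name dmn_demo, the *_OUT variables and the chosen lengths are hypothetical.

```sas
/* Hypothetical record, with lengths chosen so that SO_EUR fills its variable completely */
data dmn_demo;
   length FARMTYPE $8 SO_EUR $7 AGRAREA $8;
   FARMTYPE = 'FT23_SO';
   SO_EUR   = 'KE15-24';
   AGRAREA  = 'HA5-9';
   VALUE    = 1;

   /* CATX strips leading and trailing blanks and inserts exactly one delimiter */
   dmn_CROPS = catx(' ', FARMTYPE, SO_EUR, AGRAREA, VALUE);

   /* The four components can then be recovered unambiguously */
   FARMTYPE_OUT = scan(dmn_CROPS, 1, ' ');
   SO_EUR_OUT   = scan(dmn_CROPS, 2, ' ');
   AGRAREA_OUT  = scan(dmn_CROPS, 3, ' ');
   VALUE_OUT    = scan(dmn_CROPS, 4, ' ');
run;
```

With these made-up lengths, plain concatenation would join KE15-24 and HA5-9 into a single word, whereas the CATX key splits back into its four components.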


proc sql;

create table pop_STRA_ID as

select COUNTRY, CROPS,  STRA_ID_CORE,_CST, sum(weight_core) as _total_, count(*) as sample

from data_core_bis

group by COUNTRY, CROPS,  STRA_ID_CORE,_CST;

quit;


data pop_STRA_ID;

set pop_STRA_ID;

if _total_ < sample then _total_=sample;

run;


proc sort data=data_core_bis;

by COUNTRY CROPS STRA_ID_CORE _CST;

run;


/*Case1: SE computed for all breakdowns of dimensions*/ 


proc surveyfreq data=data_core_bis total=pop_STRA_ID ;

   tables dmn_CROPS ;

   by COUNTRY CROPS;

   strata STRA_ID_CORE _CST;

   cluster PSU_CORE;

   weight weight_core;

   ods output OneWay=sfreq_CROPS;

run;


* Keep only the SE where VALUE is 1 (meaning for the Number of holdings with the utilised agricultural area>0) ;


data sfreq_CROPS1 (where=(VALUE="1"));

set sfreq_CROPS;

FARMTYPE=scan(dmn_CROPS, 1, " ");

SO_EUR=scan(dmn_CROPS, 2, " ");

AGRAREA=scan(dmn_CROPS, 3, " ");

VALUE=scan(dmn_CROPS, 4, " ");

run;


The result is the standard errors stored in field SE in table sfreq_CROPS1.

/*Case2: SE computed for FARMTYPE=TOTAL breakdown and all breakdowns of SO_EUR and AGRAREA*/ 

proc surveyfreq data=data_core_bis total=pop_STRA_ID ; 

   tables dmn_FARMTYPE ; 

   by COUNTRY CROPS; 

   strata STRA_ID_CORE _CST;

   cluster PSU_CORE;

   weight weight_core;

   ods output OneWay=sfreq_CROPS; 

run;


data sfreq_CROPS2 (where=(VALUE="1"));

set sfreq_CROPS;

FARMTYPE='TOTAL';

SO_EUR=scan(dmn_FARMTYPE, 1, " ");

AGRAREA=scan(dmn_FARMTYPE, 2, " ");

VALUE=scan(dmn_FARMTYPE, 3, " ");

run;


The result is the standard errors stored in field SE in table sfreq_CROPS2.


/*Case3: SE computed for SO_EUR=TOTAL breakdown and all breakdowns of FARMTYPE and AGRAREA*/ 

proc surveyfreq data=data_core_bis total=pop_STRA_ID ;

   tables dmn_SO_EUR ; 

   by COUNTRY CROPS; 

   strata STRA_ID_CORE _CST;

   cluster PSU_CORE;

   weight weight_core;

   ods output OneWay=sfreq_CROPS; 

run;

 

data sfreq_CROPS3 (where=(VALUE="1"));

set sfreq_CROPS;

SO_EUR='TOTAL';

FARMTYPE=scan(dmn_SO_EUR, 1, " ");

AGRAREA=scan(dmn_SO_EUR, 2, " ");

VALUE=scan(dmn_SO_EUR, 3, " ");

run;


The result is the standard errors stored in field SE in table sfreq_CROPS3.

/*Case4: SE computed for AGRAREA=TOTAL breakdown and all breakdowns of FARMTYPE and SO_EUR*/ 

proc surveyfreq data=data_core_bis total=pop_STRA_ID ;

   tables dmn_AGRAREA ; 

   by COUNTRY CROPS; 

   strata STRA_ID_CORE _CST;

   cluster PSU_CORE;

   weight weight_core;

   ods output OneWay=sfreq_CROPS; 

run;

 

data sfreq_CROPS4 (where=(VALUE="1"));

set sfreq_CROPS;

AGRAREA='TOTAL';

FARMTYPE=scan(dmn_AGRAREA, 1, " ");

SO_EUR=scan(dmn_AGRAREA, 2, " ");

VALUE=scan(dmn_AGRAREA, 3, " ");

run;


The result is the standard errors stored in field SE in table sfreq_CROPS4.

Variance estimation in FSS

In the Farm Structure Surveys until 2016, the dataset fields allowed information to be recorded only for a census or for one-stage stratified random sampling.

The following table presents the formula for variance estimation for one-stage stratified random sampling. The formulae also apply in the case of a census, where the weights w_hi are equal to 1 or slightly different from 1 (as a consequence of non-response adjustment and calibration).

In order to reduce the computation time to a reasonable level:

  • where the number of strata is very high, Eurostat exceptionally computes the variance by regions and not by country. For FSS, this was the case for a few countries. This assumes that strata are nested into regions. 
  • proc surveyfreq excludes countries and strata in which all holdings have a weight equal to 1; such countries and strata eventually get an SE of 0. proc surveyfreq computes STDDEV, and SE is then computed from STDDEV.

ONE-STAGE STRATIFIED RANDOM SAMPLING

Variance and standard error of the proportion in domain d (general case, where weights may be different within a stratum as consequence of non-response adjustment and calibration)

SAS syntaxes

Variance estimation at country level

The following is an example of SAS syntaxes for estimation of standard errors for Number of farms with utilised agricultural area in Portugal in a table with dimensions FARMTYPE, SO_EUR (standard output size classes) and AGRAREA (utilised agricultural area size classes).

PROC SURVEYFREQ uses the A09_Number field as weight. This is the extrapolation factor field (and not A10_Number) which should be used for the Number of farms.

*****************************************;

*PROC SURVEYFREQ;

******************************************;


data data_2016;

set FSSORA.FSS_T_MAIN_2016 ;

*Here we select only "PT", but we can also select 2 or more countries;

where country="PT";

* We make the following replacement, in order to allow valid computations also for censuses. The replacement is not made in the original dataset, but in the intermediary table data_2016;

if A09A_ID =. then A09A_ID =1;

keep country year A08_ID A09A_ID A09_Number A10_Number Regions farmtype SO_EUR AGRAREA A_3_1_ha ;

run;


proc sort data=data_2016;

by COUNTRY A08_ID A09A_ID A09_Number A10_Number Regions farmtype SO_EUR AGRAREA A_3_1_ha ;

run;


proc transpose data=data_2016 out=data_2016_bis (rename=(col1=VALUE _name_=CROPS));

by country A08_ID A09A_ID A09_Number A10_Number Regions farmtype SO_EUR AGRAREA  A_3_1_ha  ;

var A_3_1_ha  ;

run;


* transform VALUE into binary variable ;


data data_2016_bis;

set data_2016_bis;

if value>0 then value=1; else value=0;

run;


data data_2016_bis;

set data_2016_bis ;


/*Case1: SE calculation for all breakdowns of the dimensions*/ 

dmn_CROPS=FARMTYPE||SO_EUR||AGRAREA||VALUE ; 

* We can write dmn_CROPS=REGIONS||FARMTYPE||SO_EUR||AGRAREA||VALUE if we want Standard error estimates by REGION also; 


/*Case2: SE calculation for FARMTYPE=TOTAL breakdown and all breakdowns of SO_EUR and AGRAREA*/ 

dmn_FARMTYPE=SO_EUR||AGRAREA||VALUE; 

 

/*Case3: SE calculation for SO_EUR=TOTAL breakdown and all breakdowns of FARMTYPE and AGRAREA*/ 

dmn_SO_EUR=FARMTYPE||AGRAREA||VALUE ; 

 

/*Case4: SE calculation for AGRAREA=TOTAL breakdown and all breakdowns of FARMTYPE and SO_EUR*/ 

dmn_AGRAREA=FARMTYPE||SO_EUR||VALUE ; 

run;


proc sql;

create table pop_STRA_ID as

select COUNTRY, CROPS, A09A_ID, sum(A09_Number) as _total_, count(*) as sample

from data_2016_bis

group by COUNTRY, CROPS, A09A_ID;

quit;


data pop_STRA_ID;

set pop_STRA_ID;

if _total_ < sample then _total_=sample;

run;


proc sort data=data_2016_bis;

by COUNTRY CROPS A09A_ID;

run;


/*Case1: SE computed for all breakdowns of dimensions*/ 

proc surveyfreq data=data_2016_bis total=pop_STRA_ID ; 

     tables dmn_CROPS ; 

     by COUNTRY CROPS; 

     strata A09A_ID; 

     weight A09_Number; 

     ods output OneWay=sfreq_CROPS; 

run;


* Keep only the SE where VALUE is 1 (meaning for the Number of holdings with the utilised agricultural area>0) ; 


data sfreq_CROPS1 (where=(VALUE="1"));

set sfreq_CROPS;

FARMTYPE=scan(dmn_CROPS, 1, " ");

SO_EUR=scan(dmn_CROPS, 2, " ");

AGRAREA=scan(dmn_CROPS, 3, " ");

VALUE=scan(dmn_CROPS, 4, " ");

run;


The result is the standard errors stored in field SE in table sfreq_CROPS1.

 

/*Case2: SE computed for FARMTYPE=TOTAL breakdown and all breakdowns of SO_EUR and AGRAREA*/ 

proc surveyfreq data=data_2016_bis total=pop_STRA_ID ; 

   tables dmn_FARMTYPE ; 

   by COUNTRY CROPS; 

   strata A09A_ID; 

   weight A09_Number; 

   ods output OneWay=sfreq_CROPS; 

run;

 

data sfreq_CROPS2 (where=(VALUE="1"));

set sfreq_CROPS;

FARMTYPE='TOTAL';

SO_EUR=scan(dmn_FARMTYPE, 1, " ");

AGRAREA=scan(dmn_FARMTYPE, 2, " ");

VALUE=scan(dmn_FARMTYPE, 3, " ");

run;


 The result is the standard errors stored in field SE in table sfreq_CROPS2.


/*Case3: SE computed for SO_EUR=TOTAL breakdown and all breakdowns of FARMTYPE and AGRAREA*/ 

proc surveyfreq data=data_2016_bis total=pop_STRA_ID ; 

   tables dmn_SO_EUR ; 

   by COUNTRY CROPS; 

   strata A09A_ID; 

   weight A09_Number; 

   ods output OneWay=sfreq_CROPS; 

run;


data sfreq_CROPS3 (where=(VALUE="1"));

set sfreq_CROPS;

SO_EUR='TOTAL';

FARMTYPE=scan(dmn_SO_EUR, 1, " ");

AGRAREA=scan(dmn_SO_EUR, 2, " ");

VALUE=scan(dmn_SO_EUR, 3, " ");

run;


The result is the standard errors stored in field SE in table sfreq_CROPS3.


/*Case4: SE computed for AGRAREA=TOTAL breakdown and all breakdowns of FARMTYPE and SO_EUR*/ 

proc surveyfreq data=data_2016_bis total=pop_STRA_ID ; 

   tables dmn_AGRAREA ; 

   by COUNTRY CROPS; 

   strata A09A_ID; 

   weight A09_Number; 

   ods output OneWay=sfreq_CROPS; 

run;

 

data sfreq_CROPS4 (where=(VALUE="1"));

set sfreq_CROPS;

AGRAREA='TOTAL';

FARMTYPE=scan(dmn_AGRAREA, 1, " ");

SO_EUR=scan(dmn_AGRAREA, 2, " ");

VALUE=scan(dmn_AGRAREA, 3, " ");

run;


The resulting standard errors are stored in field SE of table sfreq_CROPS4.

Variance estimation in IFS and FSS for European aggregates 

Where the formula uses:

  • the VARPROP at country level (REGION=”COUNTRY”), for the same combination of dimension values as the EU aggregate in question. For example, if the line of the EU aggregate has dimensions FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”, then the formula includes the 27 VARPROPs corresponding to the lines where REGION=”COUNTRY”, FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”.

  • the TOTAL_WGT at country level (REGION=”COUNTRY”), for the same combination of dimension values as the EU aggregate in question. For example, if the line of the EU aggregate has dimensions FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”, then the formula includes the 27 TOTAL_WGTs corresponding to the lines where REGION=”COUNTRY”, FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”.

SHARES

Variance estimation in IFS

In Integrated Farm Statistics the dataset fields offer the possibility to record information allowing for basically all types of sampling designs, including those more complex than one-stage stratified random sampling.

Eurostat uses the ultimate cluster approximation approach to estimate variance. The formulae apply to all sampling designs, and they also apply to censuses. In the case of a census, the weights are equal to 1 or slightly different from 1 (as a consequence of non-response adjustment and calibration).

Eurostat accounts for systematic sampling by calculating "computational strata" based on the values in the OSU_S1_* fields, where * stands for CORE, LAFO, RDEV, etc. This is done in the SAS syntaxes, before the formula is applied through the proc surveyfreq procedure. To reduce the computation time:

  • only for the countries with field OSU_S1_* different from null, the SAS programme contains the whole set of SAS syntaxes presented below, including the computation and use of the field _CST (computational stratum) in the proc surveyfreq procedure. 
  • for the rest of the countries (with field OSU_S1_* null), the SAS programme is reduced, and does not include the computation and use of the field _CST (computational stratum) in the proc surveyfreq procedure. 

Also in order to reduce the computation time to a reasonable level:

  • where the number of strata is very high, Eurostat exceptionally computes the variance by regions and not by country. For 2020, this was the case for EL. This assumes that strata are nested into regions. 
  • where the data do not have the PSU_* field completed, Eurostat removes the option "Cluster" from the proc surveyfreq procedure. This was the case for 2020 for all countries.



ULTIMATE CLUSTER APPROXIMATION with stratification at the first stage

Variance and standard error of the proportion of holdings with characteristic C in domain d (general case, where weights may differ within a stratum as a consequence of non-response adjustment and calibration)

 (SAS manual, surveyfreq, page 6325)
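
As a rough illustration of the kind of estimator referenced here, the following Python sketch computes a Taylor-linearised standard error for the share of holdings with characteristic C in a domain d. It is a minimal sketch and not Eurostat's implementation: it assumes a one-stage stratified design in which every holding is its own ultimate cluster, and the function name domain_share_se, the tuple layout of the input and the rates dictionary are hypothetical. Holdings outside the domain keep a linearised value of zero but still count towards the stratum sample size, which is how the random size of the domain sample enters the variance.

```python
import math
from collections import defaultdict


def domain_share_se(units, rates):
    """Share of domain holdings with a characteristic, and its standard error.

    units: list of (stratum, weight, in_domain, has_characteristic) tuples
           (hypothetical layout, one tuple per responding holding).
    rates: dict mapping stratum -> sampling rate (between 0 and 1).
    """
    # Weighted number of holdings in the domain, and of those with the characteristic.
    denom = sum(w for _, w, in_dom, _ in units if in_dom)
    numer = sum(w for _, w, in_dom, has_c in units if in_dom and has_c)
    share = numer / denom

    # Linearised value per holding; zero for holdings outside the domain.
    by_stratum = defaultdict(list)
    for stratum, w, in_dom, has_c in units:
        z = w * ((1.0 if has_c else 0.0) - share) / denom if in_dom else 0.0
        by_stratum[stratum].append(z)

    variance = 0.0
    for stratum, zs in by_stratum.items():
        n = len(zs)
        if n < 2:
            continue  # assumption of this sketch: single-unit strata contribute nothing
        mean = sum(zs) / n
        sum_sq = sum((z - mean) ** 2 for z in zs)
        # Finite population correction times the between-unit variability.
        variance += (1.0 - rates[stratum]) * n / (n - 1) * sum_sq
    return share, math.sqrt(variance)


# Example: one stratum, 10 holdings with weight 10 each, 4 with the characteristic.
example_units = [("A", 10.0, True, i < 4) for i in range(10)]
example_share, example_se = domain_share_se(example_units, {"A": 0.1})
```

For a single stratum with equal weights and a domain covering the whole stratum, the sketch reduces to the familiar (1 - f) p (1 - p) / (n - 1).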

SAS syntaxes

Variance estimation at country level

The following example of SAS code estimates the standard errors of shares in one country, Spain (ES). In the example, shares of categories are calculated for two categorical variables:

  • Categorical variable YOUNG_MAN (young farmers), which has 2 categories: ‘1’ (yes) for units for which AGE_MANAGER < 40, and ‘0’ (no) for other units.
  • Categorical variable OLDER_MAN (older farmers), which has 2 categories: ‘1’ (yes) for units for which AGE_MANAGER > 55, and ‘0’ (no) for other units.

Shares are calculated for the whole country and for each NUTS2 region (REGIONS). Since all variables used for the calculation of shares are CORE variables, PROC SURVEYFREQ uses the extrapolation factor of the CORE as weight.

**********************************************************************;

* Determine the extrapolation factor and the coverage of the table;

**********************************************************************;

Data micro;

   set IFSORA.IFS_T_MAIN_2020 (keep=YEAR COUNTRY HLD_ID STRA_ID_CORE STRA_IDF_CORE EXT_CORE PSU_CORE OSU_S1_CORE AGE_MANAGER NUTS2);

     ** Keep only data for ES;

     where country in ('ES');

     FORMAT REGIONS $32. WEIGHT 16.4 OLDER_MAN $1. YOUNG_MAN $1. ;

     ** EXT_CORE is a derived variable calculated in Eurostat as the product of EXTPOL_FACT1_CORE, EXTPOL_FACT2_CORE, EXTPOL_FACT3_CORE, where null values of EXTPOL_FACT2_CORE and EXTPOL_FACT3_CORE are replaced by 1 for the calculation of the product;

     WEIGHT = EXT_CORE ;

     IF OSU_S1_CORE= . THEN OSU_S1_CORE = 1 ;

     IF STRA_IDF_CORE ='_Z' THEN STRA_ID_CORE= 1 ;

     IF PSU_CORE = . THEN PSU_CORE = HLD_ID ;

     ***  Null values of PSU_CORE, STRA_ID_CORE and OSU_S1_CORE must be replaced with relevant values in order to avoid exclusion in SURVEYFREQ;

     IF AGE_MANAGER < 40 then YOUNG_MAN ='1' ;

     IF AGE_MANAGER >= 40 then YOUNG_MAN ='0';

     IF AGE_MANAGER > 55 then OLDER_MAN ='1' ;

     IF AGE_MANAGER <= 55 then OLDER_MAN ='0';

     *** Derivation of two categorical variables, for which the shares of categories will be calculated;

     ** YOUNG_MAN: shares of managers that are younger than 40 years;

     ** OLDER_MAN: shares of managers that are older than 55 years; 

     REGIONS = NUTS2 ;

     _ONEC='1';

     if weight>0 then Iweight=1;

     else Iweight=0;

     IF Year>0 and not missing (AGE_MANAGER) ;

RUN ;

*****************************************;

*PROC SURVEYFREQ;

******************************************;

** Construction of computational strata, depending on OSU_S1_ and PSU_ ;

PROC SQL;

     CREATE TABLE _CST1 AS SELECT COUNTRY, STRA_ID_CORE ,PSU_CORE, MIN(OSU_S1_CORE) AS OSU_S1_CORE

     FROM MICRO

     GROUP BY COUNTRY, STRA_ID_CORE ,PSU_CORE;

QUIT;


PROC SORT DATA=_CST1;

     BY COUNTRY STRA_ID_CORE OSU_S1_CORE;

RUN;


/* Within each formal stratum STRA_ID_:

The first record of each formal stratum receives _SEQ=1; each following record of the same stratum increments _SEQ by 1.

If OSU_S1 (the rank of systematic sampling) is increasing (so systematic sampling is used) and the record has _SEQ>2 (so it is neither the first nor the second record, which together form the first computational stratum), then the formal stratum STRA_ID_ is split, i.e. a new computational stratum _CST is created within it and _SEQ restarts at 1. The process repeats (DO - END), so each computational stratum has 2 records, except possibly the last one.

No split is made before the last record of the formal stratum, so that the last computational stratum never has fewer than 2 records (it may have 3). */


DATA DATA_CST;

     SET _CST1;

     RETAIN _CST 1 _SEQ;

     BY COUNTRY STRA_ID_CORE ;

     P_OSU_S1_CORE=LAG(OSU_S1_CORE);

     IF FIRST.STRA_ID_CORE  THEN DO;

          _SEQ=1;

          P_OSU_S1_CORE=.;

          _CST=1;

     END;

     ELSE _SEQ+1;

     IF OSU_S1_CORE>P_OSU_S1_CORE AND _SEQ>2 AND NOT(LAST.STRA_ID_CORE) THEN DO;

          _CST+1;

          _SEQ=1;

     END;

     DROP _SEQ P_OSU_S1_CORE;

RUN;


** Join computational strata (_CST) back to micro-data;


PROC SQL;

     CREATE TABLE MICRO AS

     SELECT A.* , B._CST

     FROM MICRO A

     LEFT JOIN DATA_CST B ON ((A.STRA_ID_CORE= B.STRA_ID_CORE) AND (A.PSU_CORE = B.PSU_CORE) AND (A.COUNTRY=B.COUNTRY));


     *** Calculate sampling rate (_RATE_);

    

     CREATE TABLE MICRO AS SELECT *,

     MIN(SUM(IWEIGHT)/SUM(WEIGHT),1) AS _RATE_

     FROM MICRO

     GROUP BY COUNTRY, STRA_ID_CORE,_CST;

QUIT;


 ODS EXCLUDE ALL;

 PROC SURVEYFREQ DATA=MICRO RATE=MICRO ;

      BY COUNTRY;

      STRATA STRA_ID_CORE _CST;

      WEIGHT WEIGHT;

      Cluster PSU_CORE;

     ** Variables that are used as Domains (_ONEC for the whole country) and REGIONS must be stated first if we use RowPercent ;

      TABLE _ONEC *(OLDER_MAN YOUNG_MAN) REGIONS*(OLDER_MAN YOUNG_MAN)/ ROW;

      ods output CrossTabs=VAR_ST1;

 RUN;

 ODS EXCLUDE NONE;


** The appropriate results are taken from the PROC SURVEYFREQ output;


 data VAR_ST(keep=Country REGIONS SHARE_DOMAIN Category stderr);

      set VAR_ST1;

      Format SHARE_DOMAIN $32. Category $250. stderr 16.8;

      array _MV OLDER_MAN YOUNG_MAN;

      array _DIM _onec REGIONS;

      do i =1 to dim(_MV);

           if _MV(i) ne '' AND CMISS(of _DIM[*]) NE DIM(_DIM) and RowStdErr>0 then do;

                SHARE_DOMAIN=vname(_MV(i));

                Category=_MV(i);

                stderr=RowStdErr;

               Output;

          end;

      end;

 run;


Variance estimation in IFS and FSS for European aggregates 

For shares, Eurostat decided to use the same formula as for proportions. This decision was taken to simplify calculations. Eurostat performed an analysis which showed that there are no significant differences between the standard errors obtained using the formula for proportions and the standard errors obtained using the formula for ratios (see section "Ratios", sub-section  "Variance estimation in IFS and FSS for European aggregates").

Where the formula uses:

  • the VARPROP at country level (REGION=”COUNTRY”), for the same combination of dimension values as the EU aggregate in question. For example, if the line of the EU aggregate has dimensions FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”, then the formula includes the 27 VARPROPs corresponding to the lines where REGION=”COUNTRY”, FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”.

  • the TOTAL_WGT at country level (REGION=”COUNTRY”), for the same combination of dimension values as the EU aggregate in question. For example, if the line of the EU aggregate has dimensions FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”, then the formula includes the 27 TOTAL_WGTs corresponding to the lines where REGION=”COUNTRY”, FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”.


RATIOS



Variance estimation in IFS

In Integrated Farm Statistics the dataset fields offer the possibility to record information allowing for basically all types of sampling designs, including those more complex than one-stage stratified random sampling.

Eurostat uses the ultimate cluster approximation approach to estimate variance. The formulae apply to all sampling designs, and they also apply to censuses. In the case of a census, the weights are equal to 1 or slightly different from 1 (as a consequence of non-response adjustment and calibration).

Eurostat accounts for systematic sampling by calculating "computational strata" based on the values in the OSU_S1_* fields, where * stands for CORE, LAFO, RDEV, etc. This is done in the SAS syntaxes, before the formula is applied through the proc surveymeans procedure. To reduce the computation time:

  • only for the countries with field OSU_S1_* different from null, the SAS programme contains the whole set of SAS syntaxes presented below, including the computation and use of the field _CST (computational stratum) in the proc surveymeans procedure. 
  • for the rest of the countries (with field OSU_S1_* null), the SAS programme is reduced, and does not include the computation and use of the field _CST (computational stratum) in the proc surveymeans procedure. 

Also in order to reduce the computation time to a reasonable level:

  • where the number of strata is very high, Eurostat exceptionally computes the variance by regions and not by country. For 2020, this was the case for EL. This assumes that strata are nested into regions. 
  • where the data do not have the PSU_* field completed, Eurostat removes the option "Cluster" from the proc surveymeans procedure. This was the case for 2020 for all countries.

ULTIMATE CLUSTER APPROXIMATION with stratification at the first stage

Variance and coefficient of variation of the ratio in domain d (general case, where weights may differ within a stratum as a consequence of non-response adjustment and calibration)

 (SAS manual, surveymeans, pages 9284-9285)

SAS syntaxes

Variance estimation at country level

The following example of SAS code estimates the standard error and the coefficient of variation of a ratio in one country, Latvia (LV). In this example the ratio is the livestock density index (LSI), defined as LSI=LSU/UAA, where LSU is the total number of livestock units and UAA is the total utilised agricultural area in the respective domain. Ratios are calculated for the whole country and for each NUTS2 region (REGIONS). Since both variables are CORE variables, PROC SURVEYMEANS uses the extrapolation factor of the CORE as weight.

**********************************************************************;

* Determine the extrapolation factor and the coverage of the table;

**********************************************************************;

DATA MICRO;

     SET IFSORA.IFS_T_MAIN_2020(KEEP=YEAR COUNTRY HLD_FEF HLD_ID STRA_ID_CORE STRA_IDF_CORE EXT_CORE PSU_CORE OSU_S1_CORE A6710R A0030 A2010 A2120 A2220 A2130 A2230 A2300F A2300G A4100 A4200 A3110 A3120 A3130 A5140 A5110O A5230 A5210 A5220 A5410 A5240_5300 A5000X5100 A6111 NUTS2 UAA);

     ** EXT_CORE is a derived variable calculated in Eurostat as the product of EXTPOL_FACT1_CORE, EXTPOL_FACT2_CORE, EXTPOL_FACT3_CORE, where null values of EXTPOL_FACT2_CORE and EXTPOL_FACT3_CORE are replaced by 1 for the calculation of the product;

     ** Keep only data for LV;

     WHERE COUNTRY IN ('LV');

     FORMAT LSU_DER 16.4 OSU_S1_CORE 16. PSU_CORE 16. REGIONS $32. STRA_ID_CORE 16. WEIGHT 16.4 ;

     ***  Null values of PSU_CORE, STRA_ID_CORE and OSU_S1_CORE must be replaced with relevant values in order to avoid exclusion in SURVEYMEANS;

     IF PSU_CORE = . THEN PSU_CORE = HLD_ID ;

     IF STRA_IDF_CORE ='_Z' THEN STRA_ID_CORE= 1 ;

     IF OSU_S1_CORE= . THEN OSU_S1_CORE = 1 ;

     REGIONS = NUTS2 ;

     _ONE=1;

     ** Derivation of LSU;

     LSU_DER = SUM((A2010*0.4), (A2120*0.7), (A2220*0.7), (A2130*1), (A2230*0.8), (A2300F*1), (A2300G*0.8), (A4100*0.1), (A4200*0.1), (A3110*0.027), (A3120*0.5), (A3130*0.3), (A5140*0.007), (A5110O*0.014), (A5000X5100*0.03), (A6111*0.02)) ;

     ** Determine the right extrapolation factor;

     WEIGHT = EXT_CORE ;

     ** Variable, which is used to determine sampling rate;

     IF WEIGHT>0 THEN IWEIGHT=1;

     ELSE IWEIGHT=0;

     IF LSU_DER>0 AND UAA>0 AND WEIGHT>0;

RUN;


*****************************************;

*PROC SURVEYMEANS;

******************************************;


** Construction of computational strata, depending on OSU_S1_ and PSU_ ;


PROC SQL;

     CREATE TABLE _CST1 AS SELECT COUNTRY, STRA_ID_CORE ,PSU_CORE, MIN(OSU_S1_CORE) AS OSU_S1_CORE

     FROM MICRO

     GROUP BY COUNTRY, STRA_ID_CORE ,PSU_CORE;

QUIT;


PROC SORT DATA=_CST1;

     BY COUNTRY STRA_ID_CORE OSU_S1_CORE;

RUN;


/* Within each formal stratum STRA_ID_:

The first record of each formal stratum receives _SEQ=1; each following record of the same stratum increments _SEQ by 1.

If OSU_S1 (the rank of systematic sampling) is increasing (so systematic sampling is used) and the record has _SEQ>2 (so it is neither the first nor the second record, which together form the first computational stratum), then the formal stratum STRA_ID_ is split, i.e. a new computational stratum _CST is created within it and _SEQ restarts at 1. The process repeats (DO - END), so each computational stratum has 2 records, except possibly the last one.

No split is made before the last record of the formal stratum, so that the last computational stratum never has fewer than 2 records (it may have 3). */



DATA DATA_CST;

     SET _CST1;

     RETAIN _CST 1 _SEQ;

     BY COUNTRY STRA_ID_CORE ;

     P_OSU_S1_CORE=LAG(OSU_S1_CORE);

     IF FIRST.STRA_ID_CORE  THEN DO;

          _SEQ=1;

          P_OSU_S1_CORE=.;

          _CST=1;

     END;

     ELSE _SEQ+1;

     IF OSU_S1_CORE>P_OSU_S1_CORE AND _SEQ>2 AND NOT(LAST.STRA_ID_CORE) THEN DO;

          _CST+1;

          _SEQ=1;

     END;

     DROP _SEQ P_OSU_S1_CORE;

RUN;


** Join computational strata (_CST) back to micro-data;


PROC SQL;

     CREATE TABLE MICRO AS

     SELECT A.* , B._CST

     FROM MICRO A

     LEFT JOIN DATA_CST B ON ((A.STRA_ID_CORE= B.STRA_ID_CORE) AND (A.PSU_CORE = B.PSU_CORE) AND (A.COUNTRY=B.COUNTRY));


     *** Calculate sampling rate (_RATE_);

    

     CREATE TABLE MICRO AS SELECT *,

     MIN(SUM(IWEIGHT)/SUM(WEIGHT),1) AS _RATE_

     FROM MICRO

     GROUP BY COUNTRY, STRA_ID_CORE,_CST;

QUIT;

** SURVEYMEANS procedure;

** Variance is calculated for ratio as well for sums in numerator and denominator ;

** This is needed for calculation of covariance, which is later needed for calculation of variance on EU level;


ODS EXCLUDE ALL;


PROC SURVEYMEANS DATA=MICRO RATE=MICRO STDERR SUMWGT SUM RATIO VARSUM;

     WHERE WEIGHT>0;

     BY COUNTRY;

     STRATA STRA_ID_CORE _CST;

     WEIGHT WEIGHT;

     Cluster PSU_CORE;

     VAR UAA LSU_DER;

     RATIO LSU_DER/UAA;

     DOMAIN _ONE REGIONS;

     ODS OUTPUT DOMAINRATIO=VARRAT(RENAME=(NUMERATORNAME=LS_DENSITY DENOMINATORNAME=LS_DENSITY_DEN) DROP=NUMERATORLABEL  DENOMINATORLABEL);

     ODS OUTPUT DOMAIN=VAR_NUM( WHERE =(UPCASE(COMPRESS(LS_DENSITY))=UPCASE(COMPRESS("LSU_DER"))) DROP=VARLABEL RENAME=(VARNAME=LS_DENSITY VARSUM=VARSUM_NUM SUM=SUM_NUM));

     ODS OUTPUT DOMAIN=VAR_DEN( WHERE =(UPCASE(COMPRESS(LS_DENSITY))=UPCASE(COMPRESS("UAA"))) DROP=VARLABEL RENAME=(VARNAME=LS_DENSITY VARSUM=VARSUM_DEN SUM=SUM_DEN));

RUN;


ODS EXCLUDE NONE;


*** Put all 3 outputs together;


PROC SORT DATA=VARRAT;

     BY COUNTRY REGIONS;

RUN;


PROC SORT DATA=VAR_NUM;

     BY COUNTRY REGIONS;

RUN;


PROC SORT DATA=VAR_DEN;

     BY COUNTRY REGIONS;

RUN;


DATA VAR_ST;

     MERGE VARRAT(IN=A) VAR_NUM(KEEP=COUNTRY REGIONS VARSUM_NUM) VAR_DEN(KEEP=COUNTRY REGIONS VARSUM_DEN SUM_DEN);

     BY COUNTRY REGIONS ;

     IF A;

     VARRATIO=STDERR**2;

     ** Coefficient of variation = standard error divided by the estimate;

     CVRATIO=STDERR/RATIO;

     COVAR=(VARSUM_NUM+RATIO**2*VARSUM_DEN-SUM_DEN**2*STDERR**2)/(2*RATIO);

RUN;
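
The COVAR expression in the DATA step above follows from the first-order (Taylor) approximation of the variance of a ratio R = Y/X, namely Var(R) ≈ (Var(Y) + R² Var(X) - 2 R Cov(Y, X)) / X², solved for the covariance. The Python sketch below checks this algebra on made-up numbers; the function names and the input values are illustrative assumptions and are not part of the SAS programme.

```python
import math


def ratio_variance(var_num, var_den, cov, sum_den, ratio):
    """First-order (Taylor) approximation of the variance of ratio = num / den."""
    return (var_num + ratio ** 2 * var_den - 2 * ratio * cov) / sum_den ** 2


def covariance_from_ratio(var_num, var_den, sum_den, ratio, stderr_ratio):
    """Same identity solved for the covariance, mirroring the COVAR expression."""
    return (var_num + ratio ** 2 * var_den
            - sum_den ** 2 * stderr_ratio ** 2) / (2 * ratio)


# Made-up inputs: variances of the two totals, their covariance,
# the denominator total and the ratio itself.
var_num, var_den, cov = 400.0, 900.0, 120.0
sum_den, ratio = 1000.0, 0.5

stderr_ratio = math.sqrt(ratio_variance(var_num, var_den, cov, sum_den, ratio))
cv_ratio = stderr_ratio / ratio  # coefficient of variation of the ratio
recovered_cov = covariance_from_ratio(var_num, var_den, sum_den, ratio, stderr_ratio)
```

Under these assumptions recovered_cov returns the covariance that was fed into ratio_variance, which is the round trip the SAS expression relies on.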

Variance estimation in IFS and FSS for European aggregates 


Where the formula uses:

  • the Variance of the Ratio at country level (REGION=”COUNTRY”), for the same combination of dimension values as the EU aggregate in question. For example, if the line of the EU aggregate has dimensions FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”, then the formula includes the 27 VARRATIOs corresponding to the lines where REGION=”COUNTRY”, FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”.

  • the Variance of the numerator of the Ratio at country level (REGION=”COUNTRY”), for the same combination of dimension values as the EU aggregate in question. For example, if the line of the EU aggregate has dimensions FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”, then the formula includes the 27 VARSUMs corresponding to the lines where REGION=”COUNTRY”, FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”.

  • the Variance of the denominator of the Ratio at country level (REGION=”COUNTRY”), for the same combination of dimension values as the EU aggregate in question. For example, if the line of the EU aggregate has dimensions FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”, then the formula includes the 27 VARSUMs corresponding to the lines where REGION=”COUNTRY”, FARMTYPE=”FT23_SO”, SO_EUR=”KE15-24”, AGRAREA=”HA5-9”.

Considerations on sampling design and extrapolation factors

The estimation of variance takes into account the sampling design information and the extrapolation factor fields.

The SAS procedures require indicating the parameters related to the sampling design and extrapolation factors.

The SAS procedures also require indicating the population domain for which variance is estimated. A domain is a subgroup of the whole population for which specific estimates are needed. A domain may consist of a geographical area, such as a NUTS2 region, or a specified population breakdown, such as the agricultural holdings whose managers are between 25 and 34 years old. According to the handbook on precision requirements and variance estimation for ESS household surveys, domains are of two types:
  • planned domains;
  • unplanned domains.

Planned domains

The planned domains are separate strata from which independent samples are taken. Stratification ensures a satisfactory level of representativeness of the planned domain in the final sample. For example, a planned domain is a NUTS2 region, when the sample is stratified by NUTS2 regions.

Unplanned domains

The unplanned domains are not separate strata of the sampling design. The statistician cannot control the size of the sample falling in an unplanned domain, which is needed to ensure a certain level of precision. For example, an unplanned domain consists of all agricultural holdings whose managers are between 25 and 34 years old, when the sample is not stratified by age of managers of agricultural holdings.

The precision of estimates over unplanned domains can be improved by post-stratification. However, bias can be introduced at the same time.

The random size of the sample in an unplanned domain creates an additional component of variability into the domain estimates.

The SAS procedures take into consideration both the strata and the domains. In case they coincide, then the same parameter is indicated for strata and for domain.

Systematic sampling with implicit stratification

Some countries use systematic sampling within (formal) strata.

Systematic sampling (as opposed to random sampling) entails implicit stratification when holdings are ordered within strata by a variable correlated with the main variables of interest. If implicit stratification is disregarded in the variance estimation, then the variance is overestimated.

There is no unbiased variance estimator in the case of systematic sampling. There are several possible options to use variance estimators that are (under certain assumptions) close to being unbiased.

The option Eurostat uses is to explicitly define small strata, usually called 'computational strata', in the dataset. Eurostat creates a new variable called 'computational stratum' in the dataset on the basis of the information provided by countries in the field OSU_S1_*. For example, suppose that a country stratifies by size class (using the standard output) and, within each size class, uses systematic sampling with the holdings ordered by UAA. Eurostat then defines the 'computational' strata by pairing holdings that are adjacent in UAA order within each original stratum, ensuring that each 'computational' stratum has at least two holdings. The relation between original strata and 'computational' strata is one to many. The variance is therefore calculated using the information from the 'computational' strata, or from the updated strata field.
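
The pairing rule can be sketched in Python as follows. This is a hedged re-expression of the DATA step logic (_SEQ, _CST) shown in the SAS syntaxes on this page, assuming the records are already sorted by formal stratum and then by systematic-sampling rank; the function name and the tuple layout are hypothetical.

```python
def assign_computational_strata(records):
    """Return the computational stratum number (_CST) for each record.

    records: list of (formal_stratum, osu_rank) tuples, sorted by
             formal_stratum and then by osu_rank (hypothetical layout).
    """
    result = []
    cst = 1
    seq = 0
    for i, (stratum, rank) in enumerate(records):
        first = i == 0 or records[i - 1][0] != stratum
        last = i == len(records) - 1 or records[i + 1][0] != stratum
        if first:
            # New formal stratum: restart the counters.
            seq, cst, prev_rank = 1, 1, None
        else:
            seq += 1
            prev_rank = records[i - 1][1]
        # Split only when the rank increases, two records are already in the
        # current computational stratum, and this is not the last record.
        if prev_rank is not None and rank > prev_rank and seq > 2 and not last:
            cst += 1
            seq = 1
        result.append(cst)
    return result


# Example: a formal stratum of five holdings selected systematically.
example = assign_computational_strata([("S1", r) for r in range(1, 6)])
```

With five records in a formal stratum, the last computational stratum keeps three records, because no split is made before the final record. When all ranks are equal (for example, when a missing OSU_S1_* was replaced by 1), no split occurs and the formal stratum is used as is.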

Updated allocation of holdings to strata

Some holdings change attributes related to stratification between the sampling design and the reference period.

The holdings changing strata (as well as the non-respondents) can significantly distort the population structure and consequently lead to bias in the final results. The most common way to deal with a change of strata is to apply some form of calibration procedure, such as post-stratification. This procedure takes the original extrapolation factors (according to the original strata) and adjusts them in such a way that the population structure is reflected properly. There is one exception to this rule, which is frequently used in practice. When a take-all stratum is used in the sample selection, holdings that were not included in the take-all stratum but are found after data collection to be large are usually treated as self-representative holdings for weighting purposes, meaning that their extrapolation factor is set to 1.

The following approach is proposed:

  • determine the sampling extrapolation factors (sampling weights);
  • detect the large holdings that were sampled in small strata and re-allocate them into take-all strata, assigning them extrapolation factor 1;
  • adjust the extrapolation factors for the unit non-response;
  • calibrate the extrapolation factors by taking into account the population structure;

  • for variance estimation, consider the final extrapolation factors of the responding holdings and the initial sampling design with the original strata (or, where relevant, the computational strata), except that holdings reallocated to the take-all stratum should be considered in their new (take-all) stratum.

Ineligible holdings

The ineligible holdings are not part of the relevant (target) population and their values are disregarded in the point estimation. Consistently, the values of ineligible holdings should be disregarded in variance estimation, too.

Eligible holdings with value 0 for a target variable

When estimating variance for a target variable (e.g. flowers and ornamental plants), the 0 values recorded for that variable for holdings which belong to the relevant (target) population should be considered. These 0 values are observed values (e.g. those holdings do not cultivate flowers and ornamental plants).

Eligible holdings with outlier values

If outlier values are taken into account in the estimation of the data, they should also be taken into account when estimating the variance. Sometimes the extrapolation factor of the outlier is set to 1 (the outlier becomes a self-representative holding) or is trimmed. In this case too, the outlier value should be taken into account, with the adjusted extrapolation factor. Only when the outlier value is excluded from the estimation is it reasonable to disregard it when estimating variance.

Imputation

The imputation procedure creates additional variability which should be considered when estimating variance. When imputed values are treated as observed values, the variance of the estimator is underestimated.

Variance estimation in the presence of imputation is widely discussed and theoretically well elaborated, and several procedures have been developed for it, but they are rarely used in the practice of official statistics. The main reason is probably that these approaches are theoretically demanding and not easy to implement in a general way. Moreover, there is no standard software for the purpose. Generally speaking, the following options can be used:

  1. Treat imputed values as reported values. This approach is acceptable when the rate of imputed values is small (e.g. 1-2%), but can lead to a serious underestimation of the variance in the case of high imputation rate.
  2. Employ one of the procedures described in the literature and correctly estimate variance due to imputations.
  3. Disregard the imputed values and consider only the observed values in the variance estimation. This approach would (in most cases) lead to overestimation of the variance.

Options 2 and 3 require transmitting an additional field for each variable to mark the imputed values, and some countries do not keep track of the imputed values. Taking this into account, together with the effort required to implement variance estimation in the presence of imputation, the 2017 Working Group concluded that the problem should be analysed further.

Calibration

Calibration variability should be included in the estimated variance.

Correct procedures for variance estimation when calibration is used are quite complex and not easy to implement in practice. In order to estimate the variance correctly, Eurostat needs either the residuals of the regression of the variable of interest on the calibration variables or the sample values of the calibration variables (so as to calculate the residuals). This would be needed for each variable of interest and would therefore mean a significant additional burden.

It is proposed that Eurostat takes a pragmatic approach which uses the calibrated extrapolation factors and the initial sampling design, but not the residuals. Across statistical domains, many countries implement this approach. Eurostat commissioned a simulation study, which concludes that the bias introduced into the variance estimation by this approach is not significant. The study is based on Monte Carlo repeated-sample simulations which consider different kinds of sample non-response patterns (given that calibration is meant to reduce the bias and variance caused by non-response). It compares the Monte Carlo variance (the "true" variance) with the variance of the Taylor linearization estimator of the calibrated total not using the residuals ("the classical approach") and with the variance of the Taylor linearization estimator of the calibrated total using the residuals. Although there are cases where the classical approach overestimates the true variance much more than the residuals-based estimator, there are also cases where the opposite holds, and the study cannot conclude that the residuals approach generally works better. It is also worth noting that in the case of larger variance, which matters for detecting non-reliable results, both estimators give similar results. Therefore, the classical approach can be used with no major reservations. Countries are nevertheless encouraged to use estimators that incorporate the effect of the residuals.


Quality rating system

This section presents the quality rating system which was adopted in the 2016 Working Group. The quality rating system determines which estimates should be:

  • disseminated without warning,

  • disseminated with warning,

  • suppressed,

for all types of population breakdowns (geographical or not).


The quality rating system is two-fold:

  1. for total of continuous variables, the system is based on the values of coefficients of variation:
  • below 24.99% - estimates are released
  • 25.0% - 34.99% - estimates are released with warning, to be used with caution
  • 35.0% and more - too unreliable, estimates should not be released

Where totals are zero in tables, the corresponding coefficients of variation cannot be computed (because they would require dividing by zero) and are set to null (.). The quality rating system classifies these cases as "below 24.99%", which means the zero totals are published.

  2. for proportions, the system is based on the values of standard errors:
  • below 12.49 percentage points - estimates are released
  • 12.5 percentage points - 17.49 percentage points - estimates are released with warning, to be used with caution
  • 17.5 percentage points and more – too unreliable, estimates should not be released

Where proportions are zero in tables, the corresponding calculated standard errors are zero. The quality rating system classifies these cases as "below 12.49 percentage points" which means the zero counts (corresponding to the zero proportions) are published.

Counts are first converted to proportions, standard errors are estimated for the corresponding proportions, and the quality rating system for proportions is then applied. The quality rating system is consistent with the one used by Statistics Canada for the Farm Management Survey.
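
The thresholds above can be expressed as a small classification routine. The Python sketch below is illustrative only: the function names and the returned labels are hypothetical, and it assumes the class boundaries are contiguous (for example, a coefficient of variation of 24.995% is treated as below 25.0%).

```python
RELEASE = "release"
WARNING = "release with warning"
SUPPRESS = "suppress"


def rate_total(cv_percent):
    """Rate a total of a continuous variable by its coefficient of variation (%).

    A null CV (None), as obtained for zero totals, is rated as releasable.
    """
    if cv_percent is None or cv_percent < 25.0:
        return RELEASE
    if cv_percent < 35.0:
        return WARNING
    return SUPPRESS


def rate_proportion(se_points):
    """Rate a proportion by its standard error, in percentage points."""
    if se_points < 12.5:
        return RELEASE
    if se_points < 17.5:
        return WARNING
    return SUPPRESS


def count_to_proportion(count, total_holdings):
    """Convert a count to a proportion (in %) of all holdings in the country."""
    return 100.0 * count / total_holdings


# Example: a total with CV 28.3% and a count whose proportion has SE 3.1 points.
example_ratings = (rate_total(28.3), rate_proportion(3.1))
```

A zero proportion has a standard error of zero and is therefore rated as releasable, in line with the rule for zero counts.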