Regulation (EU) 2018/1091 states that "the Commission is to respect the confidentiality of the data transmitted in line with Regulation (EC) No 223/2009 of the European Parliament and of the Council. The necessary protection of confidentiality of data should be ensured, among other means, by limiting the use of the location parameters to spatial analysis of information and by appropriate aggregation when publishing statistics. For that reason a harmonised approach for the protection of confidentiality and quality aspects for data dissemination should be developed, while making efforts to render online access to official statistics easy and user-friendly".

Article 3 of Regulation (EC) No 223/2009 defines 'confidential data' as data which allow a statistical unit (i.e. the person, company or organisation to which the data refer) to be identified, either directly or indirectly, thereby disclosing individual information. To determine whether a statistical unit is identifiable, account shall be taken of all relevant means that might reasonably be used by a third party to identify the statistical unit.

The risk of a statistical unit being identified is the only factor that qualifies data as confidential. It does not matter which information is disclosed or whether that information is sensitive. In this light, one cannot argue that some variables (e.g. crops, livestock) are less sensitive than others (e.g. labour force).

GDPR


On 8 February 2018, the Directors-General and Presidents of the National Statistical Institutes (NSIs) and of the European Union's statistical authority (Eurostat) met at an informal workshop on the implications of the GDPR for European statistics and issued the following conclusions. The participants:

  1. acknowledged the high relevance of the GDPR implementation for the production of high quality official statistics and for maintaining the confidence of the respondents providing personal data for statistical purposes;
  2. recognised that in almost all Member States procedures have been initiated to enact derogations from the data subjects' rights referred to in some or all of the following Articles of the GDPR: 15 (access), 16 (rectification), 18 (restriction) and 21 (objection);
  3. agreed that the same derogations should apply across all statistical domains and should not be domain-specific;
  4. acknowledged that the NSIs and other statistical authorities (ONAs) are responsible for the protection of all personal data they process, both those collected in the framework of an EU regulation and those collected for purely national interests;
  5. noted that appropriate derogations in national law, when granted, could in most cases be sufficient to effectively address the potential ramifications of the GDPR and the specific needs of statistical production in each Member State;
  6. agreed that, in the interest of harmonising the protection of the data subjects' rights in the field of official statistics, additional uniform derogations at EU level, notably in Regulation (EC) No 223/2009, could be useful and should be considered once enough experience in the application of the GDPR has been collected; in this respect, discussion at expert level should be organised at a later stage;
  7. agreed to share experience and best practice in addressing the implications of the GDPR for official statistics at the national level; to this end, a collaborative platform will be created by Eurostat to store and share examples of national provisions and justifications for derogations;
  8. emphasised the need to establish constructive dialogue with data protection authorities at national and European level in order to clarify the specificities of statistical production, including a better understanding of statistical methodology and existing safeguards.

Data storage and dissemination

To be developed

Confidentiality measures for microdata

See 9.5.3 - Scientific use files

Confidentiality measures for location

In the presence of geospatial data, disclosure control experts face a paradox. On the one hand, such data need more protection because they make identification easier; on the other hand, they offer many possibilities for analysis, which users do not want to see distorted too much by data suppression. Disclosure risk is higher for geospatial data:

  • firstly, because belonging to a geographical area may give information to the intruder about some attributes (e.g. 100 percent of inhabitants of a square are unemployed). This is called categorisation risk, and it increases in the case of spatial data because of Tobler's "first law of geography" which states that "everything interacts with everything, but two close objects are more likely to do so than two distant objects";
  • secondly, because of so-called identification risk. Among the characteristics one may share with someone, a common geographic area leads to a higher probability of identifying the person (one probably knows a neighbour better than someone with whom one shares any other characteristic). Moreover, identification of addresses has recently become possible with the development of open access tools like Google Street View. As a result, population density is a fundamental predictor of disclosure risk: the lower the density, the higher the disclosure risk. That is why confidentiality thresholds can differ between countries;
  • finally, disclosure risk can increase with the geographic differencing issue, when data is disseminated at different levels (hierarchical or not).

Technically, the dissemination classification (zoning, administrative boundaries, or regular tessellations such as grid squares) is a categorical variable like any other (an additional dimension of tabular data). It is therefore possible to deal with disclosure risk with no geographical consideration. Nevertheless, a geographically intelligent management of disclosure issues will preserve the underlying spatial phenomenon. A risk-utility compromise has to be made, using relevant distortion indicators (EFGS & Eurostat, 2017).

The risk of identifying holdings by crossing census gridded data with the proposed scientific use files (see 9.5.3 - Scientific use files) is close to zero. This is due to the following facts:

  • Gridded information does not include the number of holdings explaining each characteristic. Only the aggregated number of holdings is provided.
  • Gridded information is tabular information; it represents more than one holding and is therefore treated with the standard method for disclosure control, as any other tabulation, using the following algorithm:
    • if the value of the cell is explained by 4 or fewer holdings, or if more than 85% of the value is explained by 1 or 2 holdings, then the information is not disclosed;
    • the minimum value that a user will observe is 10 holdings;
    • the minimum observed total data in any variable that is not disclosed in the grids due to disclosure control represents 10% of the total of the variable in the EU (a strict disclosure control algorithm is used).
  • The method used for locating the holding has a high uncertainty. It is not guaranteed that the holding is actually located in the grid cell where it is shown, because:
    • coordinates are rounded to a 10 km INSPIRE grid;
    • holdings are represented as points although they are actually polygons; a big farm may extend over more than one grid cell, but only one X,Y coordinate pair represents the holding.
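The cell-level rule and the coordinate rounding described above can be illustrated with a minimal Python sketch. It is not Eurostat's production code: the function names are hypothetical, contributions are assumed to be non-negative, coordinates are assumed to be projected and expressed in metres, and each cell is assumed to be identified by its lower-left corner.

```python
import math
from typing import Sequence, Tuple

MIN_HOLDINGS = 4        # a cell explained by 4 or fewer holdings is not disclosed
DOMINANCE_SHARE = 0.85  # ... nor a cell where 1 or 2 holdings explain more than 85%


def is_cell_suppressed(contributions: Sequence[float]) -> bool:
    """Return True if the cell must not be disclosed.

    `contributions` holds the (non-negative) value each holding
    contributes to the cell; this input layout is an assumption.
    """
    if len(contributions) <= MIN_HOLDINGS:
        return True
    total = sum(contributions)
    if total <= 0:
        return False
    # With non-negative values, "1 or 2 holdings explain more than 85%"
    # reduces to checking the two largest contributors together.
    top_two = sum(sorted(contributions, reverse=True)[:2])
    return top_two > DOMINANCE_SHARE * total


def snap_to_grid(x: float, y: float, cell_size: int = 10_000) -> Tuple[int, int]:
    """Round a point down to the lower-left corner of its grid cell (metres)."""
    return (math.floor(x / cell_size) * cell_size,
            math.floor(y / cell_size) * cell_size)
```

For example, a cell with five equal contributors passes both tests, while a cell in which one holding accounts for 90% of the value is suppressed regardless of the number of contributors.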

Figure 65 – Agricultural holding density (number of farms per square Km of UAA)


To protect confidentiality in the case of very large holdings, where a grid cell may contain only one farm, the position of the farm can be allocated to the nearest neighbouring cell that contains at least one other holding. If none of the 8 neighbouring cells (chosen in random order) contains at least one other holding, the search area has to be extended until a suitable grid cell is found. As far as possible, the chosen cell should lie within the same NUTS3 region as the original cell. A cell is considered to belong to a NUTS3 region if its lower-left coordinate is inside the polygon that defines the NUTS3 region at the 1:100.000 scale.

Figure 66 – If only one farm at a location, assign it to a random neighbouring cell within the same NUTS3; if still not possible, enlarge the area. 
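The relocation procedure can be sketched as a ring search around the original cell. This is only an illustration under stated assumptions: cells are indexed by integer (column, row) pairs, `counts` and `nuts3` are hypothetical lookup tables (holdings per cell, and the NUTS3 code assigned to each cell by the lower-left corner test), the same-NUTS3 condition is treated as mandatory, and `max_radius` is an arbitrary stopping limit that the text does not specify.

```python
import random
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]  # (column, row) index of a grid cell


def ring(cell: Cell, radius: int) -> List[Cell]:
    """Cells at exactly `radius` steps (Chebyshev distance) from `cell`."""
    cx, cy = cell
    return [(cx + dx, cy + dy)
            for dx in range(-radius, radius + 1)
            for dy in range(-radius, radius + 1)
            if max(abs(dx), abs(dy)) == radius]


def relocate_single_holding(cell: Cell,
                            counts: Dict[Cell, int],
                            nuts3: Dict[Cell, str],
                            max_radius: int = 5,
                            rng: Optional[random.Random] = None) -> Optional[Cell]:
    """Pick a random nearby cell that already holds at least one holding.

    Radius 1 covers the 8 direct neighbours; the search area is enlarged
    ring by ring. Returns None if no candidate is found within max_radius.
    """
    rng = rng or random.Random()
    for radius in range(1, max_radius + 1):
        candidates = [c for c in ring(cell, radius)
                      if counts.get(c, 0) >= 1
                      and nuts3.get(c) == nuts3.get(cell)]
        if candidates:
            return rng.choice(candidates)
    return None
```

A looser variant could fall back to cells in a different NUTS3 region when the search is exhausted; whether that is acceptable depends on how strictly the "as far as possible" condition is read.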

Multi-resolution grids

Multi-resolution grids are represented by a hierarchical structure through two associations. Each StatisticalGrid instance can be associated with a lower and/or an upper resolution grid through the Hierarchical relation association. A StatisticalGridCell belonging to a given StatisticalGrid is composed of the overlapping cells of its grid's lower resolution grid, and composes the cell it overlaps in its grid's higher resolution grid. Lower and upper StatisticalGridCells are associated through the Hierarchical composition.

Figure 71 – INSPIRE Grid

Source: https://inspire.ec.europa.eu/id/document/tg/su 
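The composition between cells of two nested grids can be illustrated with simple integer arithmetic. The sketch below does not implement the INSPIRE data model; it assumes that cells are identified by their lower-left corner in metres, that the finer cell size divides the coarser one exactly, and the function names are hypothetical.

```python
from typing import List, Tuple

Corner = Tuple[int, int]  # lower-left corner of a cell, in metres


def containing_cell(corner: Corner, coarse_size: int) -> Corner:
    """Lower-left corner of the coarser cell that contains the given cell."""
    x, y = corner
    return (x // coarse_size * coarse_size, y // coarse_size * coarse_size)


def component_cells(corner: Corner, coarse_size: int, fine_size: int) -> List[Corner]:
    """Lower-left corners of the finer cells that make up one coarser cell."""
    if coarse_size % fine_size != 0:
        raise ValueError("fine_size must divide coarse_size exactly")
    x0, y0 = corner
    steps = range(0, coarse_size, fine_size)
    return [(x0 + dx, y0 + dy) for dx in steps for dy in steps]
```

With these assumptions, one 10 km cell is composed of 100 cells of 1 km, and each of those maps back to the same 10 km cell.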

Confidentiality for tabular data

Eurostat disseminates a large number of statistical tables on its website and through specific user requests. All these tabular data are treated for primary confidentiality. Primary confidentiality concerns tabular cell data whose dissemination would permit attribute disclosure. The two main reasons for declaring data to be primary confidential are too few contributors in a cell and dominance of the first n largest contributors in a cell.


In the tables disseminated on the Eurostat website, a cell is confidential if:

  • the (extrapolated) number of holdings that contribute to the cell (extrapolated) value is less than or equal to a certain value and/or
  • the n largest (extrapolated) holdings represent more than a certain percentage k% of the cell (extrapolated) value.

A confidential value is replaced with ":c".

For non-confidential cells, the extrapolated number of holdings and all values of characteristics in cells are rounded to the closest multiple of 10.
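The treatment of a single table cell can be sketched as follows. The actual threshold, n and k are not stated here, so `min_holdings`, `n_largest` and `k_percent` are hypothetical parameters that the caller must supply; the handling of ties in the rounding (halves rounded up) is also an assumption.

```python
import math
from typing import Sequence, Tuple, Union

CONFIDENTIAL = ":c"  # marker that replaces a confidential value


def round_to_10(value: float) -> int:
    """Round to the closest multiple of 10 (ties rounded up, an assumption)."""
    return int(math.floor(value / 10 + 0.5)) * 10


def treat_cell(n_holdings: float,
               contributions: Sequence[float],
               min_holdings: int,
               n_largest: int,
               k_percent: float) -> Union[str, Tuple[int, int]]:
    """Return ':c' for a confidential cell, else (rounded holdings, rounded value).

    `n_holdings` is the extrapolated number of holdings and `contributions`
    the extrapolated value contributed by each holding (assumed non-negative).
    """
    total = sum(contributions)
    top = sum(sorted(contributions, reverse=True)[:n_largest])
    too_few = n_holdings <= min_holdings
    dominated = total > 0 and top > k_percent / 100 * total
    if too_few or dominated:
        return CONFIDENTIAL
    return (round_to_10(n_holdings), round_to_10(total))
```

Because each cell is rounded independently, applying `treat_cell` to the cells of a table and to its total can produce published cells that do not add up to the published total.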

Because of the confidentiality treatment, the sum of the individual cells does not always match the value of the "total" cell.

The Eurobase tables in SAS present the standard FLAG_CODE=A for the cells that are to be suppressed because of the threshold rule. In the code list on "confidentiality status" defined by the SDMX Statistical Working Group, the flag A stands for "Primary confidentiality due to small counts".

Limitations to the application of secondary confidentiality

As mentioned in Table 17, secondary confidentiality would need to be implemented for several tables at the same time. This may not be feasible, considering:

  • the numerous tables disseminated on the Eurostat website and through ad-hoc requests;
  • the need to review the whole publication programme in an integrated way: when a new table is created, the other tables need updating;
  • the need for the suppression pattern to be the same across reference periods: the value of a confidential cell in another reference period is usually a good basis for estimating, and therefore disclosing, the data.

It is to be noted that:

  • The rounding to the nearest multiple of 10 would prevent recalculation of the exact values to some extent.
  • When data are estimated from a sample, estimated values deviate from the true values, which would additionally prevent recalculation of the exact values to some extent.

In 1993, Eurostat analysed the possibility of recalculation and found that:
  • most of the derived values are not reliable; there are negative solutions and solutions outside an interval of ±50% in relation to the true value;
  • a few derivations come very close to the real value.
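The protective effect of rounding mentioned above can be illustrated with a small calculation. The sketch assumes that the true values are non-negative and additive, that every published value lies within 5 of its true value (rounding to the nearest multiple of 10), and that sampling error is ignored; the function name is hypothetical.

```python
from typing import Sequence, Tuple


def suppressed_cell_bounds(rounded_total: float,
                           rounded_others: Sequence[float],
                           half_width: float = 5.0) -> Tuple[float, float]:
    """Range in which one suppressed cell must lie, given rounded neighbours.

    The total and each published cell may each be off by up to `half_width`,
    so the uncertainty grows with the number of published cells in the row.
    """
    slack = half_width * (len(rounded_others) + 1)
    centre = rounded_total - sum(rounded_others)
    return (max(0.0, centre - slack), centre + slack)
```

For a published total of 100 and two published cells of 30 and 40, the suppressed cell can only be located between 15 and 45, which is an interval of ±50% around the naive difference of 30.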

It was concluded not to suppress data in cells for secondary confidentiality, because "the procedure involves iterations of treatment of derivation while no possibility of derivation can be realistically excluded. It also entails loss of data involving no risk of disclosure".

Besides the methods already implemented (suppression, rounding) or discussed in the above table, there are other methods, e.g. table redesign (collapsing rows or columns), controlled tabular adjustment (selectively adjusting cell values: unsafe cells are replaced by either of their closest safe values, and other cell values are adjusted to restore additivity) and perturbation (adding random noise to cell values). Each method has pros and cons. To decide on the most suitable solution, a balance has to be struck between confidentiality and reliability, i.e. the confidentiality treatment should be effective without unnecessarily jeopardising the accuracy and usability of the results.



