...

See 9.5.3 - Scientific use files

Confidentiality measures for location

With geospatial data, disclosure control experts face a paradox. On the one hand, such data need more protection because they make identification easier; on the other hand, they offer many possibilities for analysis, which users do not want distorted by excessive data suppression. Disclosure risk is higher for geospatial data:

  • firstly, because belonging to a geographical area may give the intruder information about some attributes (e.g. 100 percent of the inhabitants of a grid square are unemployed). This is called categorisation risk, and it increases for spatial data because of Tobler's "first law of geography", which states that "everything is related to everything else, but near things are more related than distant things";
  • secondly, because of so-called identification risk. Among the characteristics one can share with someone, a common geographic area leads to a higher probability of identifying the person (people probably know their neighbours better than someone with whom they share any other characteristic). Moreover, identifying addresses has recently become possible with the development of open access tools such as Google Street View. As a result, population density is a fundamental predictor of disclosure risk: the lower the density, the higher the disclosure risk. That is why confidentiality thresholds can differ between countries;
  • finally, because of the geographic differencing issue: when data are disseminated at different levels (hierarchical or not), an intruder can subtract one published total from another to derive values for small residual areas.
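The geographic differencing issue can be illustrated with a small arithmetic sketch. The zone names, the counts and the `SAFE_MIN` threshold below are invented for illustration and do not come from any real dissemination.

```python
# Hypothetical published counts of unemployed persons (illustrative numbers only).
published = {
    "municipality_A": 120,       # administrative zone
    "grid_union_inside_A": 119,  # sum of the grid cells covering almost all of A
}


def differenced_count(outer: int, inner: int) -> int:
    """Count implied for the area covered by `outer` but not by `inner`."""
    return outer - inner


residual = differenced_count(
    published["municipality_A"], published["grid_union_inside_A"]
)

# Assumed minimum count below which a value is considered unsafe in this sketch.
SAFE_MIN = 3
at_risk = 0 < residual < SAFE_MIN
```

Both published figures are safe on their own, yet their difference isolates a very small group in the sliver between the two zonings.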

Technically, the dissemination classification (zoning, administrative boundaries, or regular tessellations such as grid squares) is a categorical variable like any other (an additional dimension of tabular data). It is therefore possible to deal with disclosure risk without any geographical consideration. Nevertheless, geographically intelligent management of disclosure issues will preserve the underlying spatial phenomenon. A risk-utility compromise has to be made, using relevant distortion indicators (EFGS & Eurostat, 2017). The risk of identifying holdings by crossing census gridded data with the proposed scientific use files (see 9.5.3 - Scientific use files) is close to zero, for the following reasons:

  • Gridded information does not include the number of holdings explaining each characteristic. Only the aggregated number of holdings is provided.
  • Gridded information is tabular information; it represents more than one holding; it is therefore treated with the standard method for disclosure control, like any other tabulation, using the following algorithm:
    • if the value of the cell is explained by 4 or fewer holdings, or if more than 85% of the value is explained by 1 or 2 holdings, then the information is not disclosed;
    • the minimum value that a user will observe is 10 holdings;
    • for any variable, the data withheld from the grids because of disclosure control represent at least 10% of the EU total of that variable (a strict disclosure control algorithm is used).
  • The method used for locating the holding has a high uncertainty. It is not guaranteed that the holding is actually located in the grid cell where it is shown, because:
    • coordinates are rounded to a 10 km INSPIRE grid;
    • holdings are represented as points, although they are actually polygons; big farms are present in more than one grid cell, but only one X,Y coordinate pair represents the holding.
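The suppression rule for grid cells described above can be sketched as follows. This is a minimal illustration, not the production algorithm: the function name `is_suppressed` and its parameters are hypothetical, and only the thresholds (4 or fewer holdings, 85%, 1 or 2 holdings) come from the text.

```python
def is_suppressed(contributions, min_holdings=5, dominance_share=0.85, n_dominant=2):
    """Return True if the cell value must not be disclosed.

    contributions: non-negative contribution of each holding to the cell value.
    """
    # Frequency rule: the value is explained by 4 or fewer holdings.
    if len(contributions) < min_holdings:
        return True
    total = sum(contributions)
    if total == 0:
        # Assumption of this sketch: an all-zero cell has no dominant holding.
        return False
    # Dominance rule: 1 or 2 holdings explain more than 85% of the value.
    # With non-negative contributions, checking the 2 largest also covers
    # the case where a single holding exceeds the share.
    top = sorted(contributions, reverse=True)[:n_dominant]
    return sum(top) / total > dominance_share
```

For example, `is_suppressed([90, 5, 2, 2, 1])` is true because the two largest holdings explain 95% of the cell value, while five equal contributions pass both rules.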

...

Figure 65 – Agricultural holding density (number of farms per square km of UAA)


To protect confidentiality in the case of very large holdings, where a grid cell may contain only one farm, the farm's position can be allocated to the nearest neighbouring cell that contains at least one other holding. The 8 neighbouring cells are examined in random order; if none of them contains at least one other holding, the search is extended to more distant cells until a suitable grid cell is found. As far as possible, the chosen cell should lie within the same NUTS3 region as the original cell. A cell is considered to belong to a NUTS3 region if its lower-left coordinate is inside the polygon that defines the NUTS3 region at the 1:100.000 scale.
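The relocation procedure can be sketched as a ring search around the original cell. This is one possible reading of the rule, under stated assumptions: cells are identified by hypothetical integer (column, row) indices, `counts` and `nuts3_of` are hypothetical lookup dictionaries, and a cell in another NUTS3 region is used only as a fallback when no cell of the same region is found within `max_radius`.

```python
import random


def ring(cell, radius):
    """Cells at Chebyshev distance `radius` from `cell` (8 cells when radius is 1)."""
    cx, cy = cell
    return [
        (cx + dx, cy + dy)
        for dx in range(-radius, radius + 1)
        for dy in range(-radius, radius + 1)
        if max(abs(dx), abs(dy)) == radius
    ]


def relocate(cell, counts, nuts3_of, max_radius=10, rng=random):
    """Pick a nearby cell that already contains at least one holding.

    counts:   dict mapping a cell to its number of holdings.
    nuts3_of: dict mapping a cell to its NUTS3 code (lower-left corner rule).
    """
    region = nuts3_of.get(cell)
    fallback = None
    for radius in range(1, max_radius + 1):
        candidates = ring(cell, radius)
        rng.shuffle(candidates)  # neighbours are examined in random order
        for candidate in candidates:
            if counts.get(candidate, 0) < 1:
                continue  # empty cell: moving the farm here would not hide it
            if region is not None and nuts3_of.get(candidate) == region:
                return candidate  # nearest occupied cell in the same NUTS3 region
            if fallback is None:
                fallback = candidate  # nearest occupied cell in another region
    return fallback  # None if no occupied cell exists within max_radius
```

Passing a seeded `random.Random` instance as `rng` makes the choice reproducible, which may be useful when the relocation has to be repeated for several variables.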

...

Figure 66 – If only one farm at a location, assign it to a random neighbouring cell within the same NUTS3; if still not possible, enlarge the area. 


Multi-resolution grids

Multi-resolution grids are represented by a hierarchical structure through two associations. Each StatisticalGrid instance can be associated with a lower and/or an upper resolution grid through the Hierarchical relation association. A StatisticalGridCell belonging to a given StatisticalGrid is composed of the overlapping cells of its grid's lower resolution grid, and composes the cell it overlaps in its grid's higher resolution grid. Lower and upper StatisticalGridCells are associated through the Hierarchical composition.

Figure 71 – INSPIRE Grid

Source: https://inspire.ec.europa.eu/id/document/tg/su 
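The hierarchical composition between grid cells can be sketched with a small data structure. This is an illustration only, not the INSPIRE data model: the `GridCell` class and its `parent` and `children` methods are hypothetical names, cells are assumed to be squares identified by their lower-left corner in metres, and the sketch speaks of "coarser" and "finer" grids to stay neutral about the direction of the associations.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GridCell:
    x: int     # easting of the lower-left corner, in metres
    y: int     # northing of the lower-left corner, in metres
    size: int  # cell side length, in metres

    def parent(self, coarse_size: int) -> GridCell:
        """Cell of the coarser grid that this cell overlaps."""
        if coarse_size % self.size != 0:
            raise ValueError("coarse_size must be a multiple of the cell size")
        return GridCell(
            self.x - self.x % coarse_size,
            self.y - self.y % coarse_size,
            coarse_size,
        )

    def children(self, fine_size: int) -> list[GridCell]:
        """Cells of the finer grid that compose this cell."""
        if self.size % fine_size != 0:
            raise ValueError("cell size must be a multiple of fine_size")
        steps = range(0, self.size, fine_size)
        return [GridCell(self.x + dx, self.y + dy, fine_size) for dx in steps for dy in steps]


# Illustrative coordinates: a 1 km cell and the 10 km cell it belongs to.
cell_1km = GridCell(4321000, 3210000, 1000)
cell_10km = cell_1km.parent(10000)
```

In this sketch a 10 km cell is composed of 100 cells of 1 km, and each of those cells maps back to the same 10 km cell.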

Confidentiality for tabular data

Eurostat disseminates a large number of statistical tables. All these tabular data are treated for primary confidentiality. Primary confidentiality concerns tabular cell data whose dissemination would permit attribute disclosure. The two main reasons for declaring data to be primary confidential are too few contributors in a cell and dominance of the n largest contributors in a cell <a href="#conf" aria-describedby="footnote-label" id="conf-ref"></a>.

In the tables disseminated on the Eurostat website, a cell is confidential if:

  • the (extrapolated) number of holdings that contribute to the cell (extrapolated) value is less than or equal to a certain value and/or
  • the n largest (extrapolated) holdings represent more than a certain percentage k% of the cell (extrapolated) value.

A confidential value is replaced with ":c".

For non-confidential cells, the extrapolated number of holdings and all values of variables in cells are rounded to the closest multiple of 10.

Because of the confidentiality treatment, the sum of the individual cells does not always match the value of the "total" cell.
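The treatment of a table cell can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the parameter values used in the example (`min_count=3`, `n=1`, `k=85`) are arbitrary because the text does not state the actual values, and each holding is assumed to be given as a (weight, value) pair so that extrapolated figures can be computed.

```python
def round_to_10(x):
    """Round a non-negative number to the closest multiple of 10 (halves round up here)."""
    return int((x + 5) // 10) * 10


def publish_cell(holdings, min_count, n, k):
    """Return ':c' for a confidential cell, otherwise the rounded extrapolated value.

    holdings: list of (weight, value) pairs, one per sampled holding.
    min_count, n, k: rule parameters (values assumed, not given in the text).
    """
    count = sum(w for w, _ in holdings)  # extrapolated number of holdings
    contributions = sorted((w * v for w, v in holdings), reverse=True)
    total = sum(contributions)           # extrapolated cell value
    too_few = count <= min_count
    dominated = total > 0 and sum(contributions[:n]) > k / 100 * total
    if too_few or dominated:
        return ":c"
    return round_to_10(total)


# Illustrative cells of one table row and their "total" cell.
cell_a = [(1.0, 40), (1.0, 30), (1.0, 34), (1.0, 20)]  # extrapolated value 124
cell_b = [(1.0, 44), (1.0, 40), (1.0, 40), (1.0, 10)]  # extrapolated value 134
published_a = publish_cell(cell_a, min_count=3, n=1, k=85)
published_b = publish_cell(cell_b, min_count=3, n=1, k=85)
published_total = publish_cell(cell_a + cell_b, min_count=3, n=1, k=85)
```

With these numbers the two cells are published as 120 and 130 while the total is published as 260, which shows how rounding alone can make the parts differ from the total; suppression with ":c" widens the gap further.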

<footer>
    <h2 class="visually-hidden" id="footnote-label">Footnotes</h2>
    <ol>
      <li id="risk"> European Business Statistics Manual. <a href="https://ec.europa.eu/eurostat/statistics-explained/index.php?title=European_business_statistics_manual_-_Statistical_Disclosure_Control#SDC_rules_and_methods_for_tabular_data_" target="_blank"> See https://ec.europa.eu/eurostat/statistics-explained/index.php?title=European_business_statistics_manual_-_Statistical_Disclosure_Control#SDC_rules_and_methods_for_tabular_data_</a>. <a href="#risk-ref" aria-label="Back to content">↩</a></li>
      <li id="conf">
Handbook on Statistical Disclosure Control, version 1.2, Jan 2010. <a href="https://ec.europa.eu/eurostat/cros/system/files/SDC_Handbook.pdf" target="_blank"> See https://ec.europa.eu/eurostat/cros/system/files/SDC_Handbook.pdf</a>. <a href="#conf-ref" aria-label="Back to content">↩</a></li>
    </ol>
  </footer>

...


...