Presentation of Census Results by Interactive Statistical Models

Missing data

Non-response: a missing value in a record, e.g. the respondent forgot or refused to answer a question

Non-response in the 2001 Czech census

No. Question title | Number of values | Non-response (%)
1. Region of residence 14 0.00
2. Type of residence 3 0.00
3. Economic activity 10 0.80
4. Birth place (relatively) 6 1.95
5. Religion 6 0.00
6. Occupation type 14 3.89
7. Sex 2 0.00
8. Marital status 4 0.55
9. Education 14 1.11
10. Age 9 0.03
11. Category of flat 5 0.53
12. Bathroom 5 0.59
13. Size of flat 7 0.64
14. Internet and PC 4 2.85
15. Legal relation to flat 9 0.39
16. Gas supply 3 0.78
17. Number of rooms over 8 m² 7 0.64
18. Number of cars in household 4 3.39
19. Number of persons in flat 6 0.00
20. Vacation property 6 7.45
21. Telephone in flat 5 1.80
22. Water supply 4 0.35
23. Type of heating 6 0.53
24. Toilet 6 0.50

Treatment of Non-response

  1. non-response as specific value
  2. non-response as unknown value
  3. estimation of missing values
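
The three treatments can be contrasted on a single toy column. This is only a sketch: the answers, the "not stated" label and the use of the most frequent answer as the estimate are illustrative assumptions, not the census procedure (which estimates missing values from a mixture model).

```python
from collections import Counter

# Toy answers to one question; None marks a non-response (hypothetical data).
answers = ["yes", "no", None, "yes", None, "yes"]

# 1. Non-response as a specific value: add an extra category.
as_category = [a if a is not None else "not stated" for a in answers]

# 2. Non-response as an unknown value: keep the gap, work with known answers only.
known_only = [a for a in answers if a is not None]

# 3. Estimation of missing values: fill the gap with an estimate.
#    Here the estimate is simply the most frequent known answer (a stand-in).
most_frequent = Counter(known_only).most_common(1)[0][0]
estimated = [a if a is not None else most_frequent for a in answers]
```

Treatment 1 changes the number of values of the variable, treatment 2 leaves records incomplete, and treatment 3 yields a complete database at the cost of estimation errors.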

2001 Czech Census

  1. 1,524,240 incomplete records (i.e. 15%)
  2. 2,933,427 missing values
  3. estimation accuracy: 73%
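
As a quick check of what these figures imply, the short calculation below derives the average number of missing values per incomplete record and a rough database size. The variable names are ours, and the total is only approximate because the 15% share is rounded.

```python
incomplete_records = 1_524_240
missing_values = 2_933_427
share_incomplete = 0.15  # rounded share quoted in the list above

# Average number of missing values in an incomplete record (about 1.9).
missing_per_incomplete = missing_values / incomplete_records

# Rough size of the whole database implied by the rounded 15% share.
implied_total_records = incomplete_records / share_incomplete

print(f"{missing_per_incomplete:.2f} missing values per incomplete record")
print(f"about {implied_total_records:,.0f} records in total")
```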

Remark

A typical feature of census data is the presence of incomplete records. In our experiments the census database (cf. Paper) included 1,524,240 incomplete records containing up to eighteen missing values each. For this reason we decided to solve the estimation problem in two steps. In the first step we estimated the mixture model from the incomplete data by means of a modified EM algorithm and then used the resulting mixture to replace the missing values by estimates. In other words, we replaced each non-response by the most probable response with respect to the known part of the record. In the second step we used the completed database to estimate the final distribution mixture.
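
The replacement step can be sketched as follows. This is a minimal illustration, not the original implementation: it assumes a mixture of products of discrete distributions, fills each missing variable separately with its most probable value given the known part of the record, and uses hypothetical names (`impute_record`, `weights`, `tables`) and toy parameters. Estimating the mixture itself by the modified EM algorithm is not shown.

```python
from math import prod

def impute_record(record, weights, tables):
    """Fill None entries of `record` with the most probable value.

    weights[m]      : weight of mixture component m
    tables[m][n][v] : P(variable n = v | component m)
    """
    # Component weights conditional on the observed entries only.
    joint = [
        w * prod(tables[m][n][v] for n, v in enumerate(record) if v is not None)
        for m, w in enumerate(weights)
    ]
    total = sum(joint)
    posterior = [j / total for j in joint]

    completed = list(record)
    for n, v in enumerate(record):
        if v is None:
            n_values = len(tables[0][n])
            # Conditional probability of each candidate value of variable n.
            cond = [
                sum(q * tables[m][n][c] for m, q in enumerate(posterior))
                for c in range(n_values)
            ]
            completed[n] = max(range(n_values), key=cond.__getitem__)
    return completed

# Toy mixture (assumed): two components, two variables with 2 and 3 values.
weights = [0.5, 0.5]
tables = [
    [[0.9, 0.1], [0.7, 0.2, 0.1]],
    [[0.2, 0.8], [0.1, 0.2, 0.7]],
]
completed = impute_record([1, None], weights, tables)
```

In the toy example the known answer makes the second component far more likely, so the missing second variable is filled with the value that component favours.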

There is no direct way to verify that the missing values have been replaced correctly, but we can simulate an analogous situation by estimating known values. In particular, for each variable separately, we randomly chose records for which the value of the tested variable was available. For each such record we then computed the corresponding estimate of this value and compared it with the true original. On average, 73% of the values were correctly identified by the maximum-likelihood estimates, which suggests that a similar share of the missing values would be estimated correctly.
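
The verification experiment can be sketched like this. It is a simplified illustration under stated assumptions: `holdout_accuracy` and `copy_first` are hypothetical names, `estimate` stands for any procedure that completes a record (such as the mixture-based replacement), and the toy estimator and records are made up for the example.

```python
import random

def holdout_accuracy(records, estimate, variable, sample_size, seed=0):
    """Share of known values of `variable` that `estimate` recovers."""
    rng = random.Random(seed)
    # Only records where the tested variable is actually known can be checked.
    known = [r for r in records if r[variable] is not None]
    sample = rng.sample(known, min(sample_size, len(known)))
    hits = 0
    for r in sample:
        masked = list(r)
        masked[variable] = None  # pretend the answer is missing
        hits += estimate(masked)[variable] == r[variable]
    return hits / len(sample)

def copy_first(record):
    """Toy estimator (assumed): guess variable 1 by copying variable 0."""
    filled = list(record)
    if filled[1] is None:
        filled[1] = filled[0]
    return filled

records = [(0, 0), (1, 1), (0, 1), (1, 1), (0, None)]
accuracy = holdout_accuracy(records, copy_first, variable=1, sample_size=4)
```

Repeating the call for every variable and averaging the results gives an overall figure comparable to the 73% reported above.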