Missing data
Non-response: a missing value in the record, e.g. the respondent forgot or refused to answer some question.

Non-response in Czech census 2001

Question | Title | Number of values | Non-response in % |
---|---|---|---|
1. | Region of residence | 14 | 0.00 |
2. | Type of residence | 3 | 0.00 |
3. | Economic activity | 10 | 0.80 |
4. | Birth place (relatively) | 6 | 1.95 |
5. | Religion | 6 | 0.00 |
6. | Occupation type | 14 | 3.89 |
7. | Sex | 2 | 0.00 |
8. | Marital status | 4 | 0.55 |
9. | Education | 14 | 1.11 |
10. | Age | 9 | 0.03 |
11. | Category of flat | 5 | 0.53 |
12. | Bathroom | 5 | 0.59 |
13. | Size of flat | 7 | 0.64 |
14. | Internet and PC | 4 | 2.85 |
15. | Legal relation to flat | 9 | 0.39 |
16. | Gas supply | 3 | 0.78 |
17. | Number of rooms over 8 m² | 7 | 0.64 |
18. | Number of cars in household | 4 | 3.39 |
19. | Number of persons in flat | 6 | 0.00 |
20. | Vacational property | 6 | 7.45 |
21. | Telephone in flat | 5 | 1.80 |
22. | Water supply | 4 | 0.35 |
23. | Type of heating | 6 | 0.53 |
24. | Toilet | 6 | 0.50 |
Treatment of Non-response
- non-response as specific value
- non-response as unknown value
- estimation of missing values
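The three treatments listed above can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a record is a list of categorical answers and that a non-response is stored as `None`; the function names and the `"no answer"` label are hypothetical. The third function uses the simplest possible estimator (the most frequent answer to each question), which is a stand-in and not the mixture-based estimation described in the remark below.

```python
from collections import Counter

MISSING = None  # hypothetical marker for a non-response


def as_specific_value(record, label="no answer"):
    """Treatment 1: non-response becomes one more category of the variable."""
    return [label if v is MISSING else v for v in record]


def known_part(record):
    """Treatment 2: keep the value unknown and work only with answered questions."""
    return {i: v for i, v in enumerate(record) if v is not MISSING}


def impute_mode(records):
    """Treatment 3 (simplest form): replace each non-response by the most
    frequent answer to that question. Assumes every question has at least
    one answered record."""
    n_vars = len(records[0])
    modes = []
    for i in range(n_vars):
        counts = Counter(r[i] for r in records if r[i] is not MISSING)
        modes.append(counts.most_common(1)[0][0])
    return [[modes[i] if v is MISSING else v for i, v in enumerate(r)]
            for r in records]
```

The first treatment enlarges the value set of each variable by one, the second leaves the record incomplete, and only the third produces a completed database.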
2001 Czech Census
- 1,524,240 incomplete records (i.e. 15%)
- 2,933,427 missing values
- estimation accuracy: 73%
Remark
A typical feature of census data is the presence of incomplete records. In our experiments the census database (cf. Paper) included 1,524,240 incomplete records, each containing up to eighteen missing values. For this reason we decided to solve the estimation problem in two steps. First we estimated the mixture model from the incomplete data by means of a modified EM algorithm, and the resulting mixture was then used to replace the missing values by estimates. In other words, we replaced each non-response by the most probable response with respect to the known part of the record. In the second step we used the completed database to estimate the final distribution mixture.
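The first step can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes a mixture whose components are products of categorical distributions, answers coded as integers `0..k-1`, and `None` for a non-response. The names `fit_mixture`, `impute` and `component_scores`, the random initialisation, and the small smoothing constant are all assumptions made for the example. The EM variant shown simply skips missing values when evaluating a record and when updating the tables, which is one common way to adapt EM to incomplete data.

```python
import random


def component_scores(rec, weights, tables):
    """Unnormalised posterior weight of each component given the known part of rec."""
    scores = []
    for w, comp in zip(weights, tables):
        s = w
        for n, v in enumerate(rec):
            if v is not None:          # missing answers are skipped
                s *= comp[n][v]
        scores.append(s)
    return scores


def fit_mixture(records, n_values, n_components=2, n_iter=50, seed=0):
    """EM for a product mixture; tables[m][n][v] approximates p_n(v | m)."""
    rng = random.Random(seed)
    weights = [1.0 / n_components] * n_components
    tables = []
    for _ in range(n_components):      # random, strictly positive start
        comp = []
        for k in n_values:
            raw = [rng.random() + 0.5 for _ in range(k)]
            total = sum(raw)
            comp.append([r / total for r in raw])
        tables.append(comp)
    for _ in range(n_iter):
        new_w = [0.0] * n_components
        # tiny pseudo-count keeps every probability positive (an assumption)
        counts = [[[1e-6] * k for k in n_values] for _ in range(n_components)]
        for rec in records:
            scores = component_scores(rec, weights, tables)      # E-step
            total = sum(scores)
            for m, s in enumerate(scores):
                q = s / total
                new_w[m] += q
                for n, v in enumerate(rec):
                    if v is not None:  # only observed answers update the table
                        counts[m][n][v] += q
        weights = [w / len(records) for w in new_w]              # M-step
        tables = [[[c / sum(col) for c in col] for col in comp]
                  for comp in counts]
    return weights, tables


def impute(rec, weights, tables):
    """Replace each missing answer by its most probable value given the known part."""
    scores = component_scores(rec, weights, tables)
    filled = list(rec)
    for n, v in enumerate(rec):
        if v is None:
            k = len(tables[0][n])
            # P(x_n = u | known part) is proportional to sum_m score_m * p_n(u | m)
            post = [sum(s * comp[n][u] for s, comp in zip(scores, tables))
                    for u in range(k)]
            filled[n] = max(range(k), key=post.__getitem__)
    return filled
```

Under these assumptions the second step would amount to calling `fit_mixture` again on the records returned by `impute`. For records with many variables the products in `component_scores` can underflow, so a practical version would work with logarithms.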
There is no direct way to verify whether the missing values have been replaced correctly, but we can simulate an analogous situation by estimating known values. In particular, for each variable separately, we randomly chose records for which the value of the tested variable was available. Then for each record we computed the corresponding estimate of this value and compared it with the true original. On average, 73% of these values were correctly identified by the maximum-likelihood estimates, which suggests that a similar share of the truly missing values would be identified correctly.
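The verification procedure described above can be sketched in a few lines of Python. The sketch assumes records are lists with `None` for a non-response and takes the estimator as a hypothetical callable `impute_fn` that receives a record and returns a completed copy. The sample size `n_test` and the seed are illustrative parameters, not values from the experiment.

```python
import random


def holdout_accuracy(records, impute_fn, var, n_test=1000, seed=0):
    """Share of hidden known values of variable `var` that impute_fn recovers.
    Assumes at least one record has an answer for `var`."""
    rng = random.Random(seed)
    candidates = [r for r in records if r[var] is not None]
    sample = rng.sample(candidates, min(n_test, len(candidates)))
    hits = 0
    for rec in sample:
        masked = list(rec)
        masked[var] = None             # pretend the answer is missing
        hits += impute_fn(masked)[var] == rec[var]
    return hits / len(sample)


def mean_accuracy(records, impute_fn, n_test=1000, seed=0):
    """Average of the per-variable accuracies over all variables."""
    n_vars = len(records[0])
    return sum(holdout_accuracy(records, impute_fn, v, n_test, seed)
               for v in range(n_vars)) / n_vars
```

Each variable is tested separately, as in the text, and the per-variable accuracies are then averaged into a single figure.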