Missing data

Non-response: a missing value in a record, e.g. the respondent forgot or refused to answer a question

Non-response in Czech census 2001
Question	Title	Number of values	Non-response in %
1.	Region of residence	14	0.00
2.	Type of residence	3	0.00
3.	Economic activity	10	0.80
4.	Birth place (relatively)	6	1.95
5.	Religion	6	0.00
6.	Occupation type	14	3.89
7.	Sex	2	0.00
8.	Marital status	4	0.55
9.	Education	14	1.11
10.	Age	9	0.03
11.	Category of flat	5	0.53
12.	Bathroom	5	0.59
13.	Size of flat	7	0.64
14.	Internet and PC	4	2.85
15.	Legal relation to flat	9	0.39
16.	Gas supply	3	0.78
17.	Number of rooms over 8m²	7	0.64
18.	Number of cars in household	4	3.39
19.	Number of persons in flat	6	0.00
20.	Vacational property	6	7.45
21.	Telephone in flat	5	1.80
22.	Water supply	4	0.35
23.	Type of heating	6	0.53
24.	Toilet	6	0.50

Treatment of Non-response

non-response as specific value
non-response as unknown value
estimation of missing values

2001 Czech Census

1,524,240 incomplete records (i.e. 15%)
2,933,427 missing values
estimation accuracy: 73%
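
The figures above imply two derived quantities that help put the non-response in scale. The short calculation below is only a sketch: the variable names are illustrative, and because the 15% share is a rounded figure, the implied total number of records is approximate.

```python
# Figures quoted above for the 2001 Czech Census.
incomplete_records = 1_524_240
missing_values = 2_933_427
incomplete_share = 0.15  # rounded share of incomplete records

# Average number of missing values per incomplete record.
missing_per_incomplete = missing_values / incomplete_records

# Approximate total number of records implied by the rounded 15% share.
approx_total_records = incomplete_records / incomplete_share

print(f"missing values per incomplete record: {missing_per_incomplete:.2f}")
print(f"approximate total records: {approx_total_records:,.0f}")
```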

Remark

A typical feature of census data is the presence of incomplete records. In our experiments the census database (cf. Paper) included 1,524,240 incomplete records, each containing up to eighteen missing values. For this reason we decided to solve the estimation problem in two steps. In the first step we estimated the mixture model from the incomplete data by means of a modified EM algorithm and then used the resulting mixture to replace the missing values by estimates. In other words, we replaced each non-response by the most probable response with respect to the known part of the record. In the second step we used the completed database to estimate the final distribution mixture.
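
The two-step procedure described above can be illustrated with a minimal sketch. It assumes a mixture of products of categorical distributions, so that a missing answer can simply be left out of a component's likelihood; the remark does not spell out the exact form of the mixture or of the modified EM algorithm, so this is an assumption. The names `MISSING`, `fit_mixture`, `posterior` and `impute` are hypothetical, and answers are assumed to be coded as integers starting at 0.

```python
import math
import random

MISSING = None  # marker for a non-response (an assumption of this sketch)

def posterior(x, weights, theta):
    """P(component | observed part of x); missing entries are skipped."""
    logp = []
    for w, comp in zip(weights, theta):
        lp = math.log(w)
        for n, v in enumerate(x):
            if v is not MISSING:
                lp += math.log(comp[n][v])
        logp.append(lp)
    top = max(logp)
    p = [math.exp(l - top) for l in logp]
    total = sum(p)
    return [pi / total for pi in p]

def fit_mixture(records, n_values, n_components=2, n_iter=50, seed=0):
    """EM for a mixture of products of categorical distributions,
    estimated directly from records that may contain MISSING entries."""
    rng = random.Random(seed)
    M = n_components
    weights = [1.0 / M] * M
    # theta[m][n][v] = P(x_n = v | component m), randomly initialised
    theta = []
    for _ in range(M):
        comp = []
        for k in n_values:
            raw = [rng.random() + 0.5 for _ in range(k)]
            comp.append([r / sum(raw) for r in raw])
        theta.append(comp)
    for _ in range(n_iter):
        new_w = [0.0] * M
        # small floor keeps every probability strictly positive
        counts = [[[1e-6] * k for k in n_values] for _ in range(M)]
        for x in records:
            q = posterior(x, weights, theta)            # E-step
            for m in range(M):
                new_w[m] += q[m]
                for n, v in enumerate(x):
                    if v is not MISSING:                # only observed answers count
                        counts[m][n][v] += q[m]
        # M-step: renormalise the accumulated posterior weights and counts
        weights = [(w + 1e-6) / (len(records) + M * 1e-6) for w in new_w]
        theta = [[[c / sum(col) for c in col] for col in comp] for comp in counts]
    return weights, theta

def impute(x, weights, theta):
    """Replace each MISSING entry by its most probable value
    given the observed entries of the same record."""
    q = posterior(x, weights, theta)
    out = list(x)
    for n, v in enumerate(x):
        if v is MISSING:
            k = len(theta[0][n])
            scores = [sum(q[m] * theta[m][n][val] for m in range(len(q)))
                      for val in range(k)]
            out[n] = max(range(k), key=scores.__getitem__)
    return out
```

Under these assumptions, the first step corresponds to calling `fit_mixture` on the incomplete records and applying `impute` to every record, and the second step to calling `fit_mixture` again on the completed records.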

There is no direct way to verify whether the missing values have been replaced correctly, but we can simulate an analogous situation by estimating known values. In particular, for each variable separately, we randomly chose records for which the value of the tested variable was available. Then for each record we computed the corresponding estimate of this value and compared it with the true original. On average, 73% of missing values would be correctly identified by the maximum-likelihood estimates.
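
The verification described above amounts to a hold-out test: hide a known answer, estimate it, and count the matches. The sketch below shows that loop under stated assumptions. The function names and the `estimate(masked_record, variable_index)` interface are hypothetical, and `make_mode_estimator` is only a stand-in that returns the most frequent answer; it is not the mixture-based maximum-likelihood estimate used in the experiments.

```python
import random
from collections import Counter

def holdout_accuracy(records, estimate, sample_size=100, seed=0):
    """For each variable, hide known values in randomly chosen records,
    re-estimate them, and return the per-variable share of exact matches."""
    rng = random.Random(seed)
    accuracy = []
    for n in range(len(records[0])):
        # only records where the tested variable was actually answered
        known = [r for r in records if r[n] is not None]
        sample = rng.sample(known, min(sample_size, len(known)))
        hits = 0
        for r in sample:
            masked = list(r)
            masked[n] = None  # pretend this answer is a non-response
            if estimate(masked, n) == r[n]:
                hits += 1
        accuracy.append(hits / len(sample) if sample else float("nan"))
    return accuracy

def make_mode_estimator(records):
    """Stand-in estimator (assumption): always predicts the most
    frequent observed value of the variable, ignoring the rest of the record."""
    modes = []
    for n in range(len(records[0])):
        counts = Counter(r[n] for r in records if r[n] is not None)
        modes.append(counts.most_common(1)[0][0])
    return lambda masked, n: modes[n]
```

Averaging the returned per-variable accuracies gives a single figure comparable in spirit to the 73% quoted above, provided the estimator passed in is the one actually used for imputation.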

Presentation of Census Results by Interactive Statistical Models
