Slide1
Integrating databases over time: what about representativeness in longitudinal integrated and panel data?
Silvia Biffignandi, Bergamo University Alessandro Zeli, Istat
silvia.biffignandi@unibg.it
zeli@istat.it
Q2010
Helsinki
Biffignandi Silvia- Zeli Alessandro
Slide2
Outline
The problem and the research objectives
Description of two longitudinal databases
Quality analyses of these databases
Conclusions
Slide3
The problem and the research objectives
NSIs usually carry out business surveys at different points in time, using different samples for different surveys as well as different samples over time.
Users need more and more statistical information.
New strategies are required.
Slide4
The problem and the research objectives
User needs for longitudinal data:
a) understanding aggregate changes in a variable, such as the employment rate, over time
b) studying the time-varying economic characteristics (such as employment) of an individual
Slide5
... and the research objectives
Our task:
construction of two longitudinal databases, based on various sources and on different criteria
to verify the consistency between estimates based on the databases and population data
Slide6
Description of two longitudinal databases
IDB (technically integrated database)
Panel
Slide7
Description of two longitudinal databases
microdata
target population: enterprises with 20 employees or more (40% in terms of employment and 60% in terms of value added)
variables: balance sheet data; SBS regulation data
period: 1998-2004
Slide8
Description of two longitudinal databases
1) IDB (technically integrated database)
IDB data by sources (source percentage) - Years 1998-2004
[Chart: percentage of IDB records by source code; horizontal axis: Years]
Codes description:
only BIL (ril=9)
PMI non-respondents, but data integrated by BIL source (bil=5)
SCI non-respondents, but data integrated by BIL source (ril=3)
SCI non-respondents, but donor imputation (ril=2)
SCI respondents (ril=1)
PMI respondents (ril=0)
Slide9
Description of two longitudinal databases
2) Panel
a catch-up panel database
it takes business transformations into account
integrity criterion, i.e. all variables in the panel have to be present for all enterprises in the whole panel period
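The integrity criterion can be illustrated with a minimal Python sketch. The record layout (a dict keyed by enterprise id and year) and the variable names `turnover` and `value_added` are hypothetical and are not taken from the actual IDB or panel files.

```python
YEARS = range(1998, 2005)                 # panel period 1998-2004
VARIABLES = ("turnover", "value_added")   # hypothetical variable names


def satisfies_integrity(records, enterprise_id):
    """Return True if every variable is present for the enterprise in every panel year."""
    for year in YEARS:
        row = records.get((enterprise_id, year))
        if row is None:
            return False  # no record at all for this year
        if any(row.get(var) is None for var in VARIABLES):
            return False  # at least one variable is missing
    return True


# Toy data: enterprise "A" is complete, "B" lacks value added in 2001.
records = {}
for year in YEARS:
    records[("A", year)] = {"turnover": 100.0, "value_added": 40.0}
    records[("B", year)] = {"turnover": 80.0, "value_added": 30.0}
records[("B", 2001)]["value_added"] = None

panel_ids = [eid for eid in ("A", "B") if satisfies_integrity(records, eid)]
```

In this toy example only enterprise "A" would enter the panel, because a single missing value in any year is enough to exclude "B".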
Slide10
Description of two longitudinal databases
2) Panel
Step 1: enterprises (with at least 20 persons employed) responding to the SCI-PMI surveys in the starting year (1998) + all enterprises with at least 100 employees (even if non-respondents) if the BIL source is available (integration);
Step 2: continuity criterion applied to the previous enterprises;
Step 3: persistence criterion (i.e. enterprises that are respondents in 1998 or have data in the BIL for at least 4 years are included in the panel)
Slide11
Description of two longitudinal databases
2) Panel
Panel 1998-2004 dimension in terms of number of firms and persons employed

                                          Number     Number of           % w.r.t. population
                                          of firms   persons employed    Firms    Persons employed
Starting database (Sci-Pmi respondents)   17,097     3,902,001           24.2     67.0
Persistence                               13,573     3,243,549           19.2     55.7
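As a small worked check on the table, the following Python sketch computes how much of the starting database survives the persistence step. It uses only the figures reported above. The implied population total is an approximation derived from the rounded percentage and is not a figure given in the presentation.

```python
# Figures copied from the table above.
starting = {"firms": 17_097, "persons": 3_902_001, "pct_firms": 24.2, "pct_persons": 67.0}
persistence = {"firms": 13_573, "persons": 3_243_549, "pct_firms": 19.2, "pct_persons": 55.7}

# Share of the starting database that is kept after the persistence step.
firms_retained = persistence["firms"] / starting["firms"]
persons_retained = persistence["persons"] / starting["persons"]

# Population of firms implied by the rounded percentage (approximate only).
implied_firms_population = starting["firms"] / (starting["pct_firms"] / 100)

print(f"firms retained:   {firms_retained:.1%}")
print(f"persons retained: {persons_retained:.1%}")
print(f"implied population of firms: about {implied_firms_population:,.0f}")
```

The retained shares (about 79% of firms and 83% of persons employed) agree with the ratios of the population percentages, 19.2/24.2 and 55.7/67.0, so the two rows of the table are mutually consistent.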
Slide12
Quality analyses
Verify the equality of the population structure in the different databases (especially the sectoral composition).
We apply two different approaches:
the statistical analysis of differences between the distributions of some important variables in IDB/panel and universe;
an index of representativeness related to the main categorical variables.
Slide13
Quality analyses
Difference between distributions 1.a)
Spearman's rank correlation for the distributions of value added, persons employed and turnover values of economic divisions, years 1998-2004:
in IDB and universe (minimum 95.2, maximum 99.8)
in panel and universe (minimum 90.9, maximum 97)
In both situations the correlation is very high in each year.
Limitation: this gives only an ordinal ranking of the relative changes among the economic divisions.
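A minimal sketch of the rank correlation used here, written in plain Python so that it needs no external library. Spearman's coefficient is computed as the Pearson correlation of the ranks, with tied values sharing their average rank. The share values are hypothetical and only illustrate the calculation; they are not data from the IDB, the panel or the universe.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation of two equally long sequences (neither may be constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(a), average_ranks(b))


# Hypothetical value-added shares (%) of five economic divisions.
universe_shares = [30.0, 25.0, 20.0, 15.0, 10.0]
panel_shares = [33.0, 22.0, 24.0, 12.0, 9.0]
rho = spearman(universe_shares, panel_shares)
```

In this toy example only two divisions swap places in the ranking, so the coefficient stays close to 1 even though the shares themselves differ. This is the sense in which the measure captures ordinal agreement only.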
Slide14
Quality analyses
Difference between distributions 2.a)
Fligner-Policello test of stochastic equality
distributions of shares of turnover, value added and employment by divisions of economic activity - Years 1998-2004
IDB vs Universe
Panel vs Universe
Test not significant for all variables for all years in the panel.
Slide15
Quality analyses
Representativeness indexes 2.)
R-indexes (representativeness) - RISQ project (see, for instance, Schouten and Cobben, 2007; Shlomo et al., 2009).
Support for the quality comparison of different surveys or registers, to compare the response:
to different surveys that share the same target population
to a survey during data collection
to a survey longitudinally
Slide16
Quality analyses
Representativeness indexes 2.)
a) Weak representativity R2 indicator
i.e. the average response propensity over the categories is constant
Response probability estimation required: usually a logistic regression model
In our study the auxiliary variables are:
industrial division as classified in NACE Rev. 1.1 (2-digit sectors)
3 size classes: 20 to 49, 50 to 249, and 250 and over persons employed
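The following Python sketch illustrates how an overall representativeness indicator can be computed from estimated response propensities. It rests on two assumptions that are not taken from the slides. First, it uses the form R = 1 - 2·S(ρ̂) commonly given in the RISQ literature, where S is the standard deviation of the estimated propensities; the exact R2 formula of the presentation is not reproduced in this text. Second, the propensities are estimated by the response rate within each cell of the auxiliary variable, which is what a logistic model saturated in those cells would return. The units are hypothetical.

```python
import math
from collections import defaultdict


def cell_response_rates(units):
    """units: iterable of (cell, responded) pairs -> {cell: response rate}."""
    totals = defaultdict(int)
    responses = defaultdict(int)
    for cell, responded in units:
        totals[cell] += 1
        responses[cell] += 1 if responded else 0
    return {cell: responses[cell] / totals[cell] for cell in totals}


def r_indicator(units):
    """Assumed form R = 1 - 2 * S(rho), with S the standard deviation of the propensities."""
    rates = cell_response_rates(units)
    propensities = [rates[cell] for cell, _ in units]
    n = len(propensities)
    mean = sum(propensities) / n
    variance = sum((p - mean) ** 2 for p in propensities) / (n - 1)
    return 1.0 - 2.0 * math.sqrt(variance)


# Hypothetical units: (size class, included in the panel?)
balanced = [("20-49", True), ("20-49", False), ("50-249", True), ("50-249", False)]
skewed = [("20-49", True), ("20-49", False), ("20-49", False), ("20-49", False),
          ("50-249", True), ("50-249", True), ("50-249", True), ("50-249", False)]

r_balanced = r_indicator(balanced)  # equal propensities in both classes
r_skewed = r_indicator(skewed)      # small enterprises respond less often
```

When every class has the same response rate the indicator equals 1; the more the propensities vary across classes, the further it falls below 1.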
Slide17
Quality analyses
Representativeness indexes b)
b) If X is an auxiliary variable with H classes, a marginal indicator (MR2) is proposed:
[Equation not recoverable from the source]
where the parameter in the equation is the centralised regression parameter for category h
Slide18
Quality analyses
Representativeness indexes b)
R2 index and lower bound
Slide19
Quality analyses
Representativeness indexes b)
Marginal R-index (panel years 1998, 2001, 2004) by enterprise size
Slide20
Quality analyses
Representativeness indexes b)
Marginal R-index by section of economic activity (panel years 1998, 2001, 2004)
Slide21
Quality analyses
Representativeness indexes 2.)
R2 indexes are very high in each year included in the panel.
The marginal R-index remains essentially the same, with a slight decrease over the period.
Medium-large enterprises (50 persons employed and over) are overrepresented; small enterprises (between 20 and 49 persons employed) are underrepresented.
The service sector is underrepresented.
Summing up: we are quite confident that the level of representativeness is appropriate in the global context.
Slide22
Concluding remarks
- the use of administrative data integration is promising for longitudinal database construction
- IDB and panel estimates are satisfactory
- the panel allows for a gain of information at reasonable cost/effort: for instance, the grouping of enterprises according to classifications selected by the user (gazelles, or best performers) other than the ordinary classifications used in the sample design (economic activity, size, geographical area)
Further research
- more representativity indicator analyses
- panel update criteria
Slide23
Thank you for your attention!!
Slide24
R-indexes (representativeness) - RISQ project
Schouten and Cobben (2007): weak representativity
a selection indicator that takes value 1 if the unit is selected in the sample and 0 otherwise
first-order inclusion probabilities
Weak representativity R2 indicator:
a response subset is representative for a categorical variable X with H categories if the average response propensity over the categories is constant:
$$\bar{\rho}_h = \frac{1}{N_h} \sum_{k=1}^{N_h} \rho_{h,k} = \bar{\rho}, \qquad h = 1, \dots, H$$
where $N_h$ is the population size of category $h$, $\rho_{h,k}$ is the response propensity of unit $k$ in class $h$, and summation is over all units in this category.
Slide25
R-indexes (representativeness) - Schouten and Cobben (2007), RISQ project
The response probability
Auxiliary variables in our study are:
industrial division as classified in NACE Rev. 1.1 (2-digit sectors)
3 size classes: 20 to 49, 50 to 249, and 250 and over persons employed
Slide26
R-indexes (representativeness) - Schouten and Cobben (2007)
Bias
[Equations and their definitions not recoverable from the source]
Slide27
Fligner-Policello test
Task: verify the equality of the distributions, for IDB or panel and universe, of the relative shares of the economic divisions for the three variables considered, with respect to the totals for each year.
The Fligner-Policello test makes no assumptions of normality or equal variances, nor that the two distributions have a similar shape.
It is a test of stochastic equality between two distributions; rejection of the null hypothesis means that the two distributions are different in probability.
If the null hypothesis is rejected, the sign of the F-P statistic indicates which of the two distributions is dominant: a positive sign means that panel shares have a higher probability of taking greater values with respect to the population.
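A minimal plain-Python sketch of the Fligner-Policello statistic in its usual textbook placement form, with a two-sided p-value from the large-sample normal approximation. The argument order is an assumption chosen so that the sign matches the reading given above: the universe shares are passed first and the panel shares second, so a positive statistic means the panel shares tend to be larger. The share values are hypothetical.

```python
import math


def placements(a, b):
    """For each value in a, count the values in b that are smaller (ties count 1/2)."""
    return [sum(1.0 if v < u else 0.5 if v == u else 0.0 for v in b) for u in a]


def fligner_policello(x, y):
    """Return (statistic, two-sided p-value); statistic > 0 when y tends to exceed x."""
    p = placements(x, y)  # P_i: number of y values below x_i
    q = placements(y, x)  # Q_j: number of x values below y_j
    p_bar = sum(p) / len(p)
    q_bar = sum(q) / len(q)
    v1 = sum((pi - p_bar) ** 2 for pi in p)
    v2 = sum((qj - q_bar) ** 2 for qj in q)
    numerator = sum(q) - sum(p)
    denominator = 2.0 * math.sqrt(v1 + v2 + p_bar * q_bar)
    if denominator == 0.0:
        # Completely separated samples: the statistic is unbounded.
        return math.copysign(math.inf, numerator), 0.0
    statistic = numerator / denominator
    p_value = math.erfc(abs(statistic) / math.sqrt(2.0))  # normal approximation
    return statistic, p_value


# Hypothetical shares (%) of economic divisions in the universe and in the panel.
universe_shares = [1.0, 2.0, 3.0, 4.0]
panel_shares = [3.0, 4.0, 5.0, 6.0]
stat, p_value = fligner_policello(universe_shares, panel_shares)
```

With so few observations the normal approximation is rough; for small samples, tabulated critical values are normally preferred over the approximate p-value.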