Fit a linear model with multiple group fixed effects
'felm' is used to fit linear models with multiple group fixed effects, similarly to 'lm'. It uses the Method of Alternating Projections to sweep out multiple group effects from the normal equations before estimating the remaining coefficients with OLS.
felm( formula, data, exactDOF = FALSE, subset, na.action, contrasts = NULL, weights = NULL, ... )
formula |
an object of class '"formula"' (or one that can be coerced to that class): a symbolic description of the model to be fitted. Similarly to 'lm'. See Details. |
data |
a data frame containing the variables of the model. |
exactDOF |
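logical. If there are more than two factors, the degrees of freedom used to scale the covariance matrix (and the standard errors) are normally estimated. Setting exactDOF='rM' computes the exact degrees of freedom with rankMatrix() in package Matrix (see Details). If the degrees of freedom are known for some reason, they can be passed as a numeric value. |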
subset |
an optional vector specifying a subset of observations to be used in the fitting process. |
na.action |
a function which indicates what should happen when the data contain missing values (NAs). |
contrasts |
an optional list. See the |
weights |
an optional vector of weights to be used in the fitting
process. Should be 'NULL' or a numeric vector. If non-NULL, weighted least
squares is used with weights |
... |
other arguments.
|
This function is intended for use with large datasets with multiple group effects of large cardinality. If dummy-encoding the group effects results in a manageable number of coefficients, you are probably better off using lm.
The formula specification is a response variable followed by a four part formula. The first part consists of ordinary covariates, the second part consists of factors to be projected out. The third part is an IV-specification. The fourth part is a cluster specification for the standard errors. I.e. something like y ~ x1 + x2 | f1 + f2 | (Q|W ~ x3+x4) | clu1 + clu2 where y is the response, x1,x2 are ordinary covariates, f1,f2 are factors to be projected out, Q and W are covariates which are instrumented by x3 and x4, and clu1,clu2 are factors to be used for computing cluster robust standard errors. Parts that are not used should be specified as 0, except when they are at the end of the formula, where they can be omitted. The parentheses are needed in the third part since | has higher precedence than ~. Multiple left hand sides like y|w|x ~ x1 + x2 |f1+f2|... are allowed.
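As a minimal sketch of the multipart syntax (the simulated data and variable names are made up for illustration, and the lfe package is assumed to be installed), an unused IV part is written as 0 when a cluster part follows it, while unused trailing parts are simply left out:

```r
library(lfe)

# Hypothetical toy data, purely illustrative
set.seed(1)
n <- 500
d <- data.frame(
  x1   = rnorm(n),
  x2   = rnorm(n),
  f1   = factor(sample(10, n, replace = TRUE)),
  f2   = factor(sample(8, n, replace = TRUE)),
  clu1 = factor(sample(25, n, replace = TRUE))
)
d$y <- d$x1 + 0.5 * d$x2 + rnorm(n)

# No IV part: the third part must be 0 because a cluster part follows it
est_clu <- felm(y ~ x1 + x2 | f1 + f2 | 0 | clu1, data = d)

# No IV part and no cluster part: the trailing parts are omitted
est_plain <- felm(y ~ x1 + x2 | f1 + f2, data = d)

# The point estimates should agree between the two fits
cbind(clustered = coef(est_clu), plain = coef(est_plain))
```

Both calls project out the same factors, so only the reported standard errors are expected to differ.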
Interactions between a covariate x and a factor f can be projected out with the syntax x:f. The terms in the second and fourth parts are not treated as ordinary formulas. In particular, things like y ~ x1 | x*f are not possible; rather, one would specify y ~ x1 + x | x:f + f. Note that f:x also works, since R's parser does not keep the order. This means that in interactions, the factor must actually be stored as a factor, whereas a non-interacted factor will be coerced to a factor. I.e. in y ~ x1 | x:f1 + f2, the f1 must be a factor, whereas it will work as expected if f2 is an integer vector.
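The interaction syntax can be sketched as follows, assuming lfe is installed and using made-up data. The sketch projects out both f and the f-specific slopes on x, then compares the remaining coefficient on x1 with an explicit lm fit:

```r
library(lfe)

# Hypothetical toy data: the slope on x differs by level of f
set.seed(2)
n <- 400
d <- data.frame(
  x1 = rnorm(n),
  x  = rnorm(n),
  f  = factor(sample(6, n, replace = TRUE))  # must be a factor for x:f
)
slope <- rnorm(nlevels(d$f))
d$y <- d$x1 + slope[d$f] * d$x + rnorm(n)

# Project out f and the interaction x:f
est <- felm(y ~ x1 | x:f + f, data = d)

# The same model with explicit dummies and interaction terms
fit <- lm(y ~ x1 + f + x:f, data = d)

# Coefficient on x1 from both fits
c(felm = as.numeric(coef(est))[1], lm = unname(coef(fit)["x1"]))
```

The two estimates of the x1 coefficient should agree up to the convergence tolerance of the projection.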
In older versions of lfe the syntax was felm(y ~ x1 + x2 + G(f1) + G(f2), iv=list(Q ~ x3+x4, W ~ x3+x4), clustervar=c('clu1','clu2')). This syntax still works, but yields a warning. Users are strongly encouraged to change to the new multipart formula syntax. The old syntax will be removed at a later time.
The standard errors are adjusted for the reduced degrees of freedom coming from the dummies which are implicitly present. (An exception occurs in the case of clustered standard errors and, specifically, where clusters are nested within fixed effects; see here.) In the case of two factors, the exact number of implicit dummies is easy to compute. If there are more factors, the number of dummies is estimated by assuming there's one reference-level for each factor. This may be a slight over-estimation, leading to slightly too large standard errors. Setting exactDOF='rM' computes the exact degrees of freedom with rankMatrix() in package Matrix.
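A small sketch of the two degrees-of-freedom modes with three factors, assuming lfe and Matrix are installed. The data are simulated for illustration only, and the comparison uses the p field of the returned object:

```r
library(lfe)

# Hypothetical toy data with three factors to project out
set.seed(3)
n <- 1000
d <- data.frame(
  x  = rnorm(n),
  f1 = factor(sample(20, n, replace = TRUE)),
  f2 = factor(sample(15, n, replace = TRUE)),
  f3 = factor(sample(10, n, replace = TRUE))
)
d$y <- d$x + rnorm(n)

# Default: the number of implicit dummies is estimated
est_default <- felm(y ~ x | f1 + f2 + f3, data = d)

# Exact count via Matrix::rankMatrix(); slower on large data
est_exact <- felm(y ~ x | f1 + f2 + f3, data = d, exactDOF = "rM")

# p is the total number of coefficients, including those projected out
c(default = est_default$p, exact = est_exact$p)
```

On small, densely connected data like this the two counts may well coincide; a difference is more likely with sparse, high-cardinality factors.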
For the iv-part of the formula, it is only necessary to include the instruments on the right hand side. The other explanatory covariates, from the first and second part of formula, are added automatically in the first stage regression. See the examples.
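A minimal IV sketch along these lines, assuming lfe is installed. The data-generating process and the names z and u are invented for illustration:

```r
library(lfe)

# Hypothetical toy data: Q is endogenous, z is its instrument
set.seed(4)
n <- 1000
d <- data.frame(
  x1 = rnorm(n),
  z  = rnorm(n),
  f  = factor(sample(10, n, replace = TRUE))
)
u   <- rnorm(n)  # unobserved confounder affecting both Q and y
d$Q <- 0.8 * d$z + 0.5 * d$x1 + u + rnorm(n)
d$y <- d$x1 + 2 * d$Q - u + rnorm(n)

# Only the instrument z is listed on the right hand side of the IV part
ivest <- felm(y ~ x1 | f | (Q ~ z), data = d)

# Coefficient names of the second stage; the instrumented
# variable carries a "(fit)" suffix
rownames(as.matrix(coef(ivest)))

# First stage coefficients: x1 was added automatically next to z
coef(ivest$stage1)
```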
The contrasts argument is similar to the one in lm(); it is used for factors in the first part of the formula. The factors in the second part are analyzed as part of a possible subsequent getfe() call.
The cmethod argument may affect the clustered covariance matrix (and thus regressor standard errors), either directly or via adjustments to a degrees of freedom scaling factor. In particular, Cameron, Gelbach and Miller (CGM2011, sec. 2.3) describe two possible small cluster corrections that are relevant in the case of multiway clustering.
The first approach adjusts each component of the cluster-robust variance estimator (CRVE) by its own c_i adjustment factor. For example, the first component (with G clusters) is adjusted by c_1 = G/(G-1)*(N-1)/(N-K), the second component (with H clusters) is adjusted by c_2 = H/(H-1)*(N-1)/(N-K), etc.
The second approach applies the same adjustment to all CRVE components: c = J/(J-1)*(N-1)/(N-K), where J=min(G,H) in the case of two-way clustering, for example.
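To make the two corrections concrete, the following sketch evaluates the formulas above for illustrative values of N, K, G and H. The numbers are arbitrary and not taken from any dataset:

```r
# Illustrative values only
N <- 1000  # observations
K <- 3     # estimated parameters
G <- 500   # clusters in the first dimension
H <- 20    # clusters in the second dimension

# First approach: one adjustment factor per CRVE component
c_1 <- G / (G - 1) * (N - 1) / (N - K)
c_2 <- H / (H - 1) * (N - 1) / (N - K)

# Second approach: a single factor based on the smallest cluster count
J <- min(G, H)
c_common <- J / (J - 1) * (N - 1) / (N - K)

round(c(c_1 = c_1, c_2 = c_2, c_common = c_common), 4)
```

With these values the common factor equals c_2, so the second approach inflates the first component more than the first approach would.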
Any differences resulting from these two approaches are likely to be minor, and they will obviously yield exactly the same results when there is only one cluster dimension. Still, CGM2011 adopt the former approach in their own paper and simulations. This is also the default method that felm uses (i.e. cmethod = 'cgm'). However, the latter approach has since been adopted by several other packages that allow for robust inference with multiway clustering. This includes the popular Stata package reghdfe, as well as the FixedEffectModels.jl implementation in Julia. To match results from these packages exactly, use cmethod = 'cgm2' (or its alias, cmethod = 'reghdfe'). It is possible that some residual differences may still remain; see discussion here.
The old syntax with a single part formula with the G() syntax for the factors to transform away is still supported, as well as the clustervar and iv arguments, but users are encouraged to move to the new multi part formulas as described here. The clustervar and iv arguments have been moved to the ... argument list. They will be removed in some future update.
felm returns an object of class "felm". It is quite similar to an "lm" object, but not entirely compatible.
The generic summary-method will yield a summary which may be print'ed. The object has some resemblance to an 'lm' object, and some postprocessing methods designed for lm may happen to work. It may however be necessary to coerce the object to succeed with this.
The "felm" object is a list containing the following fields:
coefficients |
a numerical vector. The estimated coefficients. |
N |
an integer. The number of observations |
p |
an integer. The total number of coefficients, including those projected out. |
response |
a numerical vector. The response vector. |
fitted.values |
a numerical vector. The fitted values. |
residuals |
a numerical vector. The residuals of the full system, with dummies. For IV-estimations, these are the residuals when the original endogenous variables are used, not their predictions from the 1st stage. |
r.residuals |
a numerical vector. Reduced residuals, i.e. the residuals resulting from predicting without the dummies. |
iv.residuals |
a numerical vector. When using instrumental variables, the residuals from the 2nd stage, i.e. when predicting with the predicted endogenous variables from the 1st stage. |
weights |
numeric. The square root of the argument |
cfactor |
factor of length N. The factor describing the connected components of the two first terms in the second part of the model formula. |
vcv |
a matrix. The variance-covariance matrix. |
fe |
list of factors. A list of the terms in the second part of the model formula. |
stage1 |
The ' |
iv1fstat |
list of numerical vectors. For IV 1st stage, F-value for excluded instruments, the number of parameters in restricted model and in the unrestricted model. |
X |
matrix. The expanded data matrix, i.e. from the first part of the
formula. To save memory with large datasets, it is only included if
|
cX, cY |
matrix. The centred expanded data matrix. Only included if
|
boot |
The result of a |
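A short sketch of how some of the fields listed above can be inspected, assuming lfe is installed and using a made-up dataset:

```r
library(lfe)

# Hypothetical toy data
set.seed(5)
n <- 300
d <- data.frame(
  x1   = rnorm(n),
  id   = factor(sample(12, n, replace = TRUE)),
  firm = factor(sample(7, n, replace = TRUE))
)
d$y <- d$x1 + rnorm(n)

est <- felm(y ~ x1 | id + firm, data = d)

est$N                # number of observations
est$p                # total number of coefficients, incl. projected out
NROW(est$residuals)  # one residual per observation
dim(est$vcv)         # covariance matrix of the estimated coefficients
length(est$fe)       # factors from the second part of the formula
```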
Side effect: If data is an object of class "pdata.frame" (from the plm package), the plm namespace is loaded if available, and data is coerced to a "data.frame" with as.data.frame which dispatches to a plm method. This ensures that transformations like diff and lag from plm work as expected, but it also incurs an additional copy of the data, and the plm namespace remains loaded after felm returns. When working with "pdata.frame"s, this is what is usually wanted anyway.
For technical reasons, when running IV-estimations, the data frame supplied in the data argument to felm should not contain variables with names ending in '(fit)'. Variables with such names are used internally by felm, and may then accidentally be looked up in the data frame instead of the local environment where they are defined.
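One way to guard against this before an IV-estimation is to scan the column names, as in the base R sketch below. The data frame is a made-up example and the check is not part of lfe:

```r
# Hypothetical data frame with one offending column name
d <- data.frame(y = 1:3, x = 4:6)
d[["Q(fit)"]] <- 7:9

# Columns whose names end in "(fit)" could clash with felm's internals
bad <- names(d)[endsWith(names(d), "(fit)")]
bad

# Rename or drop them before calling felm with an IV part
if (length(bad) > 0) {
  d_clean <- d[, setdiff(names(d), bad), drop = FALSE]
}
```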
Cameron, A.C., J.B. Gelbach and D.L. Miller (2011) Robust inference with multiway clustering, Journal of Business & Economic Statistics 29(2), 238–249. doi: 10.1198/jbes.2010.07136
Kolesar, M., R. Chetty, J. Friedman, E. Glaeser, and G.W. Imbens (2014) Identification and Inference with Many Invalid Instruments, Journal of Business & Economic Statistics (to appear). doi: 10.1080/07350015.2014.978175
library(lfe)

## Default is to use all cores. We'll limit it to 2 for this example.
oldopts <- options("lfe.threads")
options(lfe.threads = 2)

## Simulate data
set.seed(42)
n <- 1e3

d <- data.frame(
  # Covariates
  x1 = rnorm(n),
  x2 = rnorm(n),
  # Individuals and firms
  id = factor(sample(20, n, replace=TRUE)),
  firm = factor(sample(13, n, replace=TRUE)),
  # Noise
  u = rnorm(n)
)

# Effects for individuals and firms
id.eff <- rnorm(nlevels(d$id))
firm.eff <- rnorm(nlevels(d$firm))

# Left hand side
d$y <- d$x1 + 0.5*d$x2 + id.eff[d$id] + firm.eff[d$firm] + d$u

## Estimate the model and print the results
est <- felm(y ~ x1 + x2 | id + firm, data = d)
summary(est)
# Compare with lm
summary(lm(y ~ x1 + x2 + id + firm - 1, data = d))

## Example with 'reverse causation' (IV regression)

# Q and W are instrumented by x3 and the factor x4.
d$x3 <- rnorm(n)
d$x4 <- sample(12, n, replace=TRUE)
d$Q <- 0.3*d$x3 + d$x1 + 0.2*d$x2 + id.eff[d$id] + 0.3*log(d$x4) - 0.3*d$y +
  rnorm(n, sd=0.3)
d$W <- 0.7*d$x3 - 2*d$x1 + 0.1*d$x2 - 0.7*id.eff[d$id] + 0.8*cos(d$x4) -
  0.2*d$y + rnorm(n, sd=0.6)
# Add them to the outcome variable
d$y <- d$y + d$Q + d$W

## Estimate the IV model and report robust SEs
ivest <- felm(y ~ x1 + x2 | id + firm | (Q|W ~ x3 + factor(x4)), data = d)
summary(ivest, robust = TRUE)
condfstat(ivest)
# Compare with the not instrumented fit:
summary(felm(y ~ x1 + x2 + Q + W | id + firm, data = d))

## Example with multiway clustering

# Create a large cluster group (500 clusters) and a small one (20 clusters)
d$cl1 <- factor(sample(rep(1:500, length.out=n)))
d$cl2 <- factor(sample(rep(1:20, length.out=n)))
# Function for adding clustered noise to our outcome variable
cl_noise <- function(cl) {
  obs_per_cluster <- n/nlevels(cl)
  unlist(replicate(nlevels(cl),
                   rnorm(obs_per_cluster, mean=rnorm(1), sd=runif(1)),
                   simplify=FALSE))
}

# New outcome variable
d$y_cl <- d$x1 + 0.5*d$x2 + id.eff[d$id] + firm.eff[d$firm] +
  cl_noise(d$cl1) + cl_noise(d$cl2)

## Estimate and print the model with cluster-robust SEs (default)
est_cl <- felm(y_cl ~ x1 + x2 | id + firm | 0 | cl1 + cl2, data = d)
summary(est_cl)

# Print ordinary standard errors:
summary(est_cl, robust = FALSE)
# Match cluster-robust SEs from Stata's reghdfe package:
summary(felm(y_cl ~ x1 + x2 | id + firm | 0 | cl1 + cl2, data = d,
             cmethod = "reghdfe"))

## Restore default options
options(oldopts)