Analyzing Panel Data in R and Stata

Yoonseul Choi
Aug 4, 2022

Panel data must contain an id variable that identifies each panel group (for example, a household id) and a time variable. Suppose the group id is i = 1, 2, ..., n, so that there are n households, and the time id is t = 1, 2, ..., T, so that each household is surveyed for T years. The total number of observations is then N = n*T. This holds for a balanced panel, in which every group is observed in every period.

In unbalanced panel data, not every panel group i is observed at every time point; some groups are missing in some periods. The number of observations therefore has to be summed group by group: N = T_1 + T_2 + ... + T_n, where T_i is the number of time observations for panel group i.
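The counting rule can be checked on a small example. The following is a minimal sketch in plain Python rather than R or Stata; the rows and the function name panel_dimensions are made up for illustration, and the balance check assumes there are no duplicate (id, year) rows.

```python
from collections import Counter

# Hypothetical long-format panel: one (household id, survey year) pair per row.
rows = [
    (1, 2019), (1, 2020), (1, 2021),
    (2, 2019), (2, 2020), (2, 2021),
    (3, 2019), (3, 2021),              # household 3 was not observed in 2020
]

def panel_dimensions(rows):
    """Return n, the per-group counts T_i, the total N, and a balance flag."""
    t_i = Counter(group for group, _ in rows)          # T_i for each group i
    n = len(t_i)                                       # number of panel groups
    total = sum(t_i.values())                          # N = T_1 + ... + T_n
    n_periods = len({period for _, period in rows})    # distinct time points
    # Balanced means every group is observed at every time point
    # (assumes no duplicate (id, year) rows).
    balanced = all(count == n_periods for count in t_i.values())
    return n, dict(t_i), total, balanced

n, t_i, total, balanced = panel_dimensions(rows)
print(n, t_i, total, balanced)
```

Here N is 8 rather than 3*3 = 9, because household 3 contributes only T_3 = 2 observations.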

Panel data can also be classified by whether it contains time gaps. If every panel group is observed in consecutive periods, the panel has no time gap. A panel with a time gap arises, for example, when a household skips one survey wave and later returns to the panel.
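A time gap can be detected by comparing the periods each group was actually observed in with the periods between its first and last observation. This is only a sketch under assumptions: the data and the function name groups_with_gaps are hypothetical, and periods are taken to be integers spaced one unit apart by default.

```python
from collections import defaultdict

# Hypothetical (household id, survey year) rows.
rows = [
    (1, 2019), (1, 2020), (1, 2021),
    (2, 2019), (2, 2021),              # skipped 2020, then returned
    (3, 2020), (3, 2021),              # entered late, but has no gap
]

def groups_with_gaps(rows, step=1):
    """Return {group id: sorted missing periods} for groups with a time gap."""
    periods = defaultdict(set)
    for group, period in rows:
        periods[group].add(period)
    gaps = {}
    for group, seen in periods.items():
        # Every period between the group's first and last observation.
        expected = set(range(min(seen), max(seen) + 1, step))
        missing = sorted(expected - seen)
        if missing:
            gaps[group] = missing
    return gaps

gaps = groups_with_gaps(rows)
print(gaps)
```

Household 3 is not flagged: entering the panel late makes it unbalanced, but its observations are still consecutive.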

Panels can also be divided into micro panels and macro panels by comparing the number of panel groups n with the number of survey time points T. In micro panel data, n is much greater than T (n >> T); this is typical of household or household-member survey data. In macro panel data, by contrast, n is smaller than T (n < T); OECD country-level panel data accumulated over a long period is an example. Macro panel data carries a lot of information for analyzing how units change over time, whereas micro panels are useful for analyses that account for unit heterogeneity, because they hold a lot of information about units at a given point in time.

A statistical package that handles panel data first has to be told that the given data is panel data. In Stata this is done with the tsset or xtset command: the panel group variable and then the time variable are listed after the command, for example tsset id year (id and year are placeholder variable names). tsset only runs if the panel group variable is numeric; it cannot be a string variable. xtset gives the same result and has the added advantage that it can be used with multilevel data as well as panel data.

R

Use the functions in the plm package. pdata.frame plays the role of Stata's tsset: its index argument names the panel group variable and the time variable, for example pdata.frame(df, index = c("id", "year")) with placeholder names. The pdim function then reports the structure of the panel data and the sample size.

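Once the panel structure is declared, basic descriptive statistics can be computed. The variance of x_it can be calculated in three ways: overall, between, and within. For a time-invariant variable the within variance is always zero, because x_it equals its group mean in every period and so every deviation from the group mean is zero.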
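The three variances can be written out explicitly. The following is a sketch for a balanced panel with n groups and T periods; the symbols s_O, s_B and s_W and the choice of denominators are assumptions that follow one common convention, and other texts may scale them differently.

```latex
\bar{x}_i = \frac{1}{T}\sum_{t=1}^{T} x_{it},
\qquad
\bar{x} = \frac{1}{nT}\sum_{i=1}^{n}\sum_{t=1}^{T} x_{it}

s_O^2 = \frac{1}{nT-1}\sum_{i=1}^{n}\sum_{t=1}^{T}\left(x_{it}-\bar{x}\right)^2
\quad \text{(overall)}

s_B^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(\bar{x}_i-\bar{x}\right)^2
\quad \text{(between)}

s_W^2 = \frac{1}{nT-1}\sum_{i=1}^{n}\sum_{t=1}^{T}\left(x_{it}-\bar{x}_i\right)^2
\quad \text{(within)}
```

If x_it does not vary over t, then x_it equals the group mean for every t and each term of the within sum is zero.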

The between variance captures differences across panel groups, and the within variance captures differences across time points within a panel group. If the between variance is larger than the within variance, x varies more across panel groups than it does over time within a group.
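The comparison of between and within variation can be tried on toy data. The sketch below is plain Python, not the R or Stata output; the numbers and the function name variance_components are invented, and the denominators (observations minus one, groups minus one) are one common convention rather than the only one.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical balanced panel: (household id, year, income).
income = [
    (1, 2019, 10.0), (1, 2020, 12.0), (1, 2021, 14.0),
    (2, 2019, 20.0), (2, 2020, 22.0), (2, 2021, 24.0),
]
# Hypothetical time-invariant variable: (household id, year, birth cohort code).
cohort = [
    (1, 2019, 3.0), (1, 2020, 3.0),
    (2, 2019, 5.0), (2, 2020, 5.0),
]

def variance_components(rows):
    """Return (overall, between, within) variances of the third column."""
    by_group = defaultdict(list)
    for group, _, x in rows:
        by_group[group].append(x)
    values = [x for _, _, x in rows]
    grand_mean = mean(values)
    group_means = {g: mean(xs) for g, xs in by_group.items()}
    n_obs, n_groups = len(values), len(by_group)
    # Overall: deviations of every observation from the grand mean.
    overall = sum((x - grand_mean) ** 2 for x in values) / (n_obs - 1)
    # Between: deviations of group means from the grand mean.
    between = sum((m - grand_mean) ** 2 for m in group_means.values()) / (n_groups - 1)
    # Within: deviations of every observation from its own group mean.
    within = sum((x - group_means[g]) ** 2 for g, _, x in rows) / (n_obs - 1)
    return overall, between, within

print(variance_components(income))
print(variance_components(cohort))
```

For the income example the between variance is much larger than the within variance, so most of the variation comes from differences between households. For the time-invariant variable the within variance is exactly zero.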

The dependent variable observed for a particular household may be positively correlated across time points. This is called intra-class correlation, or serial correlation within a panel group. It is measured by the correlation coefficient between the observations at times t and s in group i.
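As a sketch, the correlation between two time points t and s in group i can be written with the usual definition of a correlation coefficient. The second line adds an assumption that is not stated in the text above: an error-components structure in which y_it is the sum of a constant, a group effect u_i with variance sigma_u^2, and an idiosyncratic error e_it with variance sigma_e^2, with u_i and e_it uncorrelated and e_it uncorrelated over time.

```latex
\rho_{ts} = \operatorname{Corr}(y_{it}, y_{is})
          = \frac{\operatorname{Cov}(y_{it}, y_{is})}
                 {\sqrt{\operatorname{Var}(y_{it})\,\operatorname{Var}(y_{is})}},
\qquad t \neq s

% Under the assumed error-components structure y_{it} = \mu + u_i + e_{it}:
\rho = \frac{\sigma_u^2}{\sigma_u^2 + \sigma_e^2}
```

Under that assumption the correlation is the same for every pair of time points and is positive whenever the group effect has nonzero variance.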

The basic statistics of categorical variables are summarized by frequencies. In Stata, the table command can be used for cross-sectional data, while the xttab command produces the frequency table for panel data.
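The difference between a cross-sectional and a panel frequency table is that the panel version can count both observations and panel groups. The sketch below is plain Python with made-up data, and the function name panel_frequencies is hypothetical; it is meant to illustrate the idea, not to reproduce the exact xttab output.

```python
from collections import Counter, defaultdict

# Hypothetical (household id, year, employment status) rows.
rows = [
    (1, 2019, "employed"),   (1, 2020, "employed"),
    (2, 2019, "employed"),   (2, 2020, "unemployed"),
    (3, 2019, "unemployed"), (3, 2020, "unemployed"),
]

def panel_frequencies(rows):
    """Return (overall, between) frequency tables for a categorical variable."""
    # Overall: number of group-period observations in each category.
    overall = Counter(value for _, _, value in rows)
    # Between: number of groups that were ever observed in each category.
    ever = defaultdict(set)
    for group, _, value in rows:
        ever[value].add(group)
    between = {value: len(groups) for value, groups in ever.items()}
    return dict(overall), between

overall, between = panel_frequencies(rows)
print(overall)
print(between)
```

The between counts add up to 4, more than the 3 households, because household 2 appears in both categories.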

Because panel data is collected repeatedly over time, transition probabilities can be calculated. The transition probability p_ks = P(x_it = s | x_i,t-1 = k) is the probability that a household in state k at time t-1 moves to state s at the next period, time t.
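Transition probabilities can be estimated by counting moves between adjacent periods and dividing by the number of observations that started in each state. This is a minimal sketch: the data and the function name transition_probabilities are hypothetical, and periods are assumed to be integers one unit apart, so pairs separated by a time gap are skipped.

```python
from collections import Counter, defaultdict

# Hypothetical (household id, year, state) rows.
rows = [
    (1, 2019, "employed"),   (1, 2020, "employed"), (1, 2021, "unemployed"),
    (2, 2019, "unemployed"), (2, 2020, "employed"), (2, 2021, "employed"),
]

def transition_probabilities(rows):
    """Estimate P(state s at t | state k at t-1) from adjacent periods."""
    by_group = defaultdict(dict)
    for group, period, state in rows:
        by_group[group][period] = state
    counts = defaultdict(Counter)
    for states in by_group.values():
        for period, state in states.items():
            previous = states.get(period - 1)   # only adjacent periods count
            if previous is not None:
                counts[previous][state] += 1
    # Divide each row of counts by its row total to get probabilities.
    return {
        k: {s: c / sum(row.values()) for s, c in row.items()}
        for k, row in counts.items()
    }

probs = transition_probabilities(rows)
print(probs)
```

Each row of the resulting table sums to one, since every household that starts in state k must end up in some state at the next period.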

Because panel data has both cross-sectional and time-series characteristics, changes over time can be shown in a graph. For example, a panel line graph can show how the income of each panel group changes from year to year.

Linear Regression with Panel Data

Panel data consists of observations in which the same unit is surveyed repeatedly over time. Each observation is therefore identified by the subscript it, where i indexes the panel group and t indexes time within group i. The linear regression model built on this panel data is referred to as Equation 1.
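One way to write Equation 1 that is consistent with the assumptions listed next is sketched here. The single explanatory variable x_it and the error symbol e_it are notational assumptions made for illustration; the first line is the general form with group-specific parameters, and the second is the form that results once the parameters are assumed to be common to all groups.

```latex
y_{it} = \alpha_i + \beta_i x_{it} + e_{it},
\qquad i = 1, \dots, n, \quad t = 1, \dots, T

y_{it} = \alpha + \beta x_{it} + e_{it}
\qquad \text{(with } \alpha_i = \alpha,\ \beta_i = \beta \text{)}
```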

In the regression model of Equation 1, the main assumptions are as follows.

  1. Homogeneity of parameters across panel groups.
    alpha_i = alpha and beta_i = beta. That is, all panel groups are assumed to share the same constant term alpha and the same slope parameter beta.
  2. Exogeneity, which is required to obtain an unbiased estimator.
    E(e_it | x_it) = 0, where e_it is the error term. In other words, the explanatory variable is assumed to be uncorrelated with the error term.
  3. Homoskedasticity and no serial correlation.
    Var(e_it) = sigma^2 for every panel group i, and Cov(e_it, e_is) = 0 for t != s within panel group i.

If assumptions 1 to 3 all hold in the regression model of Equation 1, estimating by OLS on the data pooled over all i and t (pooled OLS) gives a consistent and efficient estimator. However, panel data often violates these assumptions by its nature, and various panel regression estimators can be used to address the problem.
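Pooled OLS simply ignores the panel structure and fits one regression line through all group-period observations. The sketch below does this in plain Python for a single explanatory variable; the data are made up (generated exactly from y = 1 + 2x) and the function name pooled_ols is hypothetical. In practice the same fit would come from a regression routine, such as plm with model = "pooling" in R or regress in Stata.

```python
from statistics import mean

# Hypothetical (household id, year, x, y) rows; y = 1 + 2 * x exactly.
rows = [
    (1, 2019, 1.0, 3.0), (1, 2020, 2.0, 5.0),
    (2, 2019, 3.0, 7.0), (2, 2020, 4.0, 9.0),
]

def pooled_ols(rows):
    """Fit y = alpha + beta * x by OLS, pooling all groups and periods."""
    xs = [x for _, _, x, _ in rows]   # the id and year columns are ignored
    ys = [y for _, _, _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta = s_xy / s_xx                # slope shared by all panel groups
    alpha = y_bar - beta * x_bar      # constant shared by all panel groups
    return alpha, beta

alpha, beta = pooled_ols(rows)
print(alpha, beta)
```

Because the household id never enters the calculation, any difference in intercepts or slopes between households would be absorbed into the error term, which is where violations of the assumptions above tend to show up.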

Written by Yoonseul Choi

Data Scientist, AI/DX Team, Mediplus Solution Co., Ltd. Master's degree of Statistics at Hanyang University. R / Python. Based in Seoul.
