
How To: Use the psych package for Factor Analysis and data

reduction

William Revelle
Department of Psychology
Northwestern University

November 20, 2016

Contents

1 Overview of this and related documents 3
1.1 Jump starting the psych package–a guide for the impatient . . . . . . . . . 3

2 Overview of this and related documents 5

3 Getting started 5

4 Basic data analysis 6
4.1 Data input from the clipboard . . . . . . . . . . . . . . . . . . . . . . . . . 6
4.2 Basic descriptive statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
4.3 Simple descriptive graphics . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

4.3.1 Scatter Plot Matrices . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.3.2 Correlational structure . . . . . . . . . . . . . . . . . . . . . . . . . . 8
4.3.3 Heatmap displays of correlational structure . . . . . . . . . . . . . . 10

4.4 Polychoric, tetrachoric, polyserial, and biserial correlations . . . . . . . . . . 13

5 Item and scale analysis 13
5.1 Dimension reduction through factor analysis and cluster analysis . . . . . . 13

5.1.1 Minimum Residual Factor Analysis . . . . . . . . . . . . . . . . . . . 15
5.1.2 Principal Axis Factor Analysis . . . . . . . . . . . . . . . . . . . . . 16
5.1.3 Weighted Least Squares Factor Analysis . . . . . . . . . . . . . . . . 16
5.1.4 Principal Components analysis (PCA) . . . . . . . . . . . . . . . . . 22
5.1.5 Hierarchical and bi-factor solutions . . . . . . . . . . . . . . . . . . . 22
5.1.6 Item Cluster Analysis: iclust . . . . . . . . . . . . . . . . . . . . . . 26

5.2 Confidence intervals using bootstrapping techniques . . . . . . . . . . . . . 29


5.3 Comparing factor/component/cluster solutions . . . . . . . . . . . . . . . . 29
5.4 Determining the number of dimensions to extract. . . . . . . . . . . . . . . 35

5.4.1 Very Simple Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.4.2 Parallel Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.5 Factor extension . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

6 Classical Test Theory and Reliability 40
6.1 Reliability of a single scale . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
6.2 Using omega to find the reliability of a single scale . . . . . . . . . . . . . . 46
6.3 Estimating ωh using Confirmatory Factor Analysis . . . . . . . . . . . . . . 50

6.3.1 Other estimates of reliability . . . . . . . . . . . . . . . . . . . . . . 52
6.4 Reliability and correlations of multiple scales within an inventory . . . . . . 52

6.4.1 Scoring from raw data . . . . . . . . . . . . . . . . . . . . . . . . . . 52
6.4.2 Forming scales from a correlation matrix . . . . . . . . . . . . . . . . 55

6.5 Scoring Multiple Choice Items . . . . . . . . . . . . . . . . . . . . . . . . . . 57
6.6 Item analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

7 Item Response Theory analysis 58
7.1 Factor analysis and Item Response Theory . . . . . . . . . . . . . . . . . . . 60
7.2 Speeding up analyses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
7.3 IRT based scoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

8 Multilevel modeling 68
8.1 Decomposing data into within and between level correlations using statsBy 68
8.2 Generating and displaying multilevel data . . . . . . . . . . . . . . . . . . . 70

9 Set Correlation and Multiple Regression from the correlation matrix 70

10 Simulation functions 73

11 Graphical Displays 75

12 Miscellaneous functions 77

13 Data sets 78

14 Development version and a users guide 79

15 Psychometric Theory 80

16 SessionInfo 80


1 Overview of this and related documents

To do basic and advanced personality and psychological research using R is not as complicated
as some think. This is one of a set of "How To" documents for doing various things using R (R
Core Team, 2016), particularly using the psych (Revelle, 2016) package.

The current list of How To’s includes:

1. Installing R and some useful packages

2. Using R and the psych package to find ωh and ωt.

3. Using R and the psych package for factor analysis and principal components analysis. (This
document).

4. Using the score.items function to find scale scores and scale statistics.

5. An overview (vignette) of the psych package

1.1 Jump starting the psych package–a guide for the impatient

You have installed psych (section 3) and you want to use it without reading much more.
What should you do?

1. Activate the psych package:
R code

library(psych)

2. Input your data (section 4.1). Go to your friendly text editor or data manipulation
program (e.g., Excel) and copy the data to the clipboard. Include a first line that has
the variable labels. Paste it into psych using the read.clipboard.tab command:
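A minimal sketch of this step (assuming tab-delimited data, with variable labels in the first row, has already been copied to the clipboard):

```r
library(psych)
my.data <- read.clipboard.tab()  # the first clipboard line becomes the variable names
describe(my.data)                # check that the data were read correctly
```

If no data are handy, the built-in sat.act data set used below works just as well.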

R code
> data(sat.act)
> describe(sat.act) #basic descriptive statistics

          vars   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender       1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education    2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.07 0.05
age          3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.42 0.36
ACT          4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.53 0.18
SATV         5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.33 4.27
SATQ         6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59    -0.02 4.41


4.3 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as
communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels
function are useful ways to look for strange effects involving outliers and non-linearities.
error.bars.by will show group means with 95% confidence boundaries.
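As a hedged sketch of error.bars.by using the sat.act data (the x, group argument order shown here follows the function's basic calling convention; treat the details as an illustration):

```r
library(psych)
data(sat.act)
# Mean SATV and SATQ by gender, with 95% confidence intervals on each mean
error.bars.by(sat.act[c("SATV", "SATQ")], group = sat.act$gender)
```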

4.3.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels
function, adapted from the help menu for the pairs function, produces xy scatter plots of
each pair of variables below the diagonal, shows the histogram of each variable on the
diagonal, and shows the lowess locally fit regression line as well. An ellipse around the
mean with the axis length reflecting one standard deviation of the x and y variables is also
drawn. The x axis in each scatter plot represents the column variable, the y axis the row
variable (Figure 1). When plotting many subjects, it is both faster and cleaner to set the
plot character (pch) to be '.'. (See Figure 1 for an example.)

pairs.panels will show the pairwise scatter plots of all the variables as well as
histograms, locally smoothed regressions, and the Pearson correlation. When plotting
many data points (as in the case of the sat.act data), it is possible to specify that the
plot character is a period to get a somewhat cleaner graphic.

4.3.2 Correlational structure

There are many ways to display correlations. Tabular displays are probably the most
common. The output from the cor function in core R is a rectangular matrix. lowerMat
will round this to two digits and then display it as a lower off-diagonal matrix. lowerCor
calls cor with use='pairwise', method='pearson' as default values and returns (invisibly)
the full correlation matrix while displaying the lower off-diagonal matrix.
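For example, starting from the square matrix that cor returns (a sketch using the sat.act data):

```r
library(psych)
data(sat.act)
R <- cor(sat.act, use = "pairwise")  # full, square correlation matrix from core R
lowerMat(R)                          # same values, rounded and shown as a lower triangle
r <- lowerCor(sat.act)               # displays the lower triangle, returns R invisibly
```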

> lowerCor(sat.act)

          gendr edctn   age   ACT  SATV SATQ
gender     1.00
education  0.09  1.00
age       -0.02  0.55  1.00
ACT       -0.04  0.15  0.11  1.00
SATV      -0.02  0.05 -0.04  0.56  1.00
SATQ      -0.17  0.03 -0.03  0.59  0.64 1.00

When comparing results from two different groups, it is convenient to display them as one
matrix, with the results from one group below the diagonal, and the other group above the
diagonal. Use lowerUpper to do this:


> png('pairspanels.png')
> pairs.panels(sat.act, pch='.')
> dev.off()
null device

1

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis
in each scatter plot represents the column variable, the y axis the row variable. Note the
extreme outlier for the ACT. The plot character was set to a period (pch=’.’) in order to
make a cleaner graph.


> female <- subset(sat.act, sat.act$gender==2)
> male <- subset(sat.act, sat.act$gender==1)
> lower <- lowerCor(male[-1])

          edctn   age   ACT  SATV SATQ
education  1.00
age        0.61  1.00
ACT        0.16  0.15  1.00
SATV       0.02 -0.06  0.61  1.00
SATQ       0.08  0.04  0.60  0.68 1.00

> upper <- lowerCor(female[-1])

          edctn   age   ACT  SATV SATQ
education  1.00
age        0.52  1.00
ACT        0.16  0.08  1.00
SATV       0.07 -0.03  0.53  1.00
SATQ       0.03 -0.09  0.58  0.63 1.00

> both <- lowerUpper(lower, upper)
> round(both, 2)

          education   age   ACT  SATV  SATQ
education        NA  0.52  0.16  0.07  0.03
age            0.61    NA  0.08 -0.03 -0.09
ACT            0.16  0.15    NA  0.53  0.58
SATV           0.02 -0.06  0.61    NA  0.63
SATQ           0.08  0.04  0.60  0.68    NA

It is also possible to compare two matrices by taking their differences and displaying one
(below the diagonal) and the difference of the second from the first above the diagonal:

> diffs <- lowerUpper(lower, upper, diff=TRUE)
> round(diffs, 2)

          education   age   ACT  SATV  SATQ
education        NA  0.09  0.00 -0.05  0.05
age            0.61    NA  0.07 -0.03  0.13
ACT            0.16  0.15    NA  0.08  0.02
SATV           0.02 -0.06  0.61    NA  0.05
SATQ           0.08  0.04  0.60  0.68    NA

4.3.3 Heatmap displays of correlational structure

Perhaps a better way to see the structure in a correlation matrix is to display a heat map
of the correlations. This is just a matrix color coded to represent the magnitude of the
correlation. This is useful when considering the number of factors in a data set. Consider
the Thurstone data set which has a clear 3 factor solution (Figure 2) or a simulated data
set of 24 variables with a circumplex structure (Figure 3). The color coding represents a
“heat map” of the correlations, with darker shades of red representing stronger negative
and darker shades of blue stronger positive correlations. As an option, the value of the
correlation can be shown.


> png('corplot.png')
> cor.plot(Thurstone, numbers=TRUE, main="9 cognitive variables from Thurstone")
> dev.off()
null device

1

Figure 2: The structure of correlation matrix can be seen more clearly if the variables are
grouped by factor and then the correlations are shown by color. By using the ’numbers’
option, the values are displayed as well.


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ, main='24 variables in a circumplex')
> dev.off()
null device

1

Figure 3: Using the cor.plot function to show the correlations in a circumplex. Correlations
are highest near the diagonal, diminish to zero further from the diagonal, and then increase
again towards the corners of the matrix. Circumplex structures are common in the study
of affect.


4.4 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the
data, e.g., ability items, are thought to represent an underlying continuous although latent
variable, φ will underestimate the value of the Pearson correlation applied to these latent
variables. One solution to this problem is to use the tetrachoric correlation, which is based
upon the assumption of a bivariate normal distribution that has been cut at certain points.
The draw.tetra function demonstrates the process. A simple generalization of this to the
case of multiple cuts is the polychoric correlation.
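A brief sketch: draw.tetra takes no required arguments, and tetrachoric can be applied to a matrix of dichotomous (0/1) items. The simulated items below (via sim.irt) are just a stand-in for real ability data:

```r
library(psych)
draw.tetra()   # illustrate the cut bivariate normal model behind the tetrachoric

set.seed(42)
items <- sim.irt(nvar = 8, n = 500)$items  # simulate dichotomous item responses
round(tetrachoric(items)$rho, 2)           # tetrachoric correlations ($tau holds the thresholds)
```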

Other estimated correlations based upon the assumption of bivariate normality with cut
points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor
function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial,
and polyserial correlations.

A correlation matrix based upon a number of tetrachoric or polychoric correlations will
sometimes not be positive semi-definite. This will also happen if the correlation
matrix is formed by using pair-wise deletion of cases. The cor.smooth function will adjust
the smallest eigen values of the correlation matrix to make them positive, rescale all of
them to sum to the number of variables, and produce a "smoothed" correlation matrix. An
example of this problem is the burt data set, which probably had a typo in the original
correlation matrix. Smoothing the matrix corrects this problem.
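A sketch with the burt data set distributed with psych (in more recent versions it may live in the companion psychTools package); its smallest eigen value is slightly negative:

```r
library(psych)
data(burt)
round(eigen(burt)$values, 3)   # note the negative eigen value at the end
burt.s <- cor.smooth(burt)     # adjusted to be positive (semi-)definite
round(eigen(burt.s)$values, 3) # all eigen values now non-negative
```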

5 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of
scales and for finding various estimates of scale reliability. These may be considered as
problems of dimension reduction (e.g., factor analysis, cluster analysis, principal compo-
nents analysis) and of forming and estimating the reliability of the resulting composite
scales.

5.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum
commonly attributed to William of Ockham to not multiply entities beyond necessity1. The
goal for parsimony is seen in psychometrics as an attempt either to describe (components)
or to explain (factors) the relationships between many observed variables in terms of a
more limited set of components or latent factors.

1Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918),
Ockham's razor remains a fundamental principle of science.

The typical data matrix represents multiple items or scales usually thought to reflect fewer
underlying constructs2. At the simplest level, a set of items can be thought to represent
a random sample from one underlying domain or perhaps a small set of domains. The
question for the psychometrician is how many domains are represented and how well does
each item represent the domains. Solutions to this problem are examples of factor analysis
(FA), principal components analysis (PCA), and cluster analysis (CA). All of these pro-
cedures aim to reduce the complexity of the observed data. In the case of FA, the goal is
to identify fewer underlying constructs to explain the observed data. In the case of PCA,
the goal can be mere data reduction, but the interpretation of components is frequently
done in terms similar to those used when describing the latent variables estimated by FA.
Cluster analytic techniques, although usually used to partition the subject space rather
than the variable space, can also be used to group variables to reduce the complexity of
the data by forming fewer and more homogeneous sets of tests or items.

At the data level the data reduction problem may be solved as a Singular Value Decom-
position of the original matrix, although the more typical solution is to find either the
principal components or factors of the covariance or correlation matrices. Given the pat-
tern of regression weights from the variables to the components or from the factors to the
variables, it is then possible to find (for components) individual component or cluster scores
or estimate (for factors) factor scores.

Several of the functions in psych address the problem of data reduction.

fa incorporates five alternative algorithms: minres factor analysis, principal axis factor
analysis, weighted least squares factor analysis, generalized least squares factor anal-
ysis and maximum likelihood factor analysis. That is, it includes the functionality of
three other functions that will be eventually phased out.

principal Principal Components Analysis reports the largest n eigen vectors rescaled by
the square root of their eigen values.

factor.congruence The congruence between two factors is the cosine of the angle between
them. This is just the cross product of the loadings divided by the square root of the
product of the sums of the squared loadings; unlike a correlation, the means are not
subtracted before taking the products. factor.congruence will find the cosines
between two (or more) sets of factor loadings.

vss Very Simple Structure (Revelle and Rocklin, 1979) applies a goodness of fit test to
determine the optimal number of factors to extract. It can be thought of as a
quasi-confirmatory model, in that it fits the very simple structure (all except the biggest c
loadings per item are set to zero, where c is the level of complexity of the item) of a
factor pattern matrix to the original correlation matrix. For items where the model is
usually of complexity one, this is equivalent to making all except the largest loading
for each item 0. This is typically the solution that the user wants to interpret. The
analysis includes the MAP criterion of Velicer (1976) and a χ2 estimate.

2Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more
factors than observed variables, but are willing to estimate the major underlying factors.

fa.parallel The parallel factors technique compares the observed eigen values of a cor-
relation matrix with those from random data.

fa.plot will plot the loadings from a factor, principal components, or cluster analysis
(just a call to plot will suffice). If there are more than two factors, then a SPLOM
of the loadings is shown.

nfactors A number of different tests for the number of factors problem are run.

fa.diagram replaces fa.graph and will draw a path diagram representing the factor struc-
ture. It does not require Rgraphviz and thus is probably preferred.

fa.graph requires Rgraphviz and will draw a graphic representation of the factor struc-
ture. If factors are correlated, this will be represented as well.

iclust is meant to do item cluster analysis using a hierarchical clustering algorithm
specifically asking questions about the reliability of the clusters (Revelle, 1979). Clus-
ters are formed until either coefficient α (Cronbach, 1951) or β (Revelle, 1979) fails to
increase.
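The functions above can be sketched together on the built-in Thurstone correlation matrix (n.obs = 213 is the sample size usually cited for these data; treat the exact value as an assumption here):

```r
library(psych)
f3 <- fa(Thurstone, nfactors = 3, n.obs = 213)  # minres factor analysis (the default fm)
vss(Thurstone, n.obs = 213)                     # VSS and MAP criteria for the number of factors
fa.parallel(Thurstone, n.obs = 213)             # compare observed eigen values to random data
fa.diagram(f3)                                  # path diagram of the three factor solution
```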

5.1.1 Minimum Residual Factor Analysis

The factor model is an approximation of a correlation matrix by a matrix of lower rank.
That is, can the correlation matrix, nRn, be approximated by the product of a factor matrix,
nFk, and its transpose plus a diagonal matrix of uniquenesses?

R = FF' + U^2 (1)

The maximum likelihood solution to this equation is found by factanal in the stats
package. Five alternatives are provided in psych; all of them are included in the fa function
and are called by specifying the factor method (e.g., fm="minres", fm="pa", fm="wls",
fm="gls" and fm="ml"). In the discussion of the other algorithms, the calls shown are to
the fa function specifying the appropriate method.
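Equation 1 can be checked directly from an unrotated solution (a sketch; the off-diagonal residuals will be small but not exactly zero):

```r
library(psych)
f <- fa(Thurstone, nfactors = 3, n.obs = 213, rotate = "none")
model <- f$loadings %*% t(f$loadings) + diag(f$uniquenesses)  # FF' + U^2
round(Thurstone - model, 2)   # residuals; also available as f$residual
```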

factor.minres attempts to minimize the off diagonal residual correlation matrix by
adjusting the eigen values of the original correlation matrix. This is similar to what is done
in factanal, but uses an ordinary least squares instead of a maximum likelihood fit
function. The solutions tend to be more similar to the MLE solutions than are the factor.pa
solutions. min.res is the default for the fa function.

A classic data set, collected by Thurstone and Thurstone (1941) and then reanalyzed by
Bechtoldt (1961) and discussed by McDonald (1999), is a set of 9 cognitive variables with
a clear bi-factor structure (Holzinger and Swineford, 1937). The minimum residual solution
was transformed into an oblique solution using the default option on rotate, which uses
an oblimin transformation (Table 1). Alternative rotations and transformations include
"none", "varimax", "quartimax", "bentlerT", and "geominT" (all of which are orthogonal
rotations), as well as "promax", "oblimin", "simplimax", "bentlerQ", "geominQ", and
"cluster", which are possible oblique transformations of the solution. The default is to do
an oblimin transformation, although prior versions defaulted to varimax. The measures of
factor adequacy reflect the multiple correlations of the factors with the best fitting linear
regression estimates of the factor scores (Grice, 2001).

5.1.2 Principal Axis Factor Analysis

An alternative, least squares algorithm, factor.pa, (incorporated into fa as an option (fm
= “pa”) does a Principal Axis factor analysis by iteratively doing an eigen value decompo-
sition of the correlation matrix with the diagonal replaced by the values estimated by the
factors of the previous iteration. This OLS solution is not as sensitive to improper matri-
ces as is the maximum likelihood method, and will sometimes produce more interpretable
results. It seems as if the SAS example for PA uses only one iteration. Setting the max.iter
parameter to 1 produces the SAS solution.
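As a sketch of the iteration point just made:

```r
library(psych)
f.pa  <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "pa")                # iterated principal axis
f.pa1 <- fa(Thurstone, nfactors = 3, n.obs = 213, fm = "pa", max.iter = 1)  # one iteration, SAS-style
factor.congruence(f.pa, f.pa1)  # how similar are the two solutions?
```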

The solutions from fa, factor.minres, and factor.pa, as well as from the principal
function, can be rotated or transformed with a number of options. Some of these call
the GPArotation package. Orthogonal rotations are varimax and quartimax. Oblique
transformations include oblimin and quartimin, as well as two targeted rotation functions,
Promax and target.rot. The latter of these will transform a loadings matrix towards
an arbitrary target matrix. The default is to transform towards an independent cluster
solution.
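A sketch of a targeted transformation (by default target.rot builds an independent cluster target from the largest loadings):

```r
library(psych)
f3v <- fa(Thurstone, nfactors = 3, n.obs = 213, rotate = "varimax")
target.rot(f3v)  # transform the varimax loadings toward an independent cluster solution
```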

Using the Thurstone data set, three factors were requested and then transformed into an
independent clusters solution using target.rot.
