An overview of the psych package

William Revelle
Department of Psychology
Northwestern University

August 11, 2011

Contents

1 Overview of this and related documents

2 Getting started

3 Basic data analysis
  3.1 Data input and descriptive statistics
    3.1.1 Basic data cleaning using scrub
  3.2 Simple descriptive graphics
    3.2.1 Scatter Plot Matrices
    3.2.2 Means and error bars
    3.2.3 Back to back histograms
    3.2.4 Correlational structure
  3.3 Testing correlations
  3.4 Polychoric, tetrachoric, polyserial, and biserial correlations

4 Item and scale analysis
  4.1 Dimension reduction through factor analysis and cluster analysis
    4.1.1 Minimum Residual Factor Analysis
    4.1.2 Principal Axis Factor Analysis
    4.1.3 Weighted Least Squares Factor Analysis
    4.1.4 Principal Components analysis
    4.1.5 Hierarchical and bi-factor solutions
    4.1.6 Item Cluster Analysis: iclust
  4.2 Confidence intervals using bootstrapping techniques
  4.3 Comparing factor/component/cluster solutions
  4.4 Determining the number of dimensions to extract
    4.4.1 Very Simple Structure
    4.4.2 Parallel Analysis
  4.5 Factor extension
  4.6 Reliability analysis
    4.6.1 Reliability of a single scale
    4.6.2 Using omega to find the reliability of a single scale
    4.6.3 Estimating ωh using Confirmatory Factor Analysis
    4.6.4 Other estimates of reliability
    4.6.5 Reliability of multiple scales within an inventory
  4.7 Item analysis

5 Item Response Theory analysis
  5.1 Factor analysis and Item Response Theory
  5.2 Speeding up analyses

6 Set Correlation and Multiple Regression from the correlation matrix

7 Simulation functions

8 Graphical Displays

9 Data sets

10 Development version and a users guide

11 Psychometric Theory

12 SessionInfo
1 Overview of this and related documents

The psych package (Revelle, 2011) has been developed at Northwestern University to in-
clude functions most useful for personality, psychometric, and psychological research. Some
of the functions (e.g., read.clipboard, describe, pairs.panels, error.bars) are useful
for basic data entry and descriptive analyses.

Psychometric applications include routines for five types of factor analysis (fa does
minimum residual, principal axis, weighted least squares, generalized least squares and
maximum likelihood factor analysis). Determining the number of factors or components to
extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss),
Minimum Average Partial correlation (Velicer, 1976) (MAP) or parallel analysis
(fa.parallel) criteria. Item Response Theory models for dichotomous or polytomous items
may be found by factoring tetrachoric or polychoric correlation matrices and expressing
the resulting parameters in terms of location and discrimination (irt.fa). Bifactor and
hierarchical factor structures may be estimated by using Schmid Leiman transformations
(Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a
bifactor solution (Holzinger and Swineford, 1937). Scale construction can be done using
the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and
to calculate reliability coefficients α (Cronbach, 1951) (alpha, score.items,
score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust) and
McDonald’s ωh and ωt (McDonald, 1999) (omega). Guttman’s six estimates of internal
consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and
Zinbarg, 2009), are in the guttman function, and the six measures of Intraclass correlation
coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels, factor,
cluster, and structural diagrams using fa.diagram, iclust.diagram, structure.diagram,
as well as item response characteristics and item and test information characteristic curves
using plot.irt.

This vignette is meant to give an overview of the psych package. That is, it is meant
to give a summary of the main functions in the psych package with examples of how
they are used for data description, dimension reduction, and scale construction. The
extended user manual at psych_manual.pdf includes examples of graphic output and
more extensive demonstrations than are found in the help menus. (Also available at
http://personality-project.org/r/psych_manual.pdf). The vignette, psych for sem,
at psych_for_sem.pdf, discusses how to use psych as a front end to the sem package of
John Fox (Fox, 2009). (The vignette is also available at
http://personality-project.org/r/book/psych_for_sem.pdf.)

For a step by step tutorial in the use of the psych package and the base functions in
R for basic personality research, see the guide for using R for personality research at
http://personalitytheory.org/r/r.short.html. For an introduction to psychometric
theory with applications in R, see the draft chapters at
http://personality-project.org/r/book.

2 Getting started

Some of the functions described in this overview require other packages. Particularly
useful for rotating the results of factor analyses (from e.g., fa, factor.minres, factor.pa,
factor.wls, or principal) or hierarchical factor models using omega or schmid, is the
GPArotation package. For some analyses of correlations of dichotomous data, the polycor
package is suggested in order to use either the poly.mat or phi2poly functions. However,
these functions have been supplemented with tetrachoric and polychoric which do not
require the polycor package. These and other useful packages may be installed by first
installing and then using the task views (ctv) package to install the "Psychometrics" task
view, but doing it this way is not necessary.

install.packages("ctv")

library(ctv)

install.views("Psychometrics")

Because of the difficulty of installing the package Rgraphviz, alternative graphics have been
developed and are available as diagram functions. If Rgraphviz is available, some functions
will take advantage of it.

3 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive
statistics.

Remember, to run any of the psych functions, it is necessary to make the package active
by using the library command:

> library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving
a “.First” function:

.First <- function(x) {library(psych)}

3.1 Data input and descriptive statistics

There are of course many ways to enter data into R. Reading from a local file using
read.table is perhaps the most preferred. However, many users will enter their data in a
text editor or spreadsheet program and then want to copy and paste into R. This may be
done by using read.table and specifying the input file as "clipboard" (PCs) or
"pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions are perhaps
more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited.

read.clipboard.lower for reading input of a lower triangular matrix with or without a
diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the
command

> my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some values
(e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were
just empty cells, then the data should be read in as tab delimited or by using the
read.clipboard.tab function.

> my.data <- read.clipboard(sep="\t") #define the tab option, or
> my.tab.data <- read.clipboard.tab() #just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format),
copy to the clipboard and then specify the width of each field (in the example below, the
first variable is 5 columns, the second is 2 columns, the next 5 are 1 column each, and the
last 4 are 3 columns each).

> my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))

Once the data are read in, then describe or describe.by will provide basic descriptive
statistics arranged in a data frame format. Consider the data set sat.act which
includes data from 700 web based participants on 3 demographic variables and 3 ability
measures.

describe reports means, standard deviations, medians, min, max, range, skew, kurtosis
and standard errors for integer or real data. Non-numeric data will produce an error.

describe.by reports descriptive statistics broken down by some categorizing variable
(e.g., gender, age, etc.)

> library(psych)

> data(sat.act)
> describe(sat.act) #basic descriptive statistics

          var   n   mean     sd median trimmed    mad min max range  skew kurtosis   se
gender      1 700   1.65   0.48      2    1.68   0.00   1   2     1 -0.61    -1.62 0.02
education   2 700   3.16   1.43      3    3.31   1.48   0   5     5 -0.68    -0.06 0.05
age         3 700  25.59   9.50     22   23.86   5.93  13  65    52  1.64     2.47 0.36
ACT         4 700  28.55   4.82     29   28.84   4.45   3  36    33 -0.66     0.56 0.18
SATV        5 700 612.23 112.90    620  619.45 118.61 200 800   600 -0.64     0.35 4.27
SATQ        6 687 610.22 115.64    620  617.25 118.61 200 800   600 -0.59     0.00 4.41
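Most of the columns in this table can be approximated with base R alone. The following is a minimal sketch rather than the psych implementation: the helper name describe_sketch is hypothetical, the 10% trimming level is an assumption, and skew and kurtosis are left out.

```r
# Hypothetical helper: a base R approximation of a few describe() columns.
describe_sketch <- function(x, trim = 0.1) {
  x <- x[!is.na(x)]                 # n counts only the valid cases
  n <- length(x)
  c(n       = n,
    mean    = mean(x),
    sd      = sd(x),
    median  = median(x),
    trimmed = mean(x, trim = trim), # trimmed mean (trim level is an assumption)
    mad     = mad(x),               # median absolute deviation, scaled by 1.4826
    min     = min(x),
    max     = max(x),
    range   = max(x) - min(x),
    se      = sd(x) / sqrt(n))      # standard error of the mean
}

# Made-up values with one missing observation
x <- c(2, 4, 4, 5, 7, 9, NA)
res <- describe_sketch(x)
round(res, 2)
```

Missing values are dropped before anything is computed, which is why n for SATQ (687) is smaller than for the other variables.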

These data may then be analyzed by groups defined in a logical statement or by some other
variable. E.g., break down the descriptive data for males or females. These descriptive
data can also be seen graphically using the error.bars.by function (Figure 3). By setting
skew=FALSE and ranges=FALSE, the output is limited to the most basic statistics.

> describe.by(sat.act,sat.act$gender,skew=FALSE,ranges=FALSE) #basic descriptive statistics by a grouping variable.

group: 1
          var   n   mean     sd   se
gender      1 247   1.00   0.00 0.00
education   2 247   3.00   1.54 0.10
age         3 247  25.86   9.74 0.62
ACT         4 247  28.79   5.06 0.32
SATV        5 247 615.11 114.16 7.26
SATQ        6 245 635.87 116.02 7.41
------------------------------------------------------------
group: 2
          var   n   mean     sd   se
gender      1 453   2.00   0.00 0.00
education   2 453   3.26   1.35 0.06
age         3 453  25.45   9.37 0.44
ACT         4 453  28.42   4.69 0.22
SATV        5 453 610.66 112.31 5.28
SATQ        6 442 596.00 113.07 5.38
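The same kind of breakdown can be sketched in base R by splitting a variable on the grouping variable and summarizing each piece. This is only an illustration of the idea, not how describe.by is written; the data frame toy and its values are made up.

```r
# Toy data standing in for sat.act (values are made up for illustration).
toy <- data.frame(gender = c(1, 1, 2, 2, 2),
                  score  = c(600, 640, 580, 610, NA))

# n, mean, sd and se of score within each level of gender
by_group <- do.call(rbind, lapply(split(toy$score, toy$gender), function(v) {
  v <- v[!is.na(v)]                        # use only the valid cases
  data.frame(n = length(v), mean = mean(v), sd = sd(v),
             se = sd(v) / sqrt(length(v)))
}))
by_group
```

The row names of by_group are the levels of the grouping variable, so by_group["2", ] holds the summary for group 2.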

The output from the describe.by function can be forced into a matrix form for easy analysis
by other programs. In addition, describe.by can group by several grouping variables at the
same time.

> sa.mat <- describe.by(sat.act,list(sat.act$gender,sat.act$education),
+ skew=FALSE,ranges=FALSE,mat=TRUE)
> head(sa.mat)

        item group1 group2 var  n mean sd se
gender1    1      1      0   1 27    1  0  0
gender2    2      2      0   1 30    2  0  0
gender3    3      1      1   1 20    1  0  0
gender4    4      2      1   1 25    2  0  0
gender5    5      1      2   1 23    1  0  0
gender6    6      2      2   1 21    2  0  0

> tail(sa.mat)

       item group1 group2 var   n     mean        sd        se
SATQ7    67      1      3   6  79 642.5949 118.28329 13.307910
SATQ8    68      2      3   6 190 590.8895 114.46472  8.304144
SATQ9    69      1      4   6  51 635.9020 104.12190 14.579982
SATQ10   70      2      4   6  86 597.5930 106.24393 11.456578
SATQ11   71      1      5   6  46 657.8261  89.60811 13.211995
SATQ12   72      2      5   6  93 606.7204 105.55108 10.945137

3.1.1 Basic data cleaning using scrub

If, after describing the data it is apparent that there were data entry errors that need to
be globally replaced with NA, or only certain ranges of data will be analyzed, the data can
be “cleaned” using the scrub function.

Consider a data set of 12 rows and 10 columns holding the values 1 to 120 (which does
not need to be cleaned, but will be used as an example). All values of columns 3 – 5 that
are less than 30, 40, or 50 respectively, or greater than 70 in any column will be replaced
with NA. In addition, any value exactly equal to 45 will be set to NA. (max and isvalue
are set to one value here, but they could be a different value for every column).

> x <- matrix(1:120,ncol=10,byrow=TRUE)
> new.x <- scrub(x,3:5,min=c(30,40,50),max=70,isvalue=45)
> new.x

      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]    1    2   NA   NA   NA    6    7    8    9    10
 [2,]   11   12   NA   NA   NA   16   17   18   19    20
 [3,]   21   22   NA   NA   NA   26   27   28   29    30
 [4,]   31   32   33   34   35   36   37   38   39    40
 [5,]   41   42   43   44   NA   46   47   48   49    50
 [6,]   51   52   53   54   55   56   57   58   59    60
 [7,]   61   62   63   64   65   66   67   68   69    70
 [8,]   71   72   NA   NA   NA   76   77   78   79    80
 [9,]   81   82   NA   NA   NA   86   87   88   89    90
[10,]   91   92   NA   NA   NA   96   97   98   99   100
[11,]  101  102   NA   NA   NA  106  107  108  109   110
[12,]  111  112   NA   NA   NA  116  117  118  119   120

Note that the number of subjects for those columns has decreased, and the minimums have
gone up but the maximums down. Data cleaning and examination for outliers should be a
routine part of any data analysis.

3.2 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as
communicating important results. Scatter Plot Matrices (SPLOMS) using the pairs.panels
function are useful ways to look for strange effects involving outliers and non-linearities.
error.bars.by will show group means with 95% confidence boundaries.

3.2.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMS) are very useful for describing the data. The pairs.panels
function, adapted from the help menu for the pairs function produces xy scatter plots of
each pair of variables below the diagonal, shows the histogram of each variable on the
diagonal, and shows the lowess locally fit regression line as well. An ellipse around the
mean with the axis length reflecting one standard deviation of the first and second principal
components is also drawn. The x axis in each scatter plot represents the column variable,
the y axis the row variable (Figure 1).

pairs.panels will show the pairwise scatter plots of all the variables as well as his-
tograms, locally smoothed regressions, and the Pearson correlation.

Another example of pairs.panels is to show differences between experimental groups.
Consider the data in the affect data set. The scores reflect post test scores on positive
and negative affect and energetic and tense arousal. The colors show the results for four
movie conditions: a depressing movie, a frightening movie, a neutral movie, and a comedy.

3.2.2 Means and error bars

Additional descriptive graphics include the ability to draw error bars on sets of data, as
well as to draw error bars in both the x and y directions for paired data. These are the
functions

error.bars shows the 95% confidence intervals for each variable in a data frame or matrix.

error.bars.by does the same, but grouping the data by some condition.

error.crosses draws the confidence intervals for an x set and a y set of the same size.

The use of the error.bars.by function allows for graphic comparisons of different groups
(see Figure 3). Five personality measures are shown as a function of high versus low scores
on a “lie” scale. People with higher lie scores tend to report being more agreeable, consci-
entious and less neurotic than people with lower lie scores. The error bars are based upon
normal theory and thus are symmetric rather than reflecting any skewing in the data.
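To see why normal-theory error bars are symmetric, consider how such an interval is built: the mean plus or minus a fixed multiple of the standard error. The sketch below is an illustration under stated assumptions, not the code of error.bars; the helper ci_normal is hypothetical, and it assumes a z multiplier (a t multiplier would behave the same way with respect to symmetry).

```r
# Hypothetical helper: normal-theory confidence interval, mean +/- z * se.
ci_normal <- function(x, level = 0.95) {
  x  <- x[!is.na(x)]
  se <- sd(x) / sqrt(length(x))
  z  <- qnorm(1 - (1 - level) / 2)   # about 1.96 for a 95% interval
  c(lower = mean(x) - z * se, mean = mean(x), upper = mean(x) + z * se)
}

set.seed(42)
skewed <- rexp(200)   # a clearly skewed variable
ci <- ci_normal(skewed)
ci
```

Even though the simulated variable is strongly skewed, the upper and lower limits are equally far from the mean, because the interval depends only on the mean and the standard error.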

> png('pairspanels.png')
> pairs.panels(sat.act)
> dev.off()

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis
in each scatter plot represents the column variable, the y axis the row variable. Note the
extreme outlier for the ACT.

> png('affect.png')
> pairs.panels(flat[15:18],bg=c("red","black","white","blue")[maps$Film],pch=21,main="Affect varies by movies (study 'flat')")
> dev.off()

Figure 2: Using the pairs.panels function to graphically show relationships. The x axis in
each scatter plot represents the column variable, the y axis the row variable. The coloring
represent four different movie conditions.

> data(epi.bfi)
> error.bars.by(epi.bfi[,6:10],epi.bfi$epilie<4)

[Plot: "0.95% confidence limits"; x axis: Independent Variable (bfagree, bfcon, bfext, bfneur, bfopen); y axis: Dependent Variable]

Figure 3: Using the error.bars.by function shows that self reported personality scales on
the Big Five Inventory vary as a function of the Lie scale on the EPI.

Although not recommended, it is possible to use the error.bars function to draw bar
graphs with associated error bars. This kind of "dynamite plot" (Figure 4) can be very
misleading in that the scale is arbitrary. See a discussion of the problems in presenting
data this way at http://emdbolker.wikidot.com/blog:dynamite.

> error.bars.by(sat.act[5:6],sat.act$gender,bars=TRUE,labels=c("Male","Female"),ylab="SAT score",xlab="")

[Plot: "0.95% confidence limits"; bars for Male and Female; y axis: SAT score, 200 to 800]

Figure 4: A “Dynamite plot” of SAT scores as a function of gender is one way of misleading
the reader. By using a bar graph, the range of scores is ignored.

3.2.3 Back to back histograms

The bi.bars function summarizes the characteristics of two groups (e.g., males and females)
on a second variable (e.g., age) by drawing back to back histograms (see Figure 5).

> data(bfi)
> with(bfi,{bi.bars(age,gender,ylab="Age",main="Age by males and females")})

[Plot: "Age by males and females"; x axis: count, −100 to 100; y axis: Age, 0 to 100]

Figure 5: A bar plot of the age distribution for males and females shows the use of bi.bars.
The data are males and females from 2800 cases collected using the SAPA procedure and
are available as part of the bfi data set.

3.2.4 Correlational structure

It is also possible to see the structure in a correlation matrix by forming a matrix shaded
to represent the magnitude of the correlation. This is useful when considering the number
of factors in a data set. Consider the Thurstone data set which has a clear 3 factor
solution (Figure 6) or a simulated data set of 24 variables with a circumplex structure
(Figure 7).

Yet another way to show structure is to use “spider” plots. Particularly if variables are
ordered in some meaningful way (e.g., in a circumplex), a spider plot will show this structure
easily. This is just a plot of the magnitude of the correlation as a radial line, with length
ranging from 0 (for a correlation of -1) to 1 (for a correlation of 1).

> png('corplot.png')
> cor.plot(Thurstone,main="9 cognitive variables from Thurstone")
> dev.off()

Figure 6: The structure of a correlation matrix can be seen more clearly if the variables are
grouped by factor and then the correlations are shown by color.

> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ,main='24 variables in a circumplex')
> dev.off()

Figure 7: Using the cor.plot function to show the correlations in a circumplex. Correlations
are highest near the diagonal, diminish to zero further from the diagonal, and then increase
again towards the corners of the matrix. Circumplex structures are common in the study
of affect.

> png('spider.png')
> op<- par(mfrow=c(2,2))
> spider(y=c(1,6,12,18),x=1:24,data=r.circ,fill=TRUE,main="Spider plot of 24 circumplex variables")
> op <- par(mfrow=c(1,1))
> dev.off()

Figure 8: A spider plot can show circumplex structure very clearly. Circumplex structures
are common in the study of affect.

3.3 Testing correlations

Correlations are wonderful descriptive statistics of the data but some people like to test
whether these correlations differ from zero, or differ from each other. The cor.test
function (in the stats package) will test the significance of a single correlation, and the
rcorr function in the Hmisc package will do this for many correlations. In the psych
package, the corr.test function reports the correlation (Pearson or Spearman) between all
variables in either one or two data frames or matrices, as well as the number of observations
for each case, and the (two-tailed) probability for each correlation. These probability values
have not been corrected for multiple comparisons and so should be taken with a great deal
of salt.

> corr.test(sat.act)

Call:corr.test(x = sat.act)
Correlation matrix
          gender education   age   ACT  SATV  SATQ
gender      1.00      0.09 -0.02 -0.04 -0.02 -0.17
education   0.09      1.00  0.55  0.15  0.05  0.03
age        -0.02      0.55  1.00  0.11 -0.04 -0.03
ACT        -0.04      0.15  0.11  1.00  0.56  0.59
SATV       -0.02      0.05 -0.04  0.56  1.00  0.64
SATQ       -0.17      0.03 -0.03  0.59  0.64  1.00
Sample Size
          gender education age ACT SATV SATQ
gender       700       700 700 700  700  687
education    700       700 700 700  700  687
age          700       700 700 700  700  687
ACT          700       700 700 700  700  687
SATV         700       700 700 700  700  687
SATQ         687       687 687 687  687  687
Probability value
          gender education  age  ACT SATV SATQ
gender      0.00      0.02 0.58 0.33 0.62 0.00
education   0.02      0.00 0.00 0.00 0.22 0.36
age         0.58      0.00 0.00 0.00 0.26 0.37
ACT         0.33      0.00 0.00 0.00 0.00 0.00
SATV        0.62      0.22 0.26 0.00 0.00 0.00
SATQ        0.00      0.36 0.37 0.00 0.00 0.00
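Because the probability values in the corr.test output are not corrected for multiple comparisons, it can help to see how such a correction might be applied. The sketch below uses only base R: the helper r_pvalue is hypothetical, the three correlations are the rounded values printed in the correlation matrix (so the p values only approximate those in the table), and the Holm method is one possible choice of correction, not something the text prescribes.

```r
# Hypothetical helper: two-tailed p value for a correlation r from n cases.
r_pvalue <- function(r, n) {
  t <- r * sqrt(n - 2) / sqrt(1 - r^2)
  2 * pt(-abs(t), df = n - 2)
}

# A few off-diagonal correlations (rounded, as printed) and their sample sizes
r <- c(gender_education = 0.09, gender_SATQ = -0.17, education_SATV = 0.05)
n <- c(700, 687, 700)

p_raw  <- r_pvalue(r, n)
p_holm <- p.adjust(p_raw, method = "holm")  # Holm correction for multiple tests
round(cbind(p_raw, p_holm), 3)
```

The adjusted values are never smaller than the raw ones, so a correlation that looks significant on its own may no longer pass once the number of tests is taken into account.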

Testing the difference between any two correlations can be done using the r.test function.
The function actually does four different tests, depending upon the input:

1) For a sample size n, find the t and p value for a single correlation as well as the confidence
interval.

> r.test(50,.3)

Correlation tests

Call:r.test(n = 50, r12 = 0.3)

Test of significance of a correlation

t value 2.18 with probability < 0.034
and confidence interval 0.02 0.53

2) For sample sizes of n and n2 (n2 = n if not specified) find the z of the difference
between the z transformed correlations divided by the standard error of the difference of
two z scores.

> r.test(30,.4,.6)

Correlation tests

Call:r.test(n = 30, r12 = 0.4, r34 = 0.6)

Test of difference between two independent correlations

z value 0.99 with probability 0.32
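The first two tests can be written out in a few lines of base R. This is a sketch of the textbook formulas (a t test for a single correlation, the Fisher z transformation for the confidence interval and for the difference between independent correlations); the helper names r_single and r_independent are hypothetical, and the code is not taken from r.test.

```r
# 1) t test and Fisher z confidence interval for a single correlation.
r_single <- function(n, r, level = 0.95) {
  t  <- r * sqrt(n - 2) / sqrt(1 - r^2)
  p  <- 2 * pt(-abs(t), df = n - 2)
  zc <- qnorm(1 - (1 - level) / 2)
  ci <- tanh(atanh(r) + c(-1, 1) * zc / sqrt(n - 3))  # back-transform the z limits
  list(t = t, p = p, ci = ci)
}

# 2) z test for the difference between two independent correlations.
r_independent <- function(n, r12, r34, n2 = n) {
  z <- (atanh(r12) - atanh(r34)) / sqrt(1 / (n - 3) + 1 / (n2 - 3))
  list(z = abs(z), p = 2 * pnorm(-abs(z)))
}

one <- r_single(50, 0.3)
two <- r_independent(30, 0.4, 0.6)
round(c(t = one$t, lower = one$ci[1], upper = one$ci[2]), 2)
round(c(z = two$z, p = two$p), 2)
```

With the same inputs as the two r.test calls, these formulas give the same rounded t value, confidence limits, z value and probability as the printed output.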

3) For sample size n, and correlations ra= r12, rb= r23 and r13 specified, test for the
difference of two dependent correlations (Steiger case A).

> r.test(103,.4,.5,.1)

Correlation tests

Call:r.test(n = 103, r12 = 0.4, r34 = 0.5, r23 = 0.1)

Test of difference between two correlated correlations

t value -0.89 with probability < 0.37

4) For sample size n, test for the difference between two dependent correlations involving
different variables. (Steiger case B).

> r.test(103,.5,.6,.7,.5,.5,.8) #steiger Case B

Correlation tests

Call:r.test(n = 103, r12 = 0.5, r34 = 0.6, r23 = 0.7, r13 = 0.5, r14 = 0.5,

r24 = 0.8)

Test of difference between two dependent correlations

z value -1.2 with probability 0.23

To test whether a matrix of correlations differs from what would be expected if the popu-
lation correlations were all zero, the function cortest follows Steiger (1980) who pointed
out that the sum of the squared elements of a correlation matrix, or the Fisher z score
equivalents, is distributed as chi square under the null hypothesis that the values are zero
(i.e., elements of the identity matrix). This is particularly useful for examining whether
correlations in a single matrix differ from zero or for comparing two matrices. Although
obvious, cortest can be used to test whether the sat.act data matrix produces non-zero
correlations (it does). This is a much more appropriate test when testing whether a residual
matrix differs from zero.

> cortest(sat.act)

Tests of correlation matrices

Call:cortest(R1 = sat.act)

Chi Square value 1325.42 with df = 15 with probability < 1.8e-273

3.4 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the
data, e.g., ability items, are thought to represent an underlying continuous although latent
variable, the φ will underestimate the value of the Pearson applied to these latent variables.
One solution to this problem is to use the tetrachoric correlation which is based upon
the assumption of a bivariate normal distribution that has been cut at certain points. The
draw.tetra function demonstrates the process (Figure 9). A simple generalization of this
to the case of the multiple cuts is the polychoric correlation.

Other estimated correlations based upon the assumption of bivariate normality with cut
points include the biserial and polyserial correlation.

If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor
function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial,
and polyserial correlations.

> draw.tetra()

[Plot: bivariate normal distribution cut at τ on X and Τ on Y, with the four quadrants labeled (X > τ, Y > Τ; X < τ, Y > Τ; X > τ, Y < Τ; X < τ, Y < Τ); rho = 0.5, phi = 0.33; axes from −3 to 3; marginal panel: dnorm(x) against x]

Figure 9: The tetrachoric correlation estimates what a Pearson correlation would be given
a two by two table of observed values assumed to be sampled from a bivariate normal
distribution. The φ correlation is just a Pearson r performed on the observed values.
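The way φ underestimates the latent correlation can be shown with a small simulation in base R. This is a sketch under stated assumptions: both variables are cut at τ = 0 (their medians), the sample size and seed are arbitrary, and the recovery formula used at the end holds only for median cuts. It is not the estimation method used by the tetrachoric function.

```r
set.seed(1)
n   <- 100000
rho <- 0.5

# Draw from a bivariate normal distribution with correlation rho.
x <- rnorm(n)
y <- rho * x + sqrt(1 - rho^2) * rnorm(n)

# Dichotomize both variables at tau = 0 and correlate the 0/1 scores.
phi <- cor(as.numeric(x > 0), as.numeric(y > 0))

# For cuts at the median, the latent correlation can be recovered from phi.
rho_hat <- sin(pi / 2 * phi)
round(c(phi = phi, rho_hat = rho_hat), 2)
```

The φ of the dichotomized scores comes out well below the latent correlation of 0.5, in line with the rho and phi values reported for Figure 9, while the back-transformed value is close to 0.5.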

4 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of
scales and for finding various estimates of scale reliability. These may be considered as
problems of dimension reduction (e.g., factor analysis, cluster analysis, principal compo-
nents analysis) and of forming and estimating the reliability of the resulting composite
scales.

4.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum
commonly attributed to William of Ockham to not multiply entities beyond necessity [1]. The
goal for parsimony is seen in psychometrics as an attempt either to describe (components)
or to explain (factors) the relationships between many observed variables in terms of a
more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer
underlying constructs [2]. At the most simple, a set of items can be thought to represent
a random sample from one underlying domain or perhaps a small set of domains. The
question for the psychometrician is how many domains are represented and how well does
each item represent the domains. Solutions to this problem are examples of factor analysis
(FA), principal components analysis (PCA), and cluster analysis (CA). All of these pro-
cedures aim to reduce the complexity of the observed data. In the case of FA, the goal is
to identify fewer underlying constructs to explain the observed data. In the case of PCA,
the goal can be mere data reduction, but the interpretation of components is frequently
done in terms similar to those used when describing the latent variables estimated by FA.
Cluster analytic techniques, although usually used to partition the subject space rather
than the variable space, can also be used to group variables to reduce the complexity of
the data by forming fewer and more homogeneous sets of tests or items.

At the data level the data reduction problem may be solved as a Singular Value Decom-
position of the original matrix, although the more typical solution is to find either the
principal components or factors of the covariance or correlation matrices. Given the pat-
tern of regression weights from the variables to the components or from the factors to the
variables, it is then possible to find (for components) individual component or cluster scores
or estimate (for factors) factor scores.
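The link between the Singular Value Decomposition of the data and the principal components of the correlation matrix can be checked directly in base R. The following sketch uses simulated data (the variable names, loadings and seed are made up) and is meant only to illustrate the relationship, not to reproduce the principal function.

```r
set.seed(7)
# Simulated data: 200 cases, 4 variables sharing one common source of variance.
g <- rnorm(200)
X <- sapply(1:4, function(j) 0.7 * g + rnorm(200, sd = 0.7))

# Principal components from the correlation matrix.
R   <- cor(X)
eig <- eigen(R)
loadings <- eig$vectors %*% diag(sqrt(eig$values))  # component loadings

# The same eigenvalues from a singular value decomposition of the standardized data.
Z  <- scale(X)
sv <- svd(Z)
eig_from_svd <- sv$d^2 / (nrow(Z) - 1)

# Component scores for the first component.
scores1 <- Z %*% eig$vectors[, 1]
round(rbind(eigen = eig$values, svd = eig_from_svd), 3)
```

The two rows of the printed table agree, the full set of loadings reproduces R, and the variance of the first component scores equals the first eigenvalue.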

[1] Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918),
Ockham’s razor remains a fundamental principle of science.

[2] Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more
factors than observed variables, but are willing to estimate the major underlying factors.

Several of the functions in psych address the problem of data reduction.

fa incorporates five alternative algorithms: minres factor analysis, principal a
