An overview of the psych package

William Revelle

Department of Psychology

Northwestern University

August 11, 2011

Contents

1 Overview of this and related documents
2 Getting started
3 Basic data analysis
  3.1 Data input and descriptive statistics
    3.1.1 Basic data cleaning using scrub
  3.2 Simple descriptive graphics
    3.2.1 Scatter Plot Matrices
    3.2.2 Means and error bars
    3.2.3 Back to back histograms
    3.2.4 Correlational structure
  3.3 Testing correlations
  3.4 Polychoric, tetrachoric, polyserial, and biserial correlations
4 Item and scale analysis
  4.1 Dimension reduction through factor analysis and cluster analysis
    4.1.1 Minimum Residual Factor Analysis
    4.1.2 Principal Axis Factor Analysis
    4.1.3 Weighted Least Squares Factor Analysis
    4.1.4 Principal Components analysis
    4.1.5 Hierarchical and bi-factor solutions
    4.1.6 Item Cluster Analysis: iclust
  4.2 Confidence intervals using bootstrapping techniques
  4.3 Comparing factor/component/cluster solutions
  4.4 Determining the number of dimensions to extract
    4.4.1 Very Simple Structure
    4.4.2 Parallel Analysis
  4.5 Factor extension
  4.6 Reliability analysis
    4.6.1 Reliability of a single scale
    4.6.2 Using omega to find the reliability of a single scale
    4.6.3 Estimating ωh using Confirmatory Factor Analysis
    4.6.4 Other estimates of reliability
    4.6.5 Reliability of multiple scales within an inventory
  4.7 Item analysis
5 Item Response Theory analysis
  5.1 Factor analysis and Item Response Theory
  5.2 Speeding up analyses
6 Set Correlation and Multiple Regression from the correlation matrix
7 Simulation functions
8 Graphical Displays
9 Data sets
10 Development version and a users guide
11 Psychometric Theory
12 SessionInfo

1 Overview of this and related documents

The psych package (Revelle, 2011) has been developed at Northwestern University to include functions most useful for personality, psychometric, and psychological research. Some of the functions (e.g., read.clipboard, describe, pairs.panels, error.bars) are useful for basic data entry and descriptive analyses.

Psychometric applications include routines for five types of factor analysis (fa does minimum residual, principal axis, weighted least squares, generalized least squares and maximum likelihood factor analysis). Determining the number of factors or components to extract may be done by using the Very Simple Structure (Revelle and Rocklin, 1979) (vss), Minimum Average Partial correlation (Velicer, 1976) (MAP) or parallel analysis (fa.parallel) criteria. Item Response Theory models for dichotomous or polytomous items may be found by factoring tetrachoric or polychoric correlation matrices and expressing the resulting parameters in terms of location and discrimination (irt.fa). Bifactor and hierarchical factor structures may be estimated by using Schmid-Leiman transformations (Schmid and Leiman, 1957) (schmid) to transform a hierarchical factor structure into a bifactor solution (Holzinger and Swineford, 1937). Scale construction can be done using the Item Cluster Analysis (Revelle, 1979) (iclust) function to determine the structure and to calculate reliability coefficients α (Cronbach, 1951) (alpha, score.items, score.multiple.choice), β (Revelle, 1979; Revelle and Zinbarg, 2009) (iclust) and McDonald's ωh and ωt (McDonald, 1999) (omega). Guttman's six estimates of internal consistency reliability (Guttman, 1945), as well as additional estimates (Revelle and Zinbarg, 2009), are in the guttman function, and the six measures of intraclass correlation coefficients (ICC) discussed by Shrout and Fleiss (1979) are also available.

Graphical displays include Scatter Plot Matrix (SPLOM) plots using pairs.panels; factor, cluster, and structural diagrams using fa.diagram, iclust.diagram, and structure.diagram; as well as item response characteristics and item and test information characteristic curves using plot.irt.

This vignette is meant to give an overview of the psych package: a summary of the main functions with examples of how they are used for data description, dimension reduction, and scale construction. The extended user manual at psych_manual.pdf includes examples of graphic output and more extensive demonstrations than are found in the help menus. (Also available at http://personality-project.org/r/psych_manual.pdf.) The vignette psych for sem, at psych_for_sem.pdf, discusses how to use psych as a front end to the sem package of John Fox (Fox, 2009). (That vignette is also available at http://personality-project.org/r/book/psych_for_sem.pdf.)

For a step by step tutorial in the use of the psych package and the base functions in R for basic personality research, see the guide for using R for personality research at http://personalitytheory.org/r/r.short.html. For an introduction to psychometric theory with applications in R, see the draft chapters at http://personality-project.org/r/book.

2 Getting started

Some of the functions described in this overview require other packages. Particularly useful for rotating the results of factor analyses (from e.g., fa, factor.minres, factor.pa, factor.wls, or principal) or hierarchical factor models using omega or schmid is the GPArotation package. For some analyses of correlations of dichotomous data, the polycor package is suggested in order to use either the poly.mat or phi2poly functions. However, these functions have been supplemented with tetrachoric and polychoric, which do not require the polycor package. These and other useful packages may be installed by first installing and then using the task views (ctv) package to install the "Psychometrics" task view, although doing it this way is not necessary.

install.packages("ctv")
library(ctv)
task.views("Psychometrics")

Because of the difficulty of installing the package Rgraphviz, alternative graphics have been developed and are available as diagram functions. If Rgraphviz is available, some functions will take advantage of it.

3 Basic data analysis

A number of psych functions facilitate the entry of data and finding basic descriptive statistics.

Remember, to run any of the psych functions, it is necessary to make the package active by using the library command:

> library(psych)

The other packages, once installed, will be called automatically by psych.

It is possible to automatically load psych and other functions by creating and then saving a ".First" function:

.First <- function(x) {library(psych)}

3.1 Data input and descriptive statistics

There are of course many ways to enter data into R. Reading from a local file using read.table is perhaps the most preferred. However, many users will enter their data in a text editor or spreadsheet program and then want to copy and paste into R. This may be done by using read.table and specifying the input file as "clipboard" (PCs) or "pipe(pbpaste)" (Macs). Alternatively, the read.clipboard set of functions is perhaps more user friendly:

read.clipboard is the base function for reading data from the clipboard.

read.clipboard.csv for reading text that is comma delimited.

read.clipboard.tab for reading text that is tab delimited.

read.clipboard.lower for reading input of a lower triangular matrix with or without a diagonal. The resulting object is a square matrix.

read.clipboard.upper for reading input of an upper triangular matrix.

read.clipboard.fwf for reading in fixed width fields (some very old data sets).

For example, given a data set copied to the clipboard from a spreadsheet, just enter the command

> my.data <- read.clipboard()

This will work if every data field has a value and even missing data are given some value (e.g., NA or -999). If the data were entered in a spreadsheet and the missing values were just empty cells, then the data should be read in as tab delimited or by using the read.clipboard.tab function.

> my.data <- read.clipboard(sep="\t")  #define the tab option, or
> my.tab.data <- read.clipboard.tab()  #just use the alternative function

For the case of data in fixed width fields (some old data sets tend to have this format), copy to the clipboard and then specify the width of each field (in the example below, the first variable is 5 columns, the second is 2 columns, the next 5 are 1 column each, and the last 4 are 3 columns each).

> my.data <- read.clipboard.fwf(widths=c(5,2,rep(1,5),rep(3,4)))

Once the data are read in, then describe or describe.by will provide basic descriptive statistics arranged in a data frame format. Consider the data set sat.act which includes data from 700 web based participants on 3 demographic variables and 3 ability measures. describe reports means, standard deviations, medians, min, max, range, skew, kurtosis and standard errors for integer or real data. Non-numeric data will produce an error. describe.by reports descriptive statistics broken down by some categorizing variable (e.g., gender, age, etc.).

> library(psych)

> data(sat.act)

> describe(sat.act) #basic descriptive statistics

var n mean sd median trimmed mad min max range skew kurtosis se

gender 1 700 1.65 0.48 2 1.68 0.00 1 2 1 -0.61 -1.62 0.02

education 2 700 3.16 1.43 3 3.31 1.48 0 5 5 -0.68 -0.06 0.05

age 3 700 25.59 9.50 22 23.86 5.93 13 65 52 1.64 2.47 0.36


ACT 4 700 28.55 4.82 29 28.84 4.45 3 36 33 -0.66 0.56 0.18

SATV 5 700 612.23 112.90 620 619.45 118.61 200 800 600 -0.64 0.35 4.27

SATQ 6 687 610.22 115.64 620 617.25 118.61 200 800 600 -0.59 0.00 4.41

These data may then be analyzed by groups defined in a logical statement or by some other variable. E.g., break down the descriptive data for males or females. These descriptive data can also be seen graphically using the error.bars.by function (Figure 3). By setting skew=FALSE and ranges=FALSE, the output is limited to the most basic statistics.

> describe.by(sat.act,sat.act$gender,skew=FALSE,ranges=FALSE) #basic descriptive statistics by a grouping variable.

group: 1

var n mean sd se

gender 1 247 1.00 0.00 0.00

education 2 247 3.00 1.54 0.10

age 3 247 25.86 9.74 0.62

ACT 4 247 28.79 5.06 0.32

SATV 5 247 615.11 114.16 7.26

SATQ 6 245 635.87 116.02 7.41

------------------------------------------------------------

group: 2

var n mean sd se

gender 1 453 2.00 0.00 0.00

education 2 453 3.26 1.35 0.06

age 3 453 25.45 9.37 0.44

ACT 4 453 28.42 4.69 0.22

SATV 5 453 610.66 112.31 5.28

SATQ 6 442 596.00 113.07 5.38

The output from the describe.by function can be forced into matrix form for easy analysis by other programs. In addition, describe.by can group by several grouping variables at the same time.

> sa.mat <- describe.by(sat.act,list(sat.act$gender,sat.act$education),
+    skew=FALSE,ranges=FALSE,mat=TRUE)
> head(sa.mat)

item group1 group2 var n mean sd se

gender1 1 1 0 1 27 1 0 0

gender2 2 2 0 1 30 2 0 0

gender3 3 1 1 1 20 1 0 0

gender4 4 2 1 1 25 2 0 0

gender5 5 1 2 1 23 1 0 0

gender6 6 2 2 1 21 2 0 0


> tail(sa.mat)

item group1 group2 var n mean sd se

SATQ7 67 1 3 6 79 642.5949 118.28329 13.307910

SATQ8 68 2 3 6 190 590.8895 114.46472 8.304144

SATQ9 69 1 4 6 51 635.9020 104.12190 14.579982

SATQ10 70 2 4 6 86 597.5930 106.24393 11.456578

SATQ11 71 1 5 6 46 657.8261 89.60811 13.211995

SATQ12 72 2 5 6 93 606.7204 105.55108 10.945137

3.1.1 Basic data cleaning using scrub

If, after describing the data, it is apparent that there were data entry errors that need to be globally replaced with NA, or only certain ranges of data will be analyzed, the data can be "cleaned" using the scrub function.

As an example, consider a 12 x 10 matrix of the numbers 1 to 120 (the attitude data set does not need to be cleaned, so this artificial matrix is used instead). All values of columns 3-5 that are less than 30, 40, or 50 respectively, or greater than 70 in any of those columns, will be replaced with NA. In addition, any value exactly equal to 45 will be set to NA. (max and isvalue are set to one value here, but they could be a different value for every column.)

> x <- matrix(1:120,ncol=10,byrow=TRUE)
> new.x <- scrub(x,3:5,min=c(30,40,50),max=70,isvalue=45)
> new.x

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]

[1,] 1 2 NA NA NA 6 7 8 9 10

[2,] 11 12 NA NA NA 16 17 18 19 20

[3,] 21 22 NA NA NA 26 27 28 29 30

[4,] 31 32 33 34 35 36 37 38 39 40

[5,] 41 42 43 44 NA 46 47 48 49 50

[6,] 51 52 53 54 55 56 57 58 59 60

[7,] 61 62 63 64 65 66 67 68 69 70

[8,] 71 72 NA NA NA 76 77 78 79 80

[9,] 81 82 NA NA NA 86 87 88 89 90

[10,] 91 92 NA NA NA 96 97 98 99 100

[11,] 101 102 NA NA NA 106 107 108 109 110

[12,] 111 112 NA NA NA 116 117 118 119 120

Note that the number of subjects for those columns has decreased, and the minimums have gone up but the maximums have come down. Data cleaning and examination for outliers should be a routine part of any data analysis.


3.2 Simple descriptive graphics

Graphic descriptions of data are very helpful both for understanding the data as well as for communicating important results. Scatter Plot Matrices (SPLOMs) using the pairs.panels function are useful ways to look for strange effects involving outliers and non-linearities. error.bars.by will show group means with 95% confidence boundaries.

3.2.1 Scatter Plot Matrices

Scatter Plot Matrices (SPLOMs) are very useful for describing the data. The pairs.panels function, adapted from the help menu for the pairs function, produces xy scatter plots of each pair of variables below the diagonal, shows the histogram of each variable on the diagonal, and shows the lowess locally fit regression line as well. An ellipse around the mean, with axis lengths reflecting one standard deviation of the first and second principal components, is also drawn. The x axis in each scatter plot represents the column variable, the y axis the row variable (Figure 1).

pairs.panels will show the pairwise scatter plots of all the variables as well as histograms, locally smoothed regressions, and the Pearson correlation.

Another use of pairs.panels is to show differences between experimental groups. Consider the data in the affect data set. The scores reflect post test scores on positive and negative affect and energetic and tense arousal. The colors show the results for four movie conditions: a depressing movie, a frightening movie, a neutral movie, and a comedy (Figure 2).

3.2.2 Means and error bars

Additional descriptive graphics include the ability to draw error bars on sets of data, as well as to draw error bars in both the x and y directions for paired data. These are the functions:

error.bars shows the 95% confidence intervals for each variable in a data frame or matrix.

error.bars.by does the same, but grouping the data by some condition.

error.crosses draws the confidence intervals for an x set and a y set of the same size.

The use of the error.bars.by function allows for graphic comparisons of different groups (see Figure 3). Five personality measures are shown as a function of high versus low scores on a "lie" scale. People with higher lie scores tend to report being more agreeable, more conscientious, and less neurotic than people with lower lie scores. The error bars are based upon normal theory and thus are symmetric rather than reflecting any skewing in the data.
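The symmetric, normal-theory intervals just described can be reproduced by hand. A minimal base R sketch (using invented data, not a psych data set) of the mean ± t × standard error computation that underlies these error bars:

```r
# Normal-theory 95% confidence limits: mean +/- qt(.975, n-1) * se.
# Illustrative data, not from the psych package.
x  <- c(4, 5, 6, 5, 7, 6, 5, 4, 6, 5)
n  <- length(x)
se <- sd(x) / sqrt(n)                       # standard error of the mean
ci <- mean(x) + c(-1, 1) * qt(0.975, n - 1) * se
ci                                          # lower and upper 95% limits
```

Because the same quantity is added and subtracted, the limits are always symmetric about the mean, which is why skewed data are not well represented by this display.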


> png('pairspanels.png')
> pairs.panels(sat.act)
> dev.off()

quartz

2

Figure 1: Using the pairs.panels function to graphically show relationships. The x axis

in each scatter plot represents the column variable, the y axis the row variable. Note the

extreme outlier for the ACT.


> png('affect.png')
> pairs.panels(flat[15:18],bg=c("red","black","white","blue")[maps$Film],pch=21,
+    main="Affect varies by movies (study 'flat')")

> dev.off()

quartz

2

Figure 2: Using the pairs.panels function to graphically show relationships. The x axis in

each scatter plot represents the column variable, the y axis the row variable. The coloring

represent four different movie conditions.


> data(epi.bfi)

> error.bars.by(epi.bfi[,6:10],epi.bfi$epilie<4)

Figure 3: Using the error.bars.by function shows that self reported personality scales on the Big Five Inventory vary as a function of the Lie scale on the EPI.

Although not recommended, it is possible to use the error.bars function to draw bar graphs with associated error bars. This kind of "dynamite plot" (Figure 4) can be very misleading in that the scale is arbitrary. Go to a discussion of the problems in presenting data this way at http://emdbolker.wikidot.com/blog:dynamite.

> error.bars.by(sat.act[5:6],sat.act$gender,bars=TRUE,labels=c("Male","Female"),
+    ylab="SAT score",xlab="")


Figure 4: A “Dynamite plot” of SAT scores as a function of gender is one way of misleading

the reader. By using a bar graph, the range of scores is ignored.

3.2.3 Back to back histograms

The bi.bars function summarizes the characteristics of two groups (e.g., males and females) on a second variable (e.g., age) by drawing back to back histograms (see Figure 5).
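The back to back idea itself is simple: bin the second variable within each group and draw one group's counts in the negative direction. A rough base R imitation on simulated data (bi.bars handles labeling and scaling itself; nothing below comes from the psych source):

```r
# Back to back histogram sketch: one group's counts are negated so the
# two groups extend in opposite directions from a common axis.
set.seed(42)                                 # simulated ages, not the bfi data
age    <- sample(18:60, 500, replace = TRUE)
gender <- sample(1:2,  500, replace = TRUE)
breaks <- seq(15, 65, 5)
m <- hist(age[gender == 1], breaks = breaks, plot = FALSE)$counts
f <- hist(age[gender == 2], breaks = breaks, plot = FALSE)$counts
barplot(-m, horiz = TRUE, xlim = c(-max(m, f), max(m, f)), col = "gray")
barplot( f, horiz = TRUE, add = TRUE, col = "white")
```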


> data(bfi)
> with(bfi,{bi.bars(age,gender,ylab="Age",main="Age by males and females")})


Figure 5: A bar plot of the age distribution for males and females shows the use of bi.bars.

The data are males and females from 2800 cases collected using the SAPA procedure and

are available as part of the bfi data set.


3.2.4 Correlational structure

It is also possible to see the structure in a correlation matrix by forming a matrix shaded to represent the magnitude of the correlations. This is useful when considering the number of factors in a data set. Consider the Thurstone data set which has a clear 3 factor solution (Figure 6) or a simulated data set of 24 variables with a circumplex structure (Figure 7).

Yet another way to show structure is to use "spider" plots. Particularly if variables are ordered in some meaningful way (e.g., in a circumplex), a spider plot will show this structure easily. This is just a plot of the magnitude of the correlation as a radial line, with length ranging from 0 (for a correlation of -1) to 1 (for a correlation of 1).
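The radial lengths just described are a linear rescaling of the correlations. Assuming a straight-line mapping (an illustration of the stated endpoints, not code taken from psych):

```r
# Map correlations in [-1, 1] to radial lengths in [0, 1]:
# r = -1 -> 0, r = 0 -> 0.5, r = 1 -> 1.
radial <- function(r) (r + 1) / 2
radial(c(-1, 0, 1))    # 0.0 0.5 1.0
```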

3.3 Testing correlations

Correlations are wonderful descriptive statistics of the data, but some people like to test whether these correlations differ from zero or differ from each other. The cor.test function (in the stats package) will test the significance of a single correlation, and the rcorr function in the Hmisc package will do this for many correlations. In the psych package, the corr.test function reports the correlation (Pearson or Spearman) between all variables in either one or two data frames or matrices, as well as the number of observations for each case, and the (two-tailed) probability for each correlation. These probability values have not been corrected for multiple comparisons and so should be taken with a great deal of salt.

> corr.test(sat.act)

Call:corr.test(x = sat.act)

Correlation matrix

gender education age ACT SATV SATQ

gender 1.00 0.09 -0.02 -0.04 -0.02 -0.17

education 0.09 1.00 0.55 0.15 0.05 0.03

age -0.02 0.55 1.00 0.11 -0.04 -0.03

ACT -0.04 0.15 0.11 1.00 0.56 0.59

SATV -0.02 0.05 -0.04 0.56 1.00 0.64

SATQ -0.17 0.03 -0.03 0.59 0.64 1.00

Sample Size

gender education age ACT SATV SATQ

gender 700 700 700 700 700 687

education 700 700 700 700 700 687

age 700 700 700 700 700 687


> png('corplot.png')
> cor.plot(Thurstone,main="9 cognitive variables from Thurstone")
> dev.off()

quartz

2

Figure 6: The structure of a correlation matrix can be seen more clearly if the variables are grouped by factor and then the correlations are shown by color.


> png('circplot.png')
> circ <- sim.circ(24)
> r.circ <- cor(circ)
> cor.plot(r.circ,main="24 variables in a circumplex")

> dev.off()

quartz

2

Figure 7: Using the cor.plot function to show the correlations in a circumplex. Correlations are highest near the diagonal, diminish to zero further from the diagonal, and then increase again towards the corners of the matrix. Circumplex structures are common in the study of affect.


> png('spider.png')
> op <- par(mfrow=c(2,2))
> spider(y=c(1,6,12,18),x=1:24,data=r.circ,fill=TRUE,main="Spider plot of 24 circumplex variables")
> op <- par(mfrow=c(1,1))
> dev.off()

quartz

2

Figure 8: A spider plot can show circumplex structure very clearly. Circumplex structures are common in the study of affect.


ACT 700 700 700 700 700 687

SATV 700 700 700 700 700 687

SATQ 687 687 687 687 687 687

Probability value

gender education age ACT SATV SATQ

gender 0.00 0.02 0.58 0.33 0.62 0.00

education 0.02 0.00 0.00 0.00 0.22 0.36

age 0.58 0.00 0.00 0.00 0.26 0.37

ACT 0.33 0.00 0.00 0.00 0.00 0.00

SATV 0.62 0.22 0.26 0.00 0.00 0.00

SATQ 0.00 0.36 0.37 0.00 0.00 0.00
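Since these probabilities are uncorrected, one might pass them through the base R p.adjust function as a post-processing step (a suggestion, not something corr.test does itself; the values below are the gender row of the table above, used only for illustration):

```r
# Adjust a set of probability values for multiple comparisons.
p <- c(0.02, 0.58, 0.33, 0.62, 0.00, 0.00)   # gender row, for illustration
p.adjust(p, method = "holm")                 # step-down Holm correction
p.adjust(p, method = "bonferroni")           # simple Bonferroni: min(1, k*p)
```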

Testing the difference between any two correlations can be done using the r.test function. The function actually does four different tests, depending upon the input:

1) For a sample size n, find the t and p value for a single correlation as well as the confidence interval.

> r.test(50,.3)

Correlation tests

Call:r.test(n = 50, r12 = 0.3)

Test of significance of a correlation

t value 2.18 with probability < 0.034 and confidence interval 0.02 0.53

2) For sample sizes of n and n2 (n2 = n if not specified), find the z of the difference between the z transformed correlations divided by the standard error of the difference of two z scores.

> r.test(30,.4,.6)

Correlation tests

Call:r.test(n = 30, r12 = 0.4, r34 = 0.6)

Test of difference between two independent correlations

z value 0.99 with probability 0.32
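The z of 0.99 can be reproduced by hand from the formula just described (a base R sketch; atanh is the Fisher r-to-z transformation):

```r
# Difference between two independent correlations:
# z = (z1 - z2) / sqrt(1/(n1-3) + 1/(n2-3)), where z = atanh(r).
r1 <- 0.4; r2 <- 0.6
n1 <- 30;  n2 <- 30
z <- (atanh(r1) - atanh(r2)) / sqrt(1/(n1 - 3) + 1/(n2 - 3))
round(abs(z), 2)        # 0.99, matching r.test(30, .4, .6)
2 * pnorm(-abs(z))      # two-tailed probability, about 0.32
```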

3) For sample size n, and correlations ra = r12, rb = r23 and r13 specified, test for the difference of two dependent correlations (Steiger case A).

> r.test(103,.4,.5,.1)

Correlation tests

Call:r.test(n = 103, r12 = 0.4, r34 = 0.5, r23 = 0.1)

Test of difference between two correlated correlations

t value -0.89 with probability < 0.37

4) For sample size n, test for the difference between two dependent correlations involving different variables (Steiger case B).

> r.test(103,.5,.6,.7,.5,.5,.8)   #Steiger case B

Correlation tests

Call:r.test(n = 103, r12 = 0.5, r34 = 0.6, r23 = 0.7, r13 = 0.5, r14 = 0.5,

r24 = 0.8)

Test of difference between two dependent correlations

z value -1.2 with probability 0.23

To test whether a matrix of correlations differs from what would be expected if the population correlations were all zero, the function cortest follows Steiger (1980), who pointed out that the sum of the squared elements of a correlation matrix, or of the Fisher z score equivalents, is distributed as chi square under the null hypothesis that the values are zero (i.e., elements of the identity matrix). This is particularly useful for examining whether correlations in a single matrix differ from zero or for comparing two matrices. Although obvious, cortest can be used to test whether the sat.act data matrix produces non-zero correlations (it does). This is a much more appropriate test when testing whether a residual matrix differs from zero.
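The statistic just described can be sketched in base R: square the Fisher z equivalents of the p(p-1)/2 unique correlations, sum them, and scale by the sample size. This is a reconstruction from the text above, shown on a toy matrix; cortest itself handles matrix input, missing data, and two-matrix comparisons:

```r
# Chi square test that all population correlations are zero:
# (n - 3) times the sum of squared Fisher z's of the unique correlations,
# with df = p(p-1)/2.
cortest.sketch <- function(R, n) {
  z    <- atanh(R[lower.tri(R)])   # Fisher z of the unique correlations
  chi2 <- (n - 3) * sum(z^2)
  df   <- length(z)
  c(chi2 = chi2, df = df, p = pchisq(chi2, df, lower.tail = FALSE))
}
R <- matrix(c(1, .5, .5, 1), 2)    # toy 2 x 2 correlation matrix
cortest.sketch(R, n = 50)
```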

> cortest(sat.act)

Tests of correlation matrices

Call:cortest(R1 = sat.act)

Chi Square value 1325.42 with df = 15 with probability < 1.8e-273

3.4 Polychoric, tetrachoric, polyserial, and biserial correlations

The Pearson correlation of dichotomous data is also known as the φ coefficient. If the data, e.g., ability items, are thought to represent an underlying continuous although latent variable, φ will underestimate the value of the Pearson correlation applied to these latent variables. One solution to this problem is to use the tetrachoric correlation, which is based upon the assumption of a bivariate normal distribution that has been cut at certain points. The draw.tetra function demonstrates the process (Figure 9). A simple generalization of this to the case of multiple cuts is the polychoric correlation. Other estimated correlations based upon the assumption of bivariate normality with cut points include the biserial and polyserial correlations. If the data are a mix of continuous, polytomous and dichotomous variables, the mixed.cor function will calculate the appropriate mixture of Pearson, polychoric, tetrachoric, biserial, and polyserial correlations.

> draw.tetra()


Figure 9: The tetrachoric correlation estimates what a Pearson correlation would be given a two by two table of observed values assumed to be sampled from a bivariate normal distribution. The φ correlation is just a Pearson r performed on the observed values.
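The φ coefficient mentioned in the caption can be computed directly from a two by two table of counts (a base R sketch with an invented table; in psych one would pass such a table to tetrachoric to estimate the latent correlation instead):

```r
# Pearson phi from a 2 x 2 table of counts:
# (ad - bc) / sqrt(product of the four marginal totals).
tab <- matrix(c(40, 10,
                10, 40), 2, 2, byrow = TRUE)   # invented counts
phi <- (tab[1, 1] * tab[2, 2] - tab[1, 2] * tab[2, 1]) /
  sqrt(prod(rowSums(tab)) * prod(colSums(tab)))
phi    # 0.6 for this table
```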


4 Item and scale analysis

The main functions in the psych package are for analyzing the structure of items and of scales and for finding various estimates of scale reliability. These may be considered as problems of dimension reduction (e.g., factor analysis, cluster analysis, principal components analysis) and of forming and estimating the reliability of the resulting composite scales.

4.1 Dimension reduction through factor analysis and cluster analysis

Parsimony of description has been a goal of science since at least the famous dictum commonly attributed to William of Ockham to not multiply entities beyond necessity¹. The goal for parsimony is seen in psychometrics as an attempt either to describe (components) or to explain (factors) the relationships between many observed variables in terms of a more limited set of components or latent factors.

The typical data matrix represents multiple items or scales usually thought to reflect fewer underlying constructs². At the most simple, a set of items can be thought to represent a random sample from one underlying domain or perhaps a small set of domains. The question for the psychometrician is how many domains are represented and how well does each item represent the domains. Solutions to this problem are examples of factor analysis (FA), principal components analysis (PCA), and cluster analysis (CA). All of these procedures aim to reduce the complexity of the observed data. In the case of FA, the goal is to identify fewer underlying constructs to explain the observed data. In the case of PCA, the goal can be mere data reduction, but the interpretation of components is frequently done in terms similar to those used when describing the latent variables estimated by FA. Cluster analytic techniques, although usually used to partition the subject space rather than the variable space, can also be used to group variables to reduce the complexity of the data by forming fewer and more homogeneous sets of tests or items.

At the data level the data reduction problem may be solved as a Singular Value Decomposition of the original matrix, although the more typical solution is to find either the principal components or factors of the covariance or correlation matrices. Given the pattern of regression weights from the variables to the components or from the factors to the variables, it is then possible to find (for components) individual component or cluster scores or to estimate (for factors) factor scores.
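The equivalence between these two routes can be verified numerically: the squared singular values of the standardized data matrix, divided by n − 1, equal the eigenvalues of the correlation matrix. A base R sketch on random data (nothing here is psych specific):

```r
# SVD of the standardized data carries the same information as the
# eigendecomposition of the correlation matrix.
set.seed(1)
X <- matrix(rnorm(100 * 6), 100, 6)      # 100 observations, 6 variables
Z <- scale(X)                            # center and standardize
ev.svd   <- svd(Z)$d^2 / (nrow(Z) - 1)   # from the singular values
ev.eigen <- eigen(cor(X))$values         # from the correlation matrix
all.equal(ev.svd, ev.eigen)              # TRUE
```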

¹Although probably neither original with Ockham nor directly stated by him (Thorburn, 1918), Ockham's razor remains a fundamental principle of science.

²Cattell (1978) as well as MacCallum et al. (2007) argue that the data are the result of many more factors than observed variables, but are willing to estimate the major underlying factors.


Several of the functions in psych address the problem of data reduction.

fa incorporates five alternative algorithms: minres factor analysis, principal a