Field Profiling - PowerPoint Presentation

natalia-silvester

Uploaded on 2017-09-14

Productivity. Top Journals. Top Researchers. Measuring Scholarly Impact in the field of Semantic Web. Data: 44,157 papers with 651,673 citations from Scopus (1975-2009) and 22,951 papers with 571,911 citations from WOS (1960-2009). ID: 587637


Presentation Transcript

Slide 1

Field Profiling

Slide 2

Productivity

Top Journals

Top Researchers

Measuring Scholarly Impact in the field of Semantic Web

Data: 44,157 papers with 651,673 citations from Scopus (1975-2009), and 22,951 papers with 571,911 citations from WOS (1960-2009).

Slide 3

Impact through citation

Top Journals

Top Researchers

Impact

Slide 4

Rising Stars

In WOS, M. A. Harris (Gene Ontology-related research), T. Harris (design and implementation of programming languages), and L. Ding (Swoogle, a Semantic Web search engine) are ranked as the top three authors with the highest increase in citations.

In Scopus, D. Roman (Semantic Web services), J. de Bruijn (logic programming), and L. Ding (Swoogle) are ranked as the top three for the most significant increase in number of citations.

Ding, Y. (2010). Semantic Web: Who is who in the field. Journal of Information Science, 36(3): 335-356.

Slide 5

Data collection

Section 1

Slide 6

Step 1: Data collection

Using journals

Using keywords

Example: INFORMATION RETRIEVAL, INFORMATION STORAGE AND RETRIEVAL, QUERY PROCESSING, DOCUMENT RETRIEVAL, DATA RETRIEVAL, IMAGE RETRIEVAL, TEXT RETRIEVAL, CONTENT BASED RETRIEVAL, CONTENT-BASED RETRIEVAL, DATABASE QUERY, DATABASE QUERIES, QUERY LANGUAGE, QUERY LANGUAGES, and RELEVANCE FEEDBACK.

Steps

Slide 7

Go to the IU Web of Science: http://libraries.iub.edu/resources/wos

For example:

Select Core Collection; search "Information Retrieval" as topic, for all years.

Web of Science

Slide 8

Web of Science

Slide 9

Output

Slide 10

Output

Slide 11

Download Python: https://www.python.org/downloads/

In order to run Python flawlessly, you might have to change certain environment settings in Windows.

In short, your path is: My Computer ‣ Properties ‣ Advanced ‣ Environment Variables. In this dialog, you can add or modify User and System variables. To change System variables, you need non-restricted access to your machine (i.e., Administrator rights).

User variable: C:\Program Files (x86)\Python27\Lib;

Or go to the command line and use "set" and "echo %path%".

Python

Slide 12

Python Script for conversion

#!/usr/bin/env python
# encoding: utf-8
"""conwos.py
Convert WOS export files into tab-separated format."""
# Python 2 script (print statement, raw_input).

import sys
import os
import re

paper = 'paper.tsv'
reference = 'reference.tsv'
defsource = 'source'

def main():
    global defsource
    source = raw_input('What is the name of source folder?\n')
    if len(source) < 1:
        source = defsource
    files = os.listdir(source)
    fpaper = open(paper, 'w')
    fref = open(reference, 'w')
    uid = 0
    for name in files:
        # process only the .txt export files
        if name[-3:] != "txt":
            continue
        fil = open('%s\%s' % (source, name))
        print '%s is processing...' % name
        first = True

Conwos1.py

Slide 13

Python Script for conversion

        # continuation of main(): still inside the "for name in files" loop
        for line in fil:
            line = line[:-1]
            if first == True:
                # skip the header line of each file
                first = False
            else:
                uid += 1
                record = str(uid) + "\t"
                refs = ""
                elements = line.split('\t')
                for i in range(len(elements)):
                    element = elements[i]
                    if i == 1:
                        # keep at most the first five authors
                        authors = element.split('; ')
                        for j in range(5):
                            if j < len(authors):
                                record += authors[j] + "\t"
                            else:
                                record += "\t"
                    elif i == 29:
                        # cited-references field: write one row per reference
                        refs = element
                        refz = getRefs(refs)
                        for ref in refz:
                            fref.write(str(uid) + "\t" + ref + "\n")
                        continue
                    record += element + "\t"
                fpaper.write(record[:-1] + "\n")
        fil.close()
    fpaper.close()
    fref.close()

Slide 14

Python Script for conversion

def getRefs(refs):
    refz = []
    reflist = refs.split('; ')
    for ref in reflist:
        record = ""
        segs = ref.split(", ")
        author = ""
        ind = -1
        if len(segs) == 0:
            continue
        for seg in segs:
            ind += 1
            # accumulate author-name segments until the 4-digit year is found
            if isYear(seg):
                record += author[:-2] + "\t" + seg + "\t"
                break
            else:
                author += seg + ", "
        ind += 1
        if ind < len(segs):
            # the segment after the year is the source title,
            # unless it already looks like a volume or page
            if not isVol(segs[ind]) and not isPage(segs[ind]):
                record += segs[ind] + "\t"
                ind += 1
            else:
                record += "\t"
        else:
            record += "\t"

Slide 15

Python Script for conversion

        # continuation of getRefs(): still inside the loop over reflist
        if ind < len(segs):
            if isVol(segs[ind]):
                record += segs[ind][1:] + "\t"
                ind += 1
            else:
                record += "\t"
        else:
            record += "\t"
        if ind < len(segs):
            if isPage(segs[ind]):
                record += segs[ind][1:] + "\t"
                ind += 1
            else:
                record += "\t"
        else:
            record += "\t"
        if record[0] != "\t":
            refz.append(record[:-1])
    return refz

Slide 16

Python Script for conversion

def isYear(episode):
    pattern = '^\d{4}$'
    regx = re.compile(pattern)
    match = regx.search(episode)
    if match != None:
        return True

def isVol(episode):
    pattern = '^V\d+$'
    regx = re.compile(pattern)
    match = regx.search(episode)
    if match != None:
        return True

def isPage(episode):
    pattern = '^P\d+$'
    regx = re.compile(pattern)
    match = regx.search(episode)
    if match != None:
        return True

if __name__ == '__main__':
    main()

Slide 17
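For reference, the three pattern tests in the script can be written more compactly in modern Python. A minimal Python 3 sketch (the function names mirror the script's helpers; the sample reference string is invented for illustration, not taken from the dataset):

```python
import re

def is_year(seg):
    # four-digit year, e.g. "1975"
    return re.fullmatch(r'\d{4}', seg) is not None

def is_vol(seg):
    # volume segment, e.g. "V26"
    return re.fullmatch(r'V\d+', seg) is not None

def is_page(seg):
    # page segment, e.g. "P321"
    return re.fullmatch(r'P\d+', seg) is not None

# A made-up WOS-style cited reference, split the same way the script splits it:
segs = "SALTON G, 1975, J AM SOC INFORM SCI, V26, P321".split(", ")
```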

Using the Python script conwos1.py. Output: paper.tsv, reference.tsv

Convert output to database

Slide 18

Paper.tsv

Convert output to database

Slide 19

Reference.tsv

Convert output to database

Slide 20

Import the data into Access (External Data)

Load the files into Access

Slide 21

Paper table

Access Tables

Slide 22

Citation table

Access Tables

Slide 23

Productivity & impact

Section 2

Slide 24

Top Authors: Find duplicate records (Query template)

Productivity

Slide 25

Top Journals: Find duplicate records (Query template)

Productivity

Slide 26

Top Organizations: Find duplicate records (Query template)

Productivity

Slide 27

Highly cited authors: Find duplicate records (Query template)

Impact

Slide 28

Highly cited journals: Find duplicate records (Query template)

Impact

Slide 29

Highly cited articles: Find duplicate records (Query template)

Impact

Slide 30

What are other indicators to measure productivity and impact?

Time

Journal impact factor

Journal category

Keyword

... think about something in depth: what are your new indicators?

Other indicators

Slide 31

Author-cocitation network

Section 3

Slide 32

First select the set of authors with whom you want to build up the matrix.

Select the top 100 highly cited authors.

Top 100 highly cited authors

Slide 33
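The slides build the co-citation matrix with Access queries; the same counting logic can be sketched in a few lines of Python (illustrative only: the author names and the input structure here are invented, not from the dataset):

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(cited_authors_per_paper):
    """Count how often each pair of authors is cited together by the same paper."""
    counts = Counter()
    for authors in cited_authors_per_paper:
        # each unordered author pair cited by this paper gets one co-citation
        for a, b in combinations(sorted(authors), 2):
            counts[(a, b)] += 1
    return counts

papers = [
    {"Salton G", "Robertson SE"},
    {"Salton G", "Robertson SE", "van Rijsbergen CJ"},
]
counts = cocitation_counts(papers)
```

The Counter's keys are sorted author pairs, so the counts can be written straight into a symmetric co-citation matrix for SPSS.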

Author Cocitation Network

Slide 34

Author Cocitation Network

Slide 35

Author Cocitation Network

Slide 36

Load the network into SPSS

Slide 37

Load the network into SPSS

Slide 38

Clustering

Section 4

Slide 39

Aim: create clusters of items that are similar to the other items in the same cluster and different from the items outside the cluster. That is, create similarity within clusters and difference between clusters.

Items are called cases in SPSS.

There are no dependent variables in cluster analysis.

Clustering Analysis

Slide 40

The degree of similarity or dissimilarity is measured by the distance between cases. Euclidean distance measures the length of the straight line between two cases.

The numeric values used for distances should be on the same measurement scale. If they are based on different measurement scales:

Transform them to the same scale, or

Create a distance matrix first.

Clustering Analysis

Slide 41

Hierarchical clustering does not need the number of clusters decided first; it is good for a small set of cases. K-means does need the number of clusters first; it is good for a large set of cases.

Clustering

Slide 42

Hierarchical Clustering

Slide 43

Hierarchical Clustering

Slide 44

Data. The variables can be quantitative, binary, or count data.

Scaling of variables is an important issue: differences in scaling may affect your cluster solution(s). If your variables have large differences in scaling (for example, one variable is measured in dollars and the other in years), you should consider standardizing them (this can be done automatically by the Hierarchical Cluster Analysis procedure).

Hierarchical Clustering: Data

Slide 45

Case order. The cluster solution may depend on the order of cases in the file. You may want to obtain several different solutions with cases sorted in different random orders to verify the stability of a given solution.

Hierarchical Clustering: Data

Slide 46

Assumptions. The distance or similarity measures used should be appropriate for the data analyzed. Also, you should include all relevant variables in your analysis: omission of influential variables can result in a misleading solution. Because hierarchical cluster analysis is an exploratory method, results should be treated as tentative until they are confirmed with an independent sample.

Hierarchical Clustering: Data

Slide 47

Nearest neighbor or single linkage: the dissimilarity between clusters A and B is the minimum of all possible distances between cases in A and B.

Furthest neighbor or complete linkage: the dissimilarity between clusters A and B is the maximum of all possible distances between cases in A and B.

Between-groups linkage or average linkage: the dissimilarity between clusters A and B is the average of all possible distances between cases in A and B.

Within-groups linkage: the dissimilarity between clusters A and B is the average of all possible distances between the cases within the single new cluster obtained by combining A and B.

Hierarchical Clustering: Method

Slide 48
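The four linkage definitions differ only in how they aggregate the pairwise distances. A minimal sketch for one-dimensional cases (the values are illustrative):

```python
from itertools import combinations

def pair_distances(A, B):
    # all possible case-to-case distances between clusters A and B
    return [abs(a - b) for a in A for b in B]

def single_linkage(A, B):
    # nearest neighbor: minimum pairwise distance
    return min(pair_distances(A, B))

def complete_linkage(A, B):
    # furthest neighbor: maximum pairwise distance
    return max(pair_distances(A, B))

def average_linkage(A, B):
    # between-groups average
    d = pair_distances(A, B)
    return sum(d) / len(d)

def within_groups_linkage(A, B):
    # average over all pairs inside the merged cluster A ∪ B
    merged = list(A) + list(B)
    d = [abs(a - b) for a, b in combinations(merged, 2)]
    return sum(d) / len(d)
```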

Centroid clustering: the dissimilarity between clusters A and B is the distance between the centroid of the cases in cluster A and the centroid of the cases in cluster B.

Ward's method: the dissimilarity between clusters A and B is the "loss of information" from joining the two clusters, measured by the increase in the error sum of squares.

Median clustering: the dissimilarity between clusters A and B is the distance between the SPSS-determined median for the cases in cluster A and the median for the cases in cluster B.

Hierarchical Clustering: Method

All three methods should use squared Euclidean distance rather than Euclidean distance.

Slide 49

Euclidean distance. The square root of the sum of the squared differences between values for the items. This is the default for interval data.

Squared Euclidean distance. The sum of the squared differences between the values for the items.

Pearson correlation. The product-moment correlation between two vectors of values.

Cosine. The cosine of the angle between two vectors of values.

Chebychev. The maximum absolute difference between the values for the items.

Block. The sum of the absolute differences between the values of the items. Also known as Manhattan distance.

Minkowski. The pth root of the sum of the absolute differences to the pth power between the values for the items.

Customized. The rth root of the sum of the absolute differences to the pth power between the values for the items.

Measure for Interval

Slide 50
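Several of these measures are easy to state directly in code. A sketch of the distance formulas for two value vectors (Pearson correlation and cosine omitted for brevity):

```python
def sq_euclidean(x, y):
    # sum of squared differences
    return sum((a - b) ** 2 for a, b in zip(x, y))

def euclidean(x, y):
    # square root of the sum of squared differences
    return sq_euclidean(x, y) ** 0.5

def chebychev(x, y):
    # maximum absolute difference
    return max(abs(a - b) for a, b in zip(x, y))

def block(x, y):
    # city-block / Manhattan: sum of absolute differences
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    # pth root of the sum of absolute differences to the pth power
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

Note that Minkowski generalizes the others: p = 1 gives block distance and p = 2 gives Euclidean distance.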

Z scores. Values are standardized to z scores, with a mean of 0 and a standard deviation of 1.

Range -1 to 1. Each value for the item being standardized is divided by the range of the values.

Range 0 to 1. The procedure subtracts the minimum value from each item being standardized and then divides by the range.

Maximum magnitude of 1. The procedure divides each value for the item being standardized by the maximum of the values.

Mean of 1. The procedure divides each value for the item being standardized by the mean of the values.

Standard deviation of 1. The procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Transform values

Slide 51
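These standardization options map onto simple formulas. A sketch (the z-score version here uses the sample standard deviation; check SPSS's documentation for the exact convention it applies):

```python
def z_scores(values):
    """Standardize to mean 0, SD 1 (sample SD, divisor n-1)."""
    n = len(values)
    mean = sum(values) / float(n)
    sd = (sum((v - mean) ** 2 for v in values) / (n - 1)) ** 0.5
    return [(v - mean) / sd for v in values]

def range_0_1(values):
    # subtract the minimum, then divide by the range
    lo, hi = min(values), max(values)
    return [(v - lo) / float(hi - lo) for v in values]

def max_magnitude_1(values):
    # divide each value by the maximum
    m = max(values)
    return [v / float(m) for v in values]

def mean_of_1(values):
    # divide each value by the mean
    mean = sum(values) / float(len(values))
    return [v / mean for v in values]
```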

Hierarchical Clustering: Method

Slide 52

Identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left.

Distance or similarity measures are generated by the Proximities procedure.

Hierarchical Clustering

Slide 53

Hierarchical Clustering: Statistics

Slide 54

Hierarchical Clustering: Statistics

Agglomeration schedule: displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.

Proximity matrix: gives the distances or similarities between items.

Cluster membership: displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Slide 55

Dendrograms can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.

Icicle plots display information about how cases are combined into clusters at each iteration of the analysis (the user can specify a range of clusters to be displayed). Orientation: a vertical or horizontal plot.

Hierarchical Clustering: Plot

Slide 56

Hierarchical Clustering: Plot

Slide 57

Hierarchical Clustering: Result

Dendrogram using Average Linkage (Between Groups)

Slide 58

Dendrogram with Ward linkage

Slide 59

K-means can handle a large number of cases, but it requires users to specify the number of clusters.

K-Means Clustering

Slide 60
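The requirement to fix k up front is visible in even a toy implementation: the caller must supply one initial center per cluster. A minimal one-dimensional sketch (not SPSS's algorithm; the data values are invented):

```python
def kmeans_1d(points, centers, max_iter=100):
    """Toy 1-D k-means: the caller supplies the k initial centers."""
    for _ in range(max_iter):
        # assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers

final = kmeans_1d([1.0, 1.5, 0.5, 10.0, 10.5, 9.5], [0.0, 5.0])
```

Here the update happens only after all points are assigned, which corresponds to the "Use running means: No" option described later.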

Iterate and classify: number of iterations, convergence criteria.

Classify only: no iteration.

K-Means Clustering: Method

Slide 61

Cluster Centers: the initial cluster centers and, if required, the file that contains the final cluster centers. In "Read initial from" we specify the file which contains the initial cluster centers, and in "Write final as" we specify the file which will contain the final cluster centers.

K-Means Clustering: Method

Slide 62

Iterate: Maximum Iterations (no more than 999) and Convergence Criterion. By default, 10 iterations and a convergence criterion of 0 are used.

Use running means. Yes: cluster centers change after the addition of each object. No: cluster centers are calculated after all objects have been allocated to a given cluster.

K-Means Clustering: Method

Slide 63

K-Means Clustering: Method

Slide 64

The output will show the following information: the initial cluster centers, the ANOVA table, and each case's distance from its cluster center.

K-Means Clustering: Statistics

Slide 65

K-Means Clustering: Method

Slide 66

K-Means

Slide 67

K-Means: Result

Initial Cluster Centers: vectors with their values based on the number of cluster variables.

Slide 68

K-Means: Result

Slide 69

Multidimensional scaling

WEEK 12

Courtesy: Angelina Anastasova and Natalia Jaworska, University of Ottawa

Slide 70

Multidimensional Scaling (MDS): What Is It?

Generally regarded as exploratory data analysis.

Reduces large amounts of data into easy-to-visualize structures.

Attempts to find structure (a visual representation) in a set of distance measures, e.g. dis/similarities between objects/cases.

Shows how variables/objects are related perceptually.

How? By assigning cases to specific locations in space, so that distances between points in space match the dis/similarities as closely as possible:

Similar objects: close points. Dissimilar objects: far-apart points.

Slide 71

MDS Example: City Distances

Distance matrix: symmetric.

Spatial map dimensions: 1: North/South; 2: East/West.

Cluster

Slide 72

The Process of MDS: The Data

Data for MDS: similarities, dissimilarities, distances, or proximities, reflecting the amount of dis/similarity or distance between pairs of objects. The distinction between similarity and dissimilarity data depends on the type of scale used:

Dissimilarity scale: low = high similarity, high = high dissimilarity.

Similarity scale: the opposite of a dissimilarity scale.

E.g., on a scale of 1-9 (1 being the same and 9 completely different), how similar are chocolate bars A and B? This is a dissimilarity scale.

SPSS requires dissimilarity scales.

Slide 73

Data Collection for MDS (1)

Direct/raw data: proximity values directly obtained from empirical, subjective scaling. E.g., rating or ranking dis/similarities (Likert scales).

Indirect/derived data: computed from other measurements, such as correlations or confusion data (based on mistakes) (Davidson, 1983).

Data collection: pairwise comparison, grouping/sorting tasks, direct ranking, objective methods (e.g. city distances).

Pairwise comparisons: all object pairs are randomly presented. Number of pairs = n(n-1)/2, where n = number of objects/cases. This can be a tedious and inefficient process.

Slide 74
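The growth of n(n-1)/2 is exactly what makes full pairwise comparison tedious; a one-line check:

```python
def num_pairs(n):
    # number of unordered object pairs to present: n(n-1)/2
    return n * (n - 1) // 2
```

Even 10 objects already require 45 comparisons, and 50 objects require 1,225.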

Types of MDS Models (1)

MDS models are classified according to:

1) Type of proximities:

Metric/quantitative: quantitative information/interval data about objects' proximities, e.g. city distances.

Non-metric/qualitative: qualitative information/nominal data about proximities, e.g. rank order.

2) Number of proximity matrices (distance, dis/similarity matrices). The proximity matrix is the input for MDS.

The above criteria yield:

1) Classical MDS: one proximity matrix (metric or non-metric).

2) Replicated MDS: several matrices.

3) Weighted MDS/Individual Difference Scaling: aggregate proximities and individual differences in a common MDS space.

Slide 75

Types of MDS (2)

More typical in the social sciences is a classification of MDS based on the nature of responses:

1) Decompositional MDS: subjects rate objects on an overall basis, an "impression," without reference to objective attributes. Produces a spatial configuration for each individual and a composite map for the group.

2) Compositional MDS: subjects rate objects on a variety of specific, pre-specified attributes (e.g. size). No maps for individuals, only composite maps.

Slide 76

Classical MDS uses Euclidean principles to model data proximities in geometrical space, where the distance d_ij between points i and j is defined as:

d_ij = sqrt( Σ_a (x_ia − x_ja)² )

where x_ia and x_ja specify the coordinates of points i and j on dimension a, respectively.

The modeled Euclidean distances are related to the observed proximities, δ_ij, by some transformation/function f. Most MDS models assume that the data have the form:

δ_ij = f(d_ij)

All MDS algorithms are a variation of the above (Davidson, 1983).

The MDS Model

Slide 77
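The Euclidean distance formula above, in code form; each point is a tuple of per-dimension coordinates (the values are invented):

```python
import math

def mds_distance(xi, xj):
    # d_ij = sqrt( sum over dimensions a of (x_ia - x_ja)^2 )
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

d = mds_distance((0.0, 0.0), (3.0, 4.0))
```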

Output of MDS

MDS map/perceptual map/spatial representation:

1) Clusters: groupings in an MDS spatial representation. These may represent a domain/subdomain.

2) Dimensions: hidden structures in the data; ordered groupings that explain similarity between items.

Axes are meaningless and orientation is arbitrary. In theory, there is no limit to the number of dimensions. In reality, the number of dimensions that can be perceived and interpreted is limited.

Slide 78

Diagnostics of MDS (1)

MDS attempts to find a spatial configuration X such that the following is true: f(δ_ij) ≈ d_ij(X)

Stress (Kruskal's) function: measures the degree of correspondence between the distances among points on the MDS map and the input matrix. It is the proportion of variance of the disparities not accounted for by the model:

Stress = sqrt( Σ_ij ( f(δ_ij) − d_ij(X) )² / Σ_ij d_ij(X)² )

Range 0-1: smaller stress = better representation.

Non-zero stress: some or all distances in the map are distortions of the input data.

Rule of thumb: ≤ 0.1 is excellent; ≥ 0.15 is not tolerable.

Slide 79
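Kruskal's Stress-1 can be computed directly from the disparities f(δ_ij) and the map distances d_ij. A sketch of the standard formula (the two input lists are assumed to be aligned pairwise):

```python
import math

def stress1(disparities, distances):
    # sqrt( sum (f(delta_ij) - d_ij)^2 / sum d_ij^2 )
    num = sum((dh - d) ** 2 for dh, d in zip(disparities, distances))
    den = sum(d ** 2 for d in distances)
    return math.sqrt(num / den)
```

A perfect reproduction of the input proximities gives stress 0.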

R² (RSQ): the proportion of variance of the disparities accounted for by the MDS procedure. R² ≥ 0.6 is an acceptable fit.

Weirdness index: correspondence of a subject's map with the aggregate map; used for outlier identification. Range 0-1: 0 indicates that the subject's weights are proportional to the average subject's weights; as the subject's score becomes more extreme, the index approaches 1.

Shepard diagram: a scatterplot of input proximities (X-axis) against output distances (Y-axis) for every pair of items. A step-line is produced. If map distances fall on the step-line, the input proximities are perfectly reproduced by the MDS model (dimensional solution).

Diagnostics of MDS (2)

Slide 80

Interpretation of Dimensions

Squeezing data into 2-D enables "readability" but may not be appropriate: a poor, distorted representation of the data (high stress).

Scree plot: stress vs. number of dimensions, e.g. for the city-distance data.

Primary objective in dimension interpretation: obtain the best fit with the smallest possible number of dimensions.

How does one assign "meaning" to dimensions?

Slide 81

Meaning of Dimensions

Subjective procedures: labelling the dimensions by visual inspection, subjective interpretation, and information from respondents. "Experts" evaluate and identify the dimensions.

Slide 82

Validating MDS Results

Split-sample comparison: the original sample is divided and a correlation between the variables is conducted.

Multi-sample comparison: a new sample is collected and a correlation is conducted between the old and new data.

Comparisons are done visually or with a simple correlation of coordinates or variables, assessing whether the MDS solution (dimensionality extraction) changes in a substantial way.

Slide 83

MDS Caveats

Respondents may attach different levels of importance to a dimension.

The importance of a dimension may change over time.

Interpretation of dimensions is subjective.

Generally, more than four times as many objects as dimensions should be compared for the MDS model to be stable.

Slide 84

"Advantages" of MDS

A dimensionality "solution" can be obtained from individuals, giving insight into how individuals differ from aggregate data.

Reveals dimensions without the need for defined attributes.

Dimensions that emerge from MDS can be incorporated into regression analysis to assess their relationship with other variables.

Slide 85

"Disadvantages" of MDS

Provides a global measure of dis/similarity but does not provide much insight into subtleties (Street et al., 2001).

Increased dimensionality is difficult to represent and decreases intuitive understanding of the data; the model of the data then becomes as complicated as the data itself.

Determination of the meanings of dimensions is subjective.

Slide 86

Slide 87

"SPSSing" MDS

In the SPSS Data Editor window, click: Analyze > Scale > Multidimensional Scaling

Slide 88

Select four or more Variables that you want to test. You may select a single variable for the Individual Matrices for window (depending on the distances option selected).

Slide 89

If the Data are distances option (e.g. city distances) is selected, click on the Shape button to define the characteristics of the dissimilarity/proximity matrices.

If Create distance from data is selected, click on the Measure button to control the computation of dissimilarities, to transform values, and to compute distances.

Slide 90

In the Multidimensional Scaling dialog box, click on the Model button to control the level of measurement, conditionality, dimensions, and the scaling model.

Click on the Options button to control the display options, iteration criteria, and treatment of missing values.

Slide 91

MDS: A Psychological Example

Multidimensional scaling modelling approach to latent profile analysis in psychological research (Ding, 2006).

Basic premise: utilize MDS to investigate types or profiles of people. "Profile": from applied psychology, where test batteries are used to extract and construct distinctive features/characteristics in people.

The MDS method was used to:

Derive profiles (dimensions) that could provide information regarding psychosocial adjustment patterns in adolescents.

Assess whether individuals could follow different profile patterns than those extracted from group data, i.e. deviations from the derived normative profiles.

Slide 92

Study Details: Methodology

Participants: college students (mean age = 23 years, n = 208).

Instruments: Self-Image Questionnaire for Young Adolescents (SIQYA). Variables: Body Image (BI), Peer Relationships (PR), Family Relationships (FR), Mastery & Coping (MC), Vocational-Educational Goals (VE), and Superior Adjustment (SA).

Three mental health measures of well-being:

Kandel Depression Scale

UCLA Loneliness Scale

Life Satisfaction Scale

Slide 93

Data for MDS

Scored data for the MDS profile analysis; sample data for 14 individuals.

BI = body image, PR = peer relations, FR = family relations, MC = mastery & coping, VE = vocational & educational goal, SA = superior adjustment, PMI-1 = profile match index for Profile 1, PMI-2 = profile match index for Profile 2, LS = life satisfaction, Dep = depression, PL = psychological loneliness

Slide 94

The Analysis: Step by Step

Step 1: Estimate the number of profiles (dimensions) from the latent variables.

Kruskal's stress = 0.00478: an excellent stress value.

RSQ = 0.9998.

Configuration derived in 2 dimensions.

Slide 95

Scale values of the two MDS profiles (dimensions) in psychosocial adjustment.

Normative profiles of psychosocial adjustment in young adults. Each profile represents a prototypical individual.

Slide 96

References

Davidson, M. L. (1983). Multidimensional Scaling. New York: J. Wiley and Sons.

Ding, C. S. (2006). Multidimensional scaling modelling approach to latent profile analysis in psychological research. International Journal of Psychology, 41(3), 226-238.

Kruskal, J. B., & Wish, M. (1978). Multidimensional Scaling. Sage.

Street, H., Sheeran, P., & Orbell, S. (2001). Exploring the relationship between different psychosocial determinants of depression: A multidimensional scaling analysis. Journal of Affective Disorders, 64, 53-67.

Takane, Y., Young, F. W., & de Leeuw, J. (1977). Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features. Psychometrika, 42(1), 7-67.

Young, F. W., Takane, Y., & Lewyckyj, R. (1978). Three notes on ALSCAL. Psychometrika, 43(3), 433-435.

http://www.analytictech.com/borgatti/profit.htm

http://www2.chass.ncsu.edu/garson/pa765/mds.htm

http://www.terry.uga.edu/~pholmes/MARK9650/Classnotes4.pdf

Slide 97

MDS - SPSS

Slide 98

MDS - SPSS

Slide 99

MDS - SPSS

Slide 100

MDS - SPSS

Slide 101

MDS - SPSS

Slide 102

MDS - SPSS

Slide 103

MDS - SPSS

Slide 104

Combining MDS with clustering methods:

Draw clusters on MDS plots using MS Paint

Identify cluster labels

A field map

Slide 105

Mapping the field of IR

Author co-citation map of the field of Information Retrieval (1992-1997).

Data: 1,466 IR-related papers selected from 367 journals, with 44,836 citations.

Slide 106

McCain, K. (1990). Mapping authors in intellectual space: A technical overview. Journal of the American Society for Information Science & Technology, 41(6), 433-443.

Ding, Y., Chowdhury, G., & Foo, S. (1999). Mapping the intellectual structure of information retrieval: An author cocitation analysis, 1987-1997. Journal of Information Science, 25(1), 67-78.

Examples

Slide 107

Factor analysis

Slide 108

To identify underlying variables, or factors, that explain the pattern of correlations within a set of observed variables. Factor analysis is a data-reduction technique: it identifies a small number of factors that explain most of the variance observed in a larger number of variables.

Assumption: variables or cases should be independent (we can use correlation to check whether some variables are dependent).

Factor Analysis

Slide 109

Slide 110

The Coefficients option produces the R-matrix, and the Significance levels option will produce a matrix indicating the significance value of each correlation in the R-matrix. You can also ask for the Determinant of this matrix, and this option is vital for testing for multicollinearity or singularity.

Descriptives

Slide 111

Slide 112

The scree plot was described earlier and is a useful way of establishing how many factors should be retained in an analysis. The unrotated factor solution is useful in assessing the improvement of interpretation due to rotation. If the rotated solution is little better than the unrotated solution, then it is possible that an inappropriate (or less optimal) rotation method has been used.

Extraction

Slide 113

Slide 114

The interpretability of factors can be improved through rotation. Rotation maximizes the loading of each variable on one of the extracted factors whilst minimizing the loading on all other factors. Rotation works by changing the absolute values of the variables whilst keeping their differential values constant. If you expect the factors to be independent, then you should choose one of the orthogonal rotations (e.g. varimax).

Rotation

Slide 115

Slide 116

This option allows you to save factor scores for each subject in the data editor. SPSS creates a new column for each factor extracted and then places the factor score for each subject within that column. These scores can then be used for further analysis, or simply to identify groups of subjects who score highly on particular factors.

Score

Slide 117

Slide 118

SPSS will list variables in the order in which they are entered into the data editor. Although this format is often convenient, when interpreting factors it can be useful to list variables by size. By selecting Sorted by size, SPSS will order the variables by their factor loadings. There is also the option to Suppress absolute values less than a specified value (by default 0.1). This option ensures that factor loadings within ±0.1 are not displayed in the output, which is useful for assisting interpretation.

Options

Slide 119

It should be clear that the first few factors explain relatively large amounts of variance (especially factor 1), whereas subsequent factors explain only small amounts of variance. SPSS then extracts all factors with eigenvalues greater than 1; here we have 23 factors. The eigenvalues associated with these factors are displayed in the columns labelled Extraction Sums of Squared Loadings.

Slide 120

Result

Slide 121

Factor and its members

Pick up loadings with absolute value > 0.7 to interpret the factor, and loadings > 0.4 to report as members of the factor.

Slide 122
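The two loading thresholds can be applied mechanically. A sketch with invented variable names and loadings:

```python
def split_by_loading(loadings, interpret_cut=0.7, member_cut=0.4):
    """Variables with |loading| > 0.7 name the factor; |loading| > 0.4 are members."""
    interpret = {v for v, l in loadings.items() if abs(l) > interpret_cut}
    members = {v for v, l in loadings.items() if abs(l) > member_cut}
    return interpret, members

# hypothetical loadings on one factor
loadings = {"recall": 0.82, "precision": -0.75, "speed": 0.45, "cost": 0.10}
interpret, members = split_by_loading(loadings)
```

Note the absolute value: a strong negative loading (precision, -0.75) still helps interpret the factor.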

Scree Plot

Slide 123

Mapping the field of IR

Ding, Y., Chowdhury, G., & Foo, S. (1999). Mapping the intellectual structure of information retrieval: An author cocitation analysis, 1987-1997. Journal of Information Science, 25(1), 67-78.

Slide 124

Try other networks: a co-author network, a journal co-citation network.

Try factor analysis on your own survey data.

Broader Thinking