Roman Sergienko, PhD student, and Tatiana Gasanova, PhD student (Ulm University, Germany); Shaknaz Akhmedova, PhD student (Siberian State Aerospace University, Krasnoyarsk)
Slide1
Opinion Mining and Topic Categorization with Novel Term Weighting
Roman Sergienko, Ph.D. student
Tatiana Gasanova, Ph.D. student
Ulm University, Germany
Shaknaz Akhmedova, Ph.D. student
Siberian State Aerospace University, Krasnoyarsk, Russia
Slide2
Contents
Motivation
Databases
Text preprocessing methods
The novel term weighting method
Features selection
Classification algorithms
Results of numerical experiments
Conclusions
Slide3
Motivation
The goal of the work is to evaluate the competitiveness of the novel term weighting method in comparison with standard techniques for opinion mining and topic categorization.
The criteria are:
Macro F-measure for the test set
Computational time
Slide4
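The macro F-measure criterion named on the Motivation slide can be illustrated with a minimal Python sketch. It assumes the macro F-measure is the unweighted mean of the per-class F1 scores; a common alternative instead computes F from the class-averaged precision and recall, and the slides do not say which variant was used. The function name `macro_f_measure` is hypothetical.

```python
from collections import Counter


def macro_f_measure(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed definition)."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    f_scores = []
    for c in classes:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        if precision + recall:
            f_scores.append(2 * precision * recall / (precision + recall))
        else:
            f_scores.append(0.0)
    return sum(f_scores) / len(f_scores)


if __name__ == "__main__":
    # Three sentiment classes, as in the Books and Games corpora.
    print(macro_f_measure([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0]))
```

Because every class contributes equally to the mean, a small class that is classified poorly lowers the score as much as a large one.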
Databases: DEFT’07 and DEFT’08
Corpus   Size                                                        Classes
Books    Train size = 2074, Test size = 1386, Vocabulary = 52507     0: negative, 1: neutral, 2: positive
Games    Train size = 2537, Test size = 1694, Vocabulary = 63144     0: negative, 1: neutral, 2: positive
Debates  Train size = 17299, Test size = 11533, Vocabulary = 59615   0: against, 1: for

Corpus   Size                                                        Classes
T1       Train size = 15223, Test size = 10596, Vocabulary = 202979  0: Sport, 1: Economy, 2: Art, 3: Television
T2       Train size = 23550, Test size = 15693, Vocabulary = 262400  0: France, 1: International, 2: Literature, 3: Science, 4: Society
Slide5
The existing text preprocessing methods
Binary preprocessing
TF-IDF (Salton and Buckley, 1988)
Confident Weights (Soucy and Mineau, 2005)
Slide6
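As a point of reference for the existing preprocessing methods listed on the previous slide, here is a minimal Python sketch of binary preprocessing and one common TF-IDF variant. It assumes the weight is the relative term frequency multiplied by log(N / df); Salton and Buckley describe several variants, and the slides do not state which one was used. The function names are hypothetical.

```python
import math
from collections import Counter


def binary_weights(tokens):
    """Binary preprocessing: 1 for every term that occurs in the document."""
    return {term: 1 for term in set(tokens)}


def tf_idf(documents):
    """documents: list of token lists -> list of {term: weight} dicts.

    Assumed variant: (count / document length) * log(N / document frequency).
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document

    weighted = []
    for tokens in documents:
        counts = Counter(tokens)
        weighted.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weighted


if __name__ == "__main__":
    docs = [["good", "book"], ["bad", "book"]]
    print(binary_weights(docs[0]))
    print(tf_idf(docs))
```

Both schemes are unsupervised: neither looks at class labels, which is the main difference from ConfWeight and from the novel method.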
The novel term weighting method
$L$ – the number of classes;
$n_i$ – the number of instances of the $i$-th class;
$N_{ji}$ – the number of occurrences of the $j$-th word in all instances of the $i$-th class;
$T_{ji} = N_{ji} / n_i$ – the relative frequency of the $j$-th word in the $i$-th class;
$R_j = \max_i T_{ji}$, $S_j = \arg\max_i T_{ji}$ – the index of the class assigned to the $j$-th word.
Slide7
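The quantities defined on the slide "The novel term weighting method" can be computed as in the following Python sketch. It covers only T_ji, R_j and S_j as defined there. The function names, and the representation of documents as token lists with integer labels, are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def class_relative_frequencies(documents, labels):
    """Return T[word][cls] = N_ji / n_i.

    n_i  : number of training instances of class i
    N_ji : number of occurrences of word j in all instances of class i
    """
    n = Counter(labels)
    occurrences = defaultdict(Counter)
    for tokens, cls in zip(documents, labels):
        for word in tokens:
            occurrences[word][cls] += 1
    return {
        word: {cls: occurrences[word][cls] / n[cls] for cls in n}
        for word in occurrences
    }


def assign_classes(T):
    """Return (R, S): R_j = max_i T_ji and S_j = the class attaining it."""
    R, S = {}, {}
    for word, freqs in T.items():
        S[word] = max(freqs, key=freqs.get)  # ties go to the first class seen
        R[word] = freqs[S[word]]
    return R, S


if __name__ == "__main__":
    docs = [["good", "good", "plot"], ["bad", "plot"], ["good"]]
    labels = [2, 0, 2]
    T = class_relative_frequencies(docs, labels)
    print(assign_classes(T))
```

Dividing by n_i rather than by the total word count keeps large classes from dominating the choice of S_j.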
Features selection
Calculate the relative frequency of each word in each class.
For each word, choose the class with the maximum relative frequency.
For each utterance to be classified, calculate the sums of the weights of the words that belong to each class.
Number of attributes = number of classes
Slide8
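The feature selection steps from the previous slide reduce every utterance to one attribute per class. A minimal Python sketch follows, under stated assumptions: the term weights and word-to-class assignments are supplied as plain dictionaries with made-up values, words unseen in training are skipped, and repeated words are counted once per occurrence. None of these details are given on the slides, and the function name is hypothetical.

```python
def class_sum_features(tokens, weights, assigned_class, classes):
    """Map an utterance to one attribute per class.

    Each attribute is the sum of the weights of the utterance's words
    that were assigned to that class during training.
    """
    sums = {cls: 0.0 for cls in classes}
    for word in tokens:
        if word in assigned_class:  # assumption: ignore unseen words
            sums[assigned_class[word]] += weights[word]
    return [sums[cls] for cls in classes]


if __name__ == "__main__":
    # Hypothetical weights and assignments, for illustration only.
    weights = {"good": 0.9, "bad": 0.8, "plot": 0.1}
    assigned = {"good": 2, "bad": 0, "plot": 0}
    features = class_sum_features(
        ["good", "plot", "unseen", "good"], weights, assigned, classes=[0, 1, 2]
    )
    print(features)
```

The resulting vector has as many entries as there are classes, so a vocabulary of tens of thousands of words collapses to two to five attributes on the corpora used here.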
Classification algorithms
k-nearest neighbors algorithm with distance weighting (k varied from 1 to 15);
kernel Bayes classifier with Laplace correction;
neural network with error backpropagation (standard settings in RapidMiner);
Rocchio classifier with different metrics and parameters;
support vector machine (SVM) generated and optimized with Co-Operation of Biology Related Algorithms (COBRA) (Akhmedova and Semenkin, 2013).
Slide9
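Of the classifiers listed on the previous slide, k-nearest neighbors with distance weighting is the simplest to illustrate. The Python sketch below assumes Euclidean distance and inverse-distance votes. The experiments used RapidMiner, whose exact weighting scheme is not described on the slides, so this is an illustration of the idea rather than a reproduction of that setup. The function name and the `eps` parameter are hypothetical.

```python
import math
from collections import defaultdict


def knn_predict(train_x, train_y, query, k=3, eps=1e-9):
    """Distance-weighted k-NN: each neighbor votes with weight 1 / distance."""
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )
    votes = defaultdict(float)
    for distance, label in neighbors[:k]:
        votes[label] += 1.0 / (distance + eps)  # eps avoids division by zero
    return max(votes, key=votes.get)


if __name__ == "__main__":
    train_x = [(0, 0), (0, 1), (5, 5), (5, 6), (6, 5)]
    train_y = [0, 0, 1, 1, 1]
    # With k=5 a plain majority vote would pick class 1 (three neighbors),
    # but the two class-0 points are much closer to the query.
    print(knn_predict(train_x, train_y, (1, 1), k=5))
```

After feature selection the vectors passed to such a classifier have only as many dimensions as there are classes, which keeps the distance computations cheap.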
Computational effectiveness
[Figure panels: DEFT’07, DEFT’08]
Slide10
The best values of F-measure
Problem  F-measure  The best known value  Term weighting method  Classification algorithm
Books    0.619      0.603                 The novel TW           SVM
Games    0.720      0.784                 ConfWeight             k-NN
Debates  0.714      0.720                 ConfWeight             SVM
T1       0.856      0.894                 The novel TW           SVM
T2       0.851      0.880                 The novel TW           SVM
Slide11
Comparison of ConfWeight and the novel term weighting
Problem  ConfWeight  The novel TW  Difference
Books    0.588       0.619         +0.031
Games    0.720       0.712         -0.008
Debates  0.714       0.700         -0.014
T1       0.855       0.856         +0.001
T2       0.851       0.820         +0.031
Slide12
Conclusions
The novel term weighting method gives similar or better classification quality than the ConfWeight method, but it requires the same amount of time as TF-IDF.