Slide1
Expertise Finding for Question Answering (QA) Services
October 16, 2014
Department of Knowledge Service Engineering
Prof. Jae-Gil Lee
Slide2
Brief Bio
Currently an associate professor at the Department of Knowledge Service Engineering, KAIST
Homepage: http://dm.kaist.ac.kr/jaegil
Lab homepage: http://dm.kaist.ac.kr/
Previously worked at the IBM Almaden Research Center and the University of Illinois at Urbana-Champaign
Areas of Interest: Data Mining and Big Data
Slide3
Table of Contents
Community-based Question Answering (CQA) Services
Background and Motivation
Methodology Overview
Evaluation Results
Social Search Engines for Location-Based Questions
Background and Motivation
System Architecture and User Interface
Evaluation Results
Slide4
Question Answering (QA) Services
QA services are good at
Recently updated information
Personalized information
Advice & opinion
[Budalakoti et al., 2010]
[Figure: Questions, Answers, Knowledge Base, Search, Experts]
Slide5
Community-based Question Answering (CQA) Services
Naver Knowledge-In, Yahoo! Answers
50,000 questions per day, 160,000 questions per day
Slide6
Motivation of Our Study
Most contributions (i.e., answers) in CQA services are made by a small number of heavy users
Recently-joined users are prone to leave CQA services very soon
Only 8.4% of answerers remained after a year
Making the long tail stay longer before they leave is of prime importance towards the success of the services
Slide7
Problem Setting
To whom does the service provider need to pay special attention?
Recently-joined (i.e., light) users who are likely to become contributive (i.e., heavy) users
Goal: estimating the likelihood of a light user becoming a heavy user (mainly by his/her expertise)
Challenges: lack of information about the light user
Keeping users on the hook (어장관리)?
Slide8
Intuition behind Our Methodology
A person’s active vocabulary reveals his/her knowledge
Vocabulary has sharable characteristics so that domain-specific words are repeatedly used by expert answerers
[Figure: Q&A 1 by Answerer 1 and Q&A 2 by Answerer 2. Domain-specific vocabularies that appear in both (SSD, NAND, ECC, RAM) illustrate the sharable characteristics; common vocabularies (Device, Memory, Computer, Operation, Data, Drive) illustrate the level difference]
Slide9
Estimated Expertise
[Figure: Heavy Users, Words, Light Users]
The more expert a user is, the higher the level of the words he/she uses.
Slide10
Availability
Simply measuring the number of a user’s answers, with each answer weighted in proportion to its recency
Slide11
Answer Affordance
Defined as the likelihood of a light user becoming a heavy user if he/she is treated specially
Considering both expertise and availability
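A minimal Python sketch of how the three quantities on the preceding slides (estimated expertise, availability, answer affordance) could fit together. The averaging rules, the exponential recency decay with `half_life_days`, and the product used to combine expertise and availability are illustrative assumptions, not the formulas from the paper; all function names are hypothetical.

```python
from collections import defaultdict


def word_levels(heavy_answers, heavy_expertise):
    """Assumed rule: a word's level is the mean expertise of the heavy users who used it."""
    totals, counts = defaultdict(float), defaultdict(int)
    for user, words in heavy_answers.items():
        for word in set(words):  # count each user at most once per word
            totals[word] += heavy_expertise[user]
            counts[word] += 1
    return {word: totals[word] / counts[word] for word in totals}


def estimated_expertise(words, levels):
    """Assumed rule: a light user's expertise is the mean level of the known words used."""
    known = [levels[w] for w in words if w in levels]
    return sum(known) / len(known) if known else 0.0


def availability(answer_ages_days, half_life_days=30.0):
    """Recency-weighted answer count: an answer half_life_days old counts as 0.5."""
    return sum(0.5 ** (age / half_life_days) for age in answer_ages_days)


def affordance(expertise, avail):
    """Assumed combination: the score is high only if both factors are high."""
    return expertise * avail


# Hypothetical toy data: two heavy users and one light user.
levels = word_levels(
    {"heavy1": ["ssd", "nand", "ecc"], "heavy2": ["ssd", "drive"]},
    {"heavy1": 0.9, "heavy2": 0.3},
)
light_expertise = estimated_expertise(["nand", "ecc", "computer"], levels)
light_availability = availability([0, 30])  # one answer today, one 30 days ago
print(affordance(light_expertise, light_availability))
```

Ranking light users by this score and keeping the first k would give a top-k list of the kind used in the evaluation.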
Slide12
Data Set
Collected from Naver Knowledge-In (KiN, 지식인)
Spanning ten years (from Sept. 2002 to Aug. 2012)
Including two categories: Computers and Travel
Computers: factual information, Travel: subjective opinions
The entropy was used for measuring the expertise of a user, working well especially for the categories where factual expertise is primarily sought after [Adamic et al., 2008]
Statistics
                Computers    Travel
# of answers    3,926,794    585,316
# of words      191,502      232,076
# of users      228,369      44,866
Slide13
Evaluation Setting (1/2)
Finding the top-k users by Affordance() for light users (our methodology)
Retrieving the top-k directory experts managed by KiN (competitor)
Measuring the two measures for the next one month
User availability: the ratio of the number of the top-k users who appeared on the day to the total number of users who appeared on that day
Answer possession: the ratio of the number of the answers posted by the top-k users on the day to the total number of answers posted on that day
Slide14
Evaluation Setting (2/2)
Ten-year period
Timeline: Sept. 2002, July 2011, July 2012, Aug. 2012
Used for deriving the word levels
Used for finding top-k experts by our methodology
Picked up the top-k directory experts managed by KiN
Monitored the user availability and answer possession
Slide15
[Figure: The result of the answer possession; panels (a) Computers and (b) Travel; series top-200 and top-400]
[Figure: The result of the user availability; panels (a) Computers and (b) Travel; series top-200 and top-400]
Slide16
See the paper for the technical details.
Sung, J., Lee, J., and Lee, U., "Booming Up the Long Tails: Discovering Potentially Contributive Users in Community-Based Question Answering Services," In Proc. 7th Int'l AAAI Conf. on Weblogs and Social Media (ICWSM), Cambridge, Massachusetts, July 2013.
This paper received the Best Paper Award at AAAI ICWSM-13.
Slide17
Table of Contents
Community-based Question Answering (CQA) Services
Background and Motivation
Methodology Overview
Evaluation Results
Social Search Engines for Location-Based Questions
Background and Motivation
System Architecture and User Interface
Evaluation Results
Slide18
Social Search (1/2)
A new paradigm of knowledge acquisition that relies on the people of a questioner’s social network
Slide19
Social Search (2/2)
If you want to get some opinions or advice from your online friends, what do you do?
Not knowing whom to ask
Knowing whom to ask
Taking advantage of both approaches
Social Search
Slide20
KiN Here (지식인 위치질문, "Knowledge-In location questions")
A query is routed by finding a match between a target location of a query and a relevant location of a user
Added at the dong (neighborhood) level
Slide21
Location-Based Questions
Informally defined as “search for a business or place of interest that is tied to a specific geographical location” [Amin et al., 2009]
Very popular especially in mobile search and typically subjective
Mobile search is estimated to comprise 10%∼30% of all searches
About 9∼10% of the queries from Yahoo! mobile search, over 15% of 1 million Google queries from PDA devices, and about 10% of 10 million Bing mobile queries were identified as location-based questions
In a set of location-based questions, 63% of them were non-factual, and the remaining 37% of them were factual
Mobile social search is the best way to process location-based questions
Slide22
Glaucus: A Social Search Engine for Location-Based Questions
1. Asking a question to Glaucus
2. Selecting proper experts
3. Routing the question to the experts
4. Returning an answer to the questioner
5. (Optional) Rating the answer
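Steps 2 to 4 above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `User` class, the representation of a question and of a user's check-in history as plain sets of aspect strings, and the Jaccard similarity used as the score are all hypothetical and are not taken from the Glaucus implementation.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    # Aspects gathered from the user's check-ins, e.g. {"gangnam", "cafe"} (hypothetical).
    aspects: set = field(default_factory=set)


def score(question_aspects, user):
    """Assumed measure: Jaccard similarity between question aspects and user aspects."""
    union = question_aspects | user.aspects
    return len(question_aspects & user.aspects) / len(union) if union else 0.0


def select_experts(question_aspects, users, k=2):
    """Step 2: rank users by score and keep the top k that have a non-zero score."""
    ranked = sorted(users, key=lambda u: score(question_aspects, u), reverse=True)
    return [u for u in ranked[:k] if score(question_aspects, u) > 0]


def route(question_aspects, users, answer_fn, k=2):
    """Steps 3 and 4: send the question to each expert and collect the answers by name."""
    return {u.name: answer_fn(u) for u in select_experts(question_aspects, users, k)}


# Hypothetical toy data: three users and one question about cafes in Gangnam.
users = [
    User("a", {"gangnam", "cafe", "evening"}),
    User("b", {"gangnam", "bar"}),
    User("c", {"hongdae", "cafe", "brunch"}),
]
question = {"gangnam", "cafe"}
print([u.name for u in select_experts(question, users)])
print(route(question, users, lambda u: f"answer from {u.name}"))
```

The optional rating of step 5 could be fed back into `score`, for example as a per-user weight, but that is left out to keep the sketch short.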
[Figure: Glaucus Social Search Engine with a User Database filled by Crawling; arrows between the Questioner, Glaucus, and Users are labeled 1: Query, 2: Selected Experts, 3: Query, 4: Answer, 5: Feedback]
Slide23
User Interface
An Android app has been developed and is under (closed) beta testing
[Figure: Questioner and Answerer screens]
Slide24
Data Collection
Being able to collect who visited where and when on geosocial networking services such as Foursquare
Users check in to a venue and may also leave a tip
Our crawler collects such information upon user approval
Slide25
Expert Finding
[Figure: Location Aspect Model with the aspects Venue, Location, Category, Time, and Misc. for both the Question (from the Questioner) and the Other Users; a Similarity Calculation, together with an "Online Friend?" check, gives each user a Score, and the Top-k users are selected]
Slide26
Evaluation Setting
Collected check-ins and tips from Foursquare (foursquare.com)
Confined to the places in the Gangnam District
Ranging from April 2012 to December 2012
Statistics
Variable                Value
# of users              9,163
# of places (venues)    1,220
# of check-ins          243,114
# of tips               40,248
Slide27
Evaluation Results
DCG                Set 1    Set 2    Set 3
SocialTelescope    3.94     3.99     4.07
Aardvark           6.61     6.31     6.68
Glaucus            8.25     8.82     7.78
[Figure: answer-quality scores 2.37 and 1.97]
Qualification of the Experts: Two human judges investigated the profiles of the experts selected by the three systems for 30 questions (distributed to 3 sets) and scored them on a 3-point scale.
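For reference, DCG (discounted cumulative gain) rewards rankings that place highly rated experts near the top of the list. The Python sketch below uses the common log2 discount and treats the judges' scores as graded relevance; the slide does not say which DCG variant or which aggregation of the two judges' scores was used, so both are assumptions, and the example scores are made up.

```python
import math


def dcg(relevances):
    """DCG with the common log2 discount: sum of rel_i / log2(i + 1), ranks starting at 1."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


# Hypothetical judge scores (1 to 3) for the experts returned for one question,
# listed in the order in which the system ranked them.
good_ranking = [3, 2, 1]
poor_ranking = [1, 2, 3]
print(dcg(good_ranking), dcg(poor_ranking))
```

The same set of scores yields a higher DCG when the best-rated experts come first, which is why the measure suits a comparison of expert rankings.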
Quality of the Answers: Two human judges examined the quality of the answers, both from experts and non-experts, and scored them on a 3-point scale.
Slide28
Mobile User Availability
Motivation
Study Methodology
[Figure: Context (Smart Phone Log; External Information such as Time and Date) and Availability are turned into 26 Features plus a Class Label for Training an Availability Classifier (Decision Tree, SVM, Random Forest, ...); at Prediction time the Classification Model takes 26 Features and outputs Availability]
Slide29
User Behavior Collection
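Columns: Category, Data Type, Collection Method
Category: Smartphone Context Data (collection method: background collection)
Battery information (remaining battery level, charging status, charging mode)
Call information (call start time, call duration, incoming/outgoing)
Message information (message time, incoming/outgoing)
GPS information (latitude, longitude)
Device information (vibration mode, silent mode, airplane mode, CPU usage, headphone mode, screen on)
Ambient information (ambient light level, ambient noise level)
WIFI information (WIFI On/Off, SSID, signal strength)
Cellular information (Cellular On/Off, signal strength)
Application information (application name, application running time)
Category: Availability Data (collection method: manual input)
Whether the user is able to respond at a specific time
Slide30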
Preliminary Evaluation Results
Accuracy
10-fold cross validation
10 users for 5 weeks
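A rough Python sketch of the evaluation protocol named above: 10-fold cross-validation of a classifier, shown here with the "always available" baseline that the slide lists. The fold construction (contiguous, unshuffled), the function names, and the toy labels are assumptions for illustration; the slide does not describe how the folds were built or which toolkit was used.

```python
def k_fold_indices(n, k=10):
    """Split range(n) into k contiguous test folds whose sizes differ by at most one."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(features, labels, train_fn, k=10):
    """Mean accuracy over k folds; train_fn(X, y) must return a predict(x) function.

    Assumes len(labels) >= k so that no test fold is empty.
    """
    accuracies = []
    for test_idx in k_fold_indices(len(labels), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(features) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train_fn(train_x, train_y)
        correct = sum(predict(features[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


def always_available(train_x, train_y):
    """Baseline: predict 'available' (True) regardless of the input."""
    return lambda x: True


# Hypothetical toy data: 20 samples, availability alternating between True and False.
toy_labels = [True, False] * 10
toy_features = [[i] for i in range(20)]
print(cross_validate(toy_features, toy_labels, always_available))
```

Any real classifier can be plugged in through `train_fn`; with time-ordered phone logs, shuffling or grouping samples by user before splitting would likely matter, which this sketch does not attempt.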
Important Features
1st: Time, Day of Week
2nd: Running Apps
3rd: WIFI SSID, # of Apps (30 mins), Time of Day
Model                          Accuracy
Baseline (Always Available)    0.53
Naïve Bayesian                 0.66
SVM                            0.64
KNN                            0.62
Decision Tree                  0.64
AdaBoost                       0.61
Random Forest                  0.7
Slide31
See the paper for the technical details.
Choy, M., Lee, J., Gweon, G., and Kim, D., "Glaucus: Exploiting the Wisdom of Crowds for Location-Based Queries in Mobile Environments," In Proc. 8th Int'l AAAI Conf. on Weblogs and Social Media (ICWSM), Ann Arbor, Michigan, June 2014.
Slide32
Thank you very much!
Any Questions?
E-mail: jaegil@kaist.ac.kr
Homepage: http://dm.kaist.ac.kr/