[Figure 3 diagram text: the three main DCAT classes and their properties]

    dcat:Dataset: dct:title, dct:description, dct:publisher, dcat:theme, dcat:keyword, dcat:landingPage, ...
    dcat:Catalog: dct:title, dct:description, foaf:homepage, ...
    dcat:Distribution: dct:title, dct:description, dct:license, dct:rights, dcat:accessURL, dcat:downloadURL, dcat:mediaType, dct:format, ...
    Relationships: dcat:dataset, dcat:distribution

[Figure 4 diagram text: a dcat:Distribution and its REST representation]

dcat:Distribution (in Turtle)

    @prefix : <https://api.example.com/> .

    :dist_1 a dcat:Distribution ;
        dct:title "Example dist 1" ;
        dct:description "DBpedia SPARQL endpoint" ;
        dct:license <http://creativecommons.org/licenses/by-nc/2.5/rdf> ;
        dcat:accessURL <http://dbpedia.org/sparql> ;
        dcat:mediaType "application/sparql-results+xml" .

    </void/dist_1.ttl> void:inDataset :dist_1 .

REST (in JSON)

    GET https://api.example.com/dist_1

    {
      "title": "Example dist 1",
      "description": "DBpedia SPARQL endpoint.",
      "links": [
        { "href": "/dist_1", "rel": "self", "method": "GET" },
        { "href": "http://creativecommons.org/licenses/by-nc/2.5/rdf", "rel": "license", "type": "application/rdf+xml", "method": "GET" },
        { "href": "http://dbpedia.org/sparql", "rel": "dcat-access", "type": "application/sparql-results+xml", "method": "GET" },
        { "href": "/void/dist_1.ttl", "rel": "describedby", "type": "text/turtle", "method": "GET" }
      ]
    }

6. CONCLUDING REMARKS

In this paper we describe an architecture for ingesting, processing, and publishing real-time big data for Web Observatories. We also describe how the application of a real-time Web Observatory can be used to measure several different characteristics of a social [machine].

The architecture presented in this paper provides a scalable and secure solution for measuring multiple social machines using a common schema. From a Web Observatory perspective, being able to provide a level of access control over datasets and streams is a fundamental principle in order to allow data owners to retain control of their data. From a social machines researcher's perspective, the ability to access unified and, in certain circumstances, integrated real-time streams of activity is an essential resource to understand, analyse, and possibly make predictions about the current state of a social machine's health.

An immediate challenge for SUWO, and the network of Web Observatories, is to improve the current approach of integrating real-time streams, and also to be able to query incoming streams in order to filter the data for specific purposes. Future work will also include a wider analysis of the current and proposed metrics for measuring social machine activity, and how they contribute to understanding different classes of social machines. We also wish to explore the combination of metrics as a means to measure social machine activity.

7. ACKNOWLEDGEMENTS

Support for this study was provided by the SOCIAM: The Theory and Practice of Social Machines (EP/J017728/1) grant from the UK Engineering and Physical Sciences Research Council (EP[SRC]).

8. REFERENCES

[1] Aniello, L., Baldoni, R., and Querzoni, L. Adaptive online scheduling in Storm. In Proceedings of the 7th ACM International Conference on Distributed Event-Based Systems, DEBS '13, ACM (2013), 207–218.
[2] Artikis, A., Etzion, O., Feldman, Z., and Fournier, F. Event processing under uncertainty. In Proceedings of the 6th ACM International Conference on Distributed Event-Based Systems, DEBS '12, ACM (2012), 32–43.
[3] Brown, I. C., Hall, W., and Harris, L. Towards a taxonomy for web observatories. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion, WWW Companion '14, International World Wide Web Conferences Steering Committee (Republic and Canton of Geneva, Switzerland, 2014), 1067–1072.
[4] Burnap, P., Rana, O., Williams, M., Housley, W., Edwards, A., Morgan, J., Sloan, L., and Conejero, J. Cosmos: Towards an integrated and scalable service for analysing social media on demand. International Journal of Parallel, Emergent and Distributed Systems, ahead-of-print (2014), 1–21.
[5] Chen, J., Ramaswamy, L., and Lowenthal, D. Towards efficient event aggregation in a decentralized publish-subscribe system. In Proceedings of the Third ACM International Conference on Distributed Event-Based Systems, DEBS '09, ACM (New York, NY, USA, 2009).
[6] Chua, T.-S., Luan, H., Sun, M., and Yang, S. Next: Nus-tsinghua center for extreme search of user-generated [...]. MultiMedia, IEEE 19, 3 (July 2012), 81–87.
[7] Fawcett, T., and Provost, F. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '99, ACM (New York, NY, USA, 1999), 53–62.
[8] Frischbier, S., Margara, A., Freudenreich, T., Eugster, P., Eyers, D., and Pietzuch, P. Asia: Application-specific integrated aggregation for publish/subscribe middleware. In Proceedings of the Posters and Demo Track, Middleware '12, ACM (New York, NY, USA, 2012), 6:1–6:2.
[9] Hall, W., Tiropanis, T., Tinati, R., Wang, X., Luczak-Rosch, M., and Simperl, E. The web science observatory - the challenges of analytics over distributed linked data. ERCIM News, 96 (2014), 29–30.
[10] Heinze, T., Aniello, L., Querzoni, L., and Jerzak, Z. Cloud-based data stream processing. In Proceedings of the 8th ACM International Conference on Distributed Event-Based Systems, DEBS '14, ACM (New York, NY, USA, 2014), 238–245.
[11] Houston, P. Building distributed applications with message queuing middleware, 1998.
[12] Krishnamurthy, S., Wu, C., and Franklin, M. On-the-fly sharing for streamed aggregation. In Proceedings of the 2006 ACM SIGMOD International Conference on Management of [Data], SIGMOD '06, ACM (New York, NY, USA, 2006).
[13] Luckham, D. C. The Power of Events: An Introduction to Complex Event Processing in Distributed Enterprise [Systems]. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 2001.
[14] Maali, F., and Erickson, J. Data Catalog Vocabulary (DCAT), 2014.
[15] Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., and Hung Byers, A. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute.
[16] Pandey, N. K., Zhang, K., Weiss, S., Jacobsen, H.-A., and Vitenberg, R. Distributed event aggregation for content-based publish/subscribe systems. In Proceedings of the 8th ACM International Conference on Distributed Event-Based Systems, DEBS '14, ACM (New York, NY, USA, 2014), 95–106.
[17] Reuter, T., and Cimiano, P. Event-based classification of social media streams. In Proceedings of the 2nd ACM International Conference on Multimedia Retrieval, ICMR '12, ACM (New York, NY, USA, 2012), 22:1–22:8.
[18] Stonebraker, M., Çetintemel, U., and Zdonik, S. The 8 requirements of real-time stream processing. SIGMOD Rec., 4 (Dec. 2005), 42–47.
[19] Su, H., Rundensteiner, E. A., and Mani, M. Automaton in or out: Run-time plan optimization for XML stream processing. In Proceedings of the 2nd International Workshop on Scalable Stream Processing System, SSPS '08, ACM (New York, NY, USA, 2008), 38–47.
[20] Tiropanis, T., Wang, X., Tinati, R., and Hall, W. Building a connected web observatory: architecture and challenges. In 2nd International Workshop on Building Web Observatories (B-WOW14), ACM Web Science Conference 2014.
[21] Vinoski, S. Advanced message queuing protocol. Internet Computing 10, 6 (Nov. 2006), 87–89.
[22] Wang, W., Sharaf, M. A., Guo, S., and Özsu, M. T. Potential-driven load distribution for distributed data stream processing. In Proceedings of the 2nd International Workshop on Scalable Stream Processing System, SSPS '08, ACM (New York, NY, USA, 2008), 13–22.

[in]stances of SUWO can be federated by simply merging their DCAT [...]

4.2.2 Protecting resources using OAuth 2.0

Not all resources are open to the public. The SUWO API adopts OAuth 2.0 to provide comprehensive yet flexible protection for both public and private resources. As shown in Figure 5, the SUWO acts as a reverse proxy of registered resources. That is, all requests for access are controlled by the SUWO before they reach the resources. Using OAuth, users can authorise applications (either their own or third-party) to act on behalf of the users, and the authorised applications gain the same permissions as the users. To access resources, authorised applications first authenticate themselves against OAuth 2.0, and the SUWO verifies whether the applications, acting for the users who are viewing them, are allowed to access the resources.
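The access check that a reverse proxy of this kind performs can be sketched as follows. This is a minimal illustration under stated assumptions, not the SUWO implementation: the in-memory token store, the resource registry, the `Resource` class and the `authorise` function are all hypothetical names, and a real deployment would validate tokens against an OAuth 2.0 authorisation server rather than a dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """A registered resource (hypothetical model, not the SUWO schema)."""
    resource_id: str
    is_public: bool
    allowed_users: set = field(default_factory=set)


# Hypothetical stand-in for an OAuth 2.0 token store:
# access token -> user who authorised the application holding it.
TOKENS = {"token-abc": "alice", "token-bob": "bob"}

# Hypothetical stand-in for the registry of resources behind the proxy.
RESOURCES = {
    "dist_1": Resource("dist_1", is_public=True),
    "private_stream": Resource(
        "private_stream", is_public=False, allowed_users={"alice"}
    ),
}


def authorise(resource_id, bearer_token=None):
    """Return an HTTP-style status code for a request passing through the proxy."""
    resource = RESOURCES.get(resource_id)
    if resource is None:
        return 404  # unknown resource
    if resource.is_public:
        return 200  # public resources need no token
    user = TOKENS.get(bearer_token)
    if user is None:
        return 401  # missing or unknown token: requester is not authenticated
    if user not in resource.allowed_users:
        return 403  # authenticated, but the authorising user lacks permission
    return 200  # the application inherits the authorising user's permission
```

With these sample entries, `authorise("private_stream", "token-abc")` is allowed because the application acts on behalf of a permitted user, while the same request without a token is rejected before it reaches the resource.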
Figure 5: Requesters are users or applications initiating requests to access resources; SUWO authenticates requesters and authorises their requests; and resources are the datasets and analytics shared on the web observatory.

5. MEASURING SOCIAL MACHINES USING SUWO

In this section we address the purpose of designing a system capable of ingesting and republishing multiple Web streams in real-time, and discuss the types of metrics that provide an indication of a social machine's state of operation.

5.1 Measuring Social Machines

A critical question for a Web Observatory is what insights such a platform can provide, and how real-time analysis provides additional value. In response to this, we argue that in order to support, improve, or even re-engineer a system, it is critical that it is first understood. The fundamental purpose of a Web Observatory such as SUWO is to provide observational and analytical understanding of the current state of the Web at various levels of granularity and types of interaction. A Web Observatory becomes the intermediary tool to measure a social machine using various different metrics in order to understand its functionality and state of opera[tion].

In addition to this, real-time analysis of social machines turns the process of observation, analysis, and reporting into something far more responsive and reactive. Being able to monitor how a system is functioning in real-time offers the potential to react to sudden changes in types of activity, and from this, learning strategies can be built which can automatically react depending on the fulfilment of certain criteria. Therefore, it is important to define a set of macro and micro level measurements and metrics that can be used to measure different characteristics of a social machine. The following section provides a number of metrics that a real-time stream can provide and that we are using in the SUWO.

5.1.1 Social Machine Metrics

Described below are a number of metrics we are currently examining in order to characterise the activity of a social machine. Many of these measurements can be used in combination with each other to provide a more detailed view of a social machine's operation.

Macro-level system activity. A metric associated with the principal type of activity that a social machine is designed for. In many cases there are a number of measurements that can be considered as a primary indicator of activity. In such cases, these can all be used to represent system activity. For instance, a microblogging platform such as Twitter may be measured by the number of messages posted per second, as well as the number of re-shares between messages. Other classes of social machines may be measured by the number of tasks performed per second, or the number of active users in a given time period.

Community Engagement activity. Many systems involve the interaction between users, which may be communications, or some model of a friendship social network graph (e.g. Facebook, LinkedIn). The connectivity of this network can act as a measure of a social machine's structure, which can act as a proxy for the underlying community structure. For instance, measuring a social machine's social graph may reveal distinct communities of users operating in isolation from each other, which, depending on the purpose of the system (e.g. a community-crowdsourced platform such as Wikipedia), may reduce its performance or effectiveness.

User-Centric Activity. This can be considered a micro-level measurement as it represents a measure of activity based on a (potentially seeded) list of users of a system. This may monitor their frequency of activity (e.g. active session durations), or their communication patterns (e.g. who they interact with). Such measures may be useful for those monitoring a social machine that contains different types of users. For instance, in crowdsourcing platforms, these measures could be used to examine the speed at which new users engage with the community or gain the necessary skills to complete an activity.

Topic-Centric Activity. This type of metric measures emerging topics within a social machine. This measurement may involve studying the frequency of activity associated with a specific object, such as text, multimedia, or data. These measurements provide an indicator of emergent trending topics, which may be useful as an alert for initiating a stream integration process with other social machines under observation.

Content/Sentiment Measure. This provides a system-level view of the current content, topics, or sentiment of a social machine. This measurement is of particular relevance to communication-based social machines such as microblogging platforms, as it provides an indicator of the current topics of discussion, or the current sentiment of its user base.

Information Flow. The flow of information of a resource (e.g. text, image, video, URL) can be used as a measurement to understand various characteristics of a social machine. For instance, tracing the flow of a word, phrase, or hashtag in a user-generated message-based social machine (e.g. Twitter) could provide insight into the current topics of interest, and how diverse such a topic has become (topic virality).
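Two of the metrics above, macro-level system activity (messages per second) and topic-centric activity (hashtag frequency), can be sketched over a batch of stream records. This is an illustrative sketch only: the record field names `timestamp` and `text`, the sample records, and both function names are assumptions made for the example rather than the SUWO stream schema.

```python
from collections import Counter
from datetime import datetime

# Hypothetical records; the field names are assumed for illustration.
SAMPLE = [
    {"timestamp": "2015-05-18T10:00:00", "text": "#WebSci talk starting"},
    {"timestamp": "2015-05-18T10:00:01", "text": "Loving #websci!"},
    {"timestamp": "2015-05-18T10:00:04", "text": "#www2015 begins"},
]


def messages_per_second(records):
    """Macro-level activity: message count divided by the time span covered."""
    times = sorted(datetime.fromisoformat(r["timestamp"]) for r in records)
    if len(times) < 2:
        return float(len(times))
    span = (times[-1] - times[0]).total_seconds()
    return len(times) / span if span > 0 else float(len(times))


def hashtag_counts(records):
    """Topic-centric activity: how often each hashtag occurs (case-insensitive)."""
    counts = Counter()
    for record in records:
        for word in record.get("text", "").split():
            if word.startswith("#") and len(word) > 1:
                # Normalise case and strip trailing punctuation before counting.
                counts[word.lower().rstrip(".,!?")] += 1
    return counts


rate = messages_per_second(SAMPLE)  # 3 messages over 4 seconds
topics = hashtag_counts(SAMPLE)
```

On a live stream the same calculations would normally run over a sliding time window, so that a sudden rise in either value can serve as the kind of alert the metric descriptions mention.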
Figure 2: SUWO Real-Time Streaming Architecture

Figure 4: Mapping rules of a dcat:Distribution to its REST representation.

Accessing a resource requires two pieces of information: the access URL and its media type. The access URL gives the location where the resource is available; the media type indicates the protocol and procedure for accessing the resource. The media type should be standard if possible, or a definition of the media type should be provided. For example, the resource shown in Figure 4 is a SPARQL endpoint (indicated by the media type). Applications will then know that they can send SPARQL queries to the given URL. For AMQP streams we use a custom media type "application/amqp". Applications built on multiple streams can select resources of the type "application/amqp" and filter relevant ones based on the resource descriptions and keywords. It is also possible to combine static datasets and live streams in the same way.

Footnote: A comprehensive list of media types is available at //www.iana.org/assignments/media-types/media-
Footnote: http://www.w3.org/TR/rdf-sparql-XMLres/
Footnote: There is a media type for streams, "application/octet-stream". However, in our case the media type has to be specific enough to enable applications to automatically access the resource.

There are several advantages of the DCAT-to-REST approach:

Rich semantics: DCAT provides the information required to discover and access a resource. The mapping preserves all such information in the API.

Interoperability and automation: DCAT is an interoperable vocabulary, and HATEOAS enables applications to automatically traverse the whole API from a single start point. Combining both gives the opportunity of automatic resource discovery and retriev[al].

Reversible mapping: Because the API has all the semantics of DCAT documents, it is straightforward to construct the source DCAT documents from the API structure.

Easy federation: DCAT documents can be combined into bigger documents and mapped to a REST API. Therefore different in[stances ...]

4. WEB OBSERVATORY: REAL-TIME PROCESSING AND PUBLISHING

In this section we introduce the architecture, technologies and configuration used to process and republish real-time data streams in the SUWO. We then describe the Web Observatory API and the use of OAuth 2.0 as a security mechanism to provide data owners control over who has access to their data.

4.1 Harvesting Multiple Web Streams

Many Web services now provide programmatic access to platforms via APIs. Depending on the Web service, using the API can provide the complete collection of activities (the firehose option) in real-time. Many of the social media platforms offer full-to-limited access to their social data, which typically contains information regarding their members' communications, interactions and activity.

4.1.1 Architecture and Configuration

As shown in Figure 2, the first stage of the real-time processing pipeline involves connecting to various external APIs in order [...]. The Pre-Processing Stage involves the harvesting, enrichment and unification of various real-time streams from Web services. This process is achieved by using separate Web harvesters connected to the different external APIs, which are jointly controlled by the processing component. For each harvester, the external data streams are first collected, extracted from their own unique data schema, and then converted into the SUWO real-time stream JSON format. As part of the conversion into a uniform schema we perform a lightweight enrichment process in order to ensure consistency across streams. This enrichment process, performed in real-time, ensures that each record has, at a minimum, a timestamp (ISO 8601), a source identifier (e.g. 'twitter_public_api') and a unique identifier.

After the initial pre-processing stage, the enriched and unified data streams are processed in the Streaming Stage, which uses the Advanced Message Queuing Protocol (AMQP) [21] to restream the incoming data sources. In our real-time processing pipeline, AMQP acts as middleware for our publish-and-subscribe approach; it is a scalable technology designed for high-volume message processing and passing [11]. In order to take advantage of AMQP capabilities, we use RabbitMQ, an open source message broker and queueing service that supports advanced AMQP functionality, such as in-memory queueing and exchange-based publish-and-subscribe services.

Many of the Web sources being harvested produce high volume feeds. As a consequence of the rate at which incoming records are harvested and pre-processed, it is problematic to use a queueing approach to hold messages (in memory or on disk) until clients connect and pop them off the queue. An alternative approach, which is far more suited to handling multiple, high volume data streams, is to take advantage of RabbitMQ's exchange mechanism, which is effectively a publish-and-subscribe service that allows multiple clients to connect to a 'temporary' queue and retrieve the incoming messages. The advantage of this method is that it requires substantially fewer resources than a queueing approach, and that clients receive the data as soon as they connect. However, the disadvantage of this approach is that clients are offered no 'cache' when connecting to the exchange; thus, if there are no incoming messages, the stream may appear to be inactive. For each pre-processed stream we instantiate a single exchange, and for combined streams we use a recursive approach: we connect to the single-stream exchanges, perform additional processing, and then publish the result on a new exchange.

The final stage of the real-time stream processing workflow involves two components: internal consumption for archiving purposes, and an AMQP HTTP middleware that makes the exchanges available via the Web Observatory portal. Although discussed in more detail in Section 4.2, the middleware used to connect to the AMQP exchanges is a lightweight service that does not process data, but provides the necessary handshaking and layer of security for clients to connect to the exchanges.

4.1.2 Challenges

As described in Section 2, there are various challenges associated with processing real-time streams of heterogeneous, unstructured data. One of the core challenges faced in developing this solution involved being able to process the high volume of information in a timely fashion, which includes unifying, and in some cases enriching, the data streams in order to obtain a consistent and usable stream. Many of the enrichment sub-processes required calls to external services (e.g. geographic location), which potentially slow down processing time. In order to overcome this, an internal cache of recently looked-up resources is used to ensure that processing of streams is performed in a timely manner.

Multi-stream integration is also an on-going area of research. Essentially, the integration of streams is achieved by finding a common field or 'pattern' between streams, which in many cases is a topic, keyword, or simply a timestamp. For the SUWO real-time stream integration, we are working on dynamic methods of integrating streams. One approach involves monitoring the overall message rate of a given set of streams (e.g. posts per minute), and using fluctuations in stream volumes as an early indicator for combining streams undergoing similar changes.

4.2 Public Access - Web Observatory API

4.2.1 API Design

The SUWO maintains metadata of all resources, and those metadata are internally represented using the Data Catalog Vocabulary (DCAT) [14] (Figure 3). The SUWO API is built by mapping DCAT documents to a REST API in a way that preserves the semantics of DCAT. The SUWO

API follows the Hypermedia as the Engine of Application State (HATEOAS) constraint of REST, which enables applications to explore the whole API from a single entry point without referring to external documentation. A sample mapping of a DCAT distribution to the REST API is shown in Figure 4.

Figure 3: A simplified diagram demonstrating the three main classes of DCAT and their relationships. A complete diagram is given in [14].

Footnote: Strictly speaking, an API is not RESTful without HATEOAS.

[...] of which are instances of systems gathering and preparing dynamic web-based external data for the purposes of decision support and [...]. The second class includes systems (whilst still comprising many of the features of WO) that were typically originally conceived with other purposes, such as data repositories, data aggregation services or non-web research and analytic tools. Examples here include the COSMOS Observatory, the ePrints document repository, Tamr and the Datasift social network data platform.

The challenges around gathering, storing and analysing data at Web scale are vast when considered from a technical perspective: from integrating publishing technologies, to querying across multiple sources, to harmonising meta-data schemes and access methods. Adding to this the challenge of making such data/analyses accessible to non-technical users exacerbates the design challenges even further. The following are a few exemplars who have contributed variously to these challenges. The NeXT platform has developed methods to access extremely high volume image and micro-blog processing algorithms offering product/person/location based views. The SUWO portal has developed mechanisms to host data and analytics and perform database agnostic querying from historic data sources. The COSMOS platform enables drag-and-drop analytics of locally-based datasets for non-technical re[...]

2.2 Real-time Processing

Many of the key challenges faced in implementing real-time processing for Web Observatories are also situated in a wider community of research involved in real-time stream processing [18, 2]. As Heinze et al. [10] describe, there are several important aspects of (real-time) stream processing, including designing scalable solutions which are able to both partition and query data efficiently, and systems which have an amount of fault tolerance, being able to actively respond to changes in stream conditions. Research on these topics includes the optimisation of hardware and software processes to improve run-time efficiency [19, 1] and improve hardware load-balancing [22]. There are also substantial efforts involved in data integration and aggregation, developing approaches to data integration using different publish-and-subscribe paradigms [5, 8].

Our research is also situated in the area of event detection and processing [13], which overlaps with topics of stream processing such as data integration and aggregation [7, 12, 16] but has a particular focus on using these techniques in order to extract events [17]. The basis of our work draws upon the techniques described in the aforementioned literature in order to develop a solution which utilises a publish-and-subscribe approach in combination with the core principles of performing real-time stream processing as described by Stonebraker et al. [18].

3. SOUTHAMPTON WEB OBSERVATORY (SUWO)

The Southampton Web Observatory (SUWO) engages user communities with dataset and analytic resources via the SUWO portal. In general, the SUWO portal provides access to the following types of resources: (1) Datasets: these can be historical or real-time, quantitative or qualitative, and are heterogeneous in content. (2) Applications: these represent analytical applications, often supported via visualisations, which are linked back to the datasets listed on the Web Observatory portal. (3) Tools: analytical toolkits which provide analytical methods for datasets listed (though listing is not a requirement) on the portal.

Footnote: http://twitter,eprints.soton.ac.uk

3.1 Architectural Principles

Four fundamental principles were considered in the design of the SUWO; a summarised account of these is listed below:

Not all datasets or applications can be public. Access to some datasets needs to be restricted for licensing, privacy or other reasons. The Web Observatory allows its users to list or host datasets that are public or private. Access to private datasets is managed by the user who hosts them on the Web Observatory. Since access to datasets can be restricted, access to applications that make use of those datasets needs to be restricted as well.

Web Observatories list two main types of resources: datasets and analytic applications, including visualisations. The link between a listed analytic application and the datasets that it uses must always be made explicit, even if the used datasets are listed as private, with restricted access.

Not all listed resources need to be locally hosted. Listed datasets or analytic applications can be hosted in remote servers managed by third parties.

Metadata describing the listed resources and projects are pub[...]. This way, descriptions of resources can be harvested and listed in other Web Observatories or Web-based resources.

Based on these four principles and the requirements described above, the design of the SUWO architecture aims to abstract the types of features and requirements into three sets of sub-components, as shown in Figure 1. Each of the separate components contains its own individual architecture, workflows, and associated technologies. They interface with each other using open standards and protocols to ensure the highest level of modularity and re-configuration. Each sub-component functions autonomously, and thus can potentially interact with other Web Observatories, as described in [20]. For this paper we focus specifically on the real-time processing technology and workflows used within the dataset component, and the construction of the Web Observatory API.

Figure 1: Architecture of the Southampton University Web Observatory

A Streaming Real-Time Web Observatory Architecture for Monitoring the Health of Social Machines

Ramine Tinati, Xin Wang, Ian Brown, Thanassis Tiropanis, Wendy Hall
University of Southampton, Web and Internet Science
{r.tinati,x.wang,ian.brown.tt2,wh}@ecs.soton.ac.uk

ABSTRACT

Over the past years, streaming Web services have become popular, with many of the top Web platforms now offering near real-time streams of user and machine activity. In light of this, Web Observatories are now faced with the challenge of being able to process and republish real-time, big data Web streams, whilst maintaining access control and data consistency. In this paper we describe the architecture used in the Southampton Web Observatory to harvest, process, and serve real-time Web streams.

Categories and Subject Descriptors

H.3.5 [INFORMATION STORAGE AND RETRIEVAL]: On-line Information Services - Data sharing

Keywords

Web Observatory; Real-time Processing; Data Security

1. INTRODUCTION

As the Web grows, the number of humans interacting with the Web and the types of social machines are increasing, as are the devices used to engage with them. As a consequence of this changing technological and social landscape, there is a growing trend towards social machines emitting real-time streams of system and user activity. These real-time streams, which can be considered as big data due to their size, speed, and unstructured nature [15], offer Web Observatories rich, timely resources for observation and analysis. Individually, these feeds provide a resource to measure the current state (or health) of a social machine, and combined, they have the potential to provide a collective pulse of the Web.

However, working with real-time big data sources introduces a number of new challenges for Web Observatory platforms which were initially built with technologies designed to process curated and structured resources. Many of these challenges are a result of the characteristics of big data, which is often large scale and high volume, and is not published in a standardised, (common) structured fashion. Therefore, the challenge lies in being able to process these sources in a timely and computationally efficient manner for both the publisher and subscriber. Aware of these challenges, the Web Observatory builders have already conducted substantial research into techniques for harvesting, processing, archiving, querying, and visualisation of data [9, 4, 6].

In addition to the challenges of processing real-time data, a Web Observatory introduces another layer of complexity: data control and sharing. A fundamental component of a Web Observatory is for data owners to control who has access to their data, and where their data is used. Therefore, Web Observatories need to extend the typical data processing pipeline of data collection-to-data visualisation, and append to this a layer of data sharing, security and access control for the resources that it contains.

In this paper we describe the Southampton Web Observatory's approach to engineering a real-time streaming Web Observatory architecture to deliver resources via a custom developed API, which utilises common OAuth 2.0 protocols in order to provide access control to data owners. We also describe the type of metrics that real-time feeds can provide to monitor and measure the health of a social machine.

The remainder of this paper is structured as follows. We first discuss the current state-of-the-art in terms of Web Observatories and real-time processing. We then introduce the Web Observatory concept and reference architecture, with particular focus on the streaming and API layers. We then describe our approach to harvesting and publishing via APIs. Finally, we consider a number of scenarios for monitoring the health of social machines, and discuss the future areas of research still required.

Copyright is held by the International World Wide Web Conference Committee (IW3C2). IW3C2 reserves the right to provide a hyperlink to the author's site if the Material is used in electronic media. WWW 2015 Companion, May 18–22, 2015, Florence, Italy. ACM 978-1-4503-3473-0/15/05. http://dx.doi.org/10.1145/2740908.2743977.

2. RELATED WORK

In this section we describe existing and on-going research relevant to the development of Web Observatories and processing real-time Web data.

2.1 Web Observatory Development

The current body of Web Observatories comprises two main groups: systems conceived and developed specifically as Web Observatories [9], and those converging on Web Observatory status from other starting positions (such as analytics platforms or data repositories) [6, 4]. It should be noted that we do not include/exclude based on the specific designation "Observatory" for obvious reasons but focus rather on the functionality exhibited by the system (based on our developing taxonomy of Observatories [3]).

Examples of the first class include the Southampton Web Observatory (SUWO) [9], the Singapore NUS NeXT Social Observatory [6], the Recorded Future Web Intelligence platform (footnote 1) and Quid (footnote 2) [...]

Footnote 1: https://www.recordedfuture.com/
Footnote 2: http://quid.com/