IT代写 data mining algorithm Anomaly Detection in Evolving Data Streams – cscodehelp代写

Anomaly Detection in Evolving Data Streams
COMP90073 Security Analytics
, CIS Semester 2, 2021

Outline
• Introductiontodatastreams • Windowingtechniques
• HS-Trees
• IncrementalLOF(iLOF)
– Memory-efficientiLOF(MiLOF)
COMP90073 Security Analytics © University of Melbourne 2021

Data Streams
Data stream is a sequence of data points, which is continues, unbounded, and nonstationary.
• StreamliningAnalysis
– Largevolumeofdata
– Short/real-timeresponse
– Limitedmemory
– Energy/communicationconstraints
COMP90073 Security Analytics © University of Melbourne 2021

Batch Learning vs. Incremental Learning
Batch Leaning: Data points are stored until they can be analysed at the end of a monitoring period. Batch learning methods
• Canbecomputationallyefficient
• Theiraccuracyisheavilydependentonagoodchoiceforthetrainingperiod
and the quality of the training data
• Cannotbeappliedinstreamingenvironments,wherethemeasurementsarrive as a continuous stream of data
• Cannotbeusedinresourceconstraintdevicestobufferallthemeasurements
• Cannotidentifyanomalouspointsastheyoccur
• Cannotadapttochangesintheenvironment(e.g.,drift)
Incremental Learning: Data points are (usually) analysed once and there is no need to buffer the data. Incremental methods
• Startwithasetofinitialparametersfortheselectedmodelandtheybecomes more accurate as more data arrives
COMP90073 Security Analytics © University of Melbourne 2021

Different Windowing Techniques for Data Streams
• Landmarkwindows:Afixedpointinthedata stream is defined as a landmark and processing is done over data points between the landmark and the present data point.
• Dampedwindows:Aweightisassignedtoeach data point in such a way that the old data points are given smaller weights. Therefore, the most recent data points are used with higher weights.
COMP90073 Security Analytics © University of Melbourne 2021

Different Windowing Techniques for Data Streams
• Slidingwindows:Aslidingwindowsizewis considered in this technique. It processes the last w data points in the stream, while older data points are discarded.
• Adaptivewindows:Thewindowsizewwould change as the data stream evolves. In this technique, the more the data points evolve, the smaller w becomes. In contrast, if data points remain constant, the value of w increases.
COMP90073 Security Analytics © University of Melbourne 2021

Streaming Half Space (HS)-Trees [1]
A fast one-class anomaly detector for evolving data streams. • Arandomtreemodel
• Buildstreestructurewithoutdata
• Detectsanomaliesinonepass
• Adaptstodistributionchangesbyregularmodelupdates • Updates model in constant time 𝑂 𝑡(h + 𝜓)
• Requires constant amount of memory 𝑂 𝑡2!
𝑡: number of trees, h: depth of tree, 𝜓: window size
COMP90073 Security Analytics © University of Melbourne 2021

Half Space-Tree
An HS-Tree is a full binary tree, which all leaves are at the same depth h.
• Randomlyselectanattributed
• Bisectsthespaceintotwohalf-spaces,usingthemid-pointofd(assumethat attributes’ ranges are normalised to [0, 1])
• Continueexpansionuntilthemaximumdepthhofallnodesisreached. • Employs mass as a measure to rank anomalies.
COMP90073 Security Analytics © University of Melbourne 2021

Separating the Regions: HS-Trees
COMP90073 Security Analytics © University of Melbourne 2021

Ranking by Mass
Anomaly
Normal
COMP90073 Security Analytics © University of Melbourne 2021

Stream Processing
• Dividedatastreamintofixed-sizewindows:W1,W2,…,Wn
• Eachwindowisafixednumberofsequenceddatainstances • Initial Learning: Train model M1 using instances in W1
• SubsequentLearningandAnomalyScoring
For each window Wk (where k >1)
Train model Mk using instances in Wk Test instances in Wk using model Mk-1
Next window
COMP90073 Security Analytics © University of Melbourne 2021

Count-based Windows
• Letwindowsize=3
• Initialstage
• W1:referencewindow
– TrainHS-Treesandupdatemassr
COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example
A
Mass profile of reference window
Mass profile of latest window
C
DE
FG
B
COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example
Train using x1
B
C
DE
FG
A
COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example
Train using x2
B
C
DE
FG
A
COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example
Train using x3
B
C
DE
FG
A
COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example
Training is complete
B
C
DE
FG
A
COMP90073 Security Analytics © University of Melbourne 2021

Count-based Windows
• Letwindowsize=3
• Initialstage
• W1:referencewindow
– TrainHS-Treesandupdatemassr • W2:latestwindow
– InstancesinW2fortrainingHS-Trees(massl) – InstancesinW2fortestingHS-Trees(massr)
COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example
Train/Test using x4
B
C
DE
FG
A
COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example
Train/Test using x5
B
C
DE
FG
A
COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example
Train/Test using x6
B
C
A
DE
X6
FG
COMP90073 Security Analytics © University of Melbourne 2021

Count-based Windows
When all instances in W2 are processed
• Modelupdateoccurs
• W2becomesthenewreferencewindow
– Transferallmasslvaluestomassrvalues
– Resetallmasslvaluestozero • W3becomesthelatestwindow
– InstancesinW3fortrainingHS-Trees(massl)
– InstancesinW3fortestingHS-Trees(massr) And so on…
COMP90073 Security Analytics © University of Melbourne 2021

Model Update
COMP90073 Security Analytics © University of Melbourne 2021

Model Update
COMP90073 Security Analytics © University of Melbourne 2021

Anomaly Score in HS-Tree
• The final score for 𝑥 is the sum of scores obtained from each HS-Tree in the ensemble
𝑎𝑛𝑜𝑚𝑎𝑙𝑦 𝑠𝑐𝑜𝑟𝑒 𝑥 = 5 𝑆𝑐𝑜𝑟𝑒(𝑥, 𝑡%) “∈$
𝑆𝑐𝑜𝑟𝑒 𝑥, 𝑡% = 𝑁𝑜𝑑𝑒&×2′()*!
r value Depth of at node node
COMP90073 Security Analytics © University of Melbourne 2021

LOF – Revision
Advantages of LOF for anomaly detection:
• Detectsanomaliesregardlessthedatadistributionofnormalbehaviour,sinceit does not make any assumptions about the distributions of data records.
• Detectsanomalieswithrespecttodensityoftheirneighbouringdatarecords; not to the global model.
• DirectlyapplyingLOFtodatastreamswouldbeextremelycomputationally inefficient and/or very often may lead to incorrect prediction results.
COMP90073 Security Analytics © University of Melbourne 2021

Extending LOF to Data Streams
(i) Periodic LOF. Apply LOF algorithm on the entire data set periodically (e.g., after every data block of 1000 data records) or after all the data records are inserted.
• Themajorproblemofthisapproachisinabilitytodetectanomaliesrelatedto the beginning of new behaviour that initially appear within the inserted block.
COMP90073 Security Analytics © University of Melbourne 2021

Extending LOF to Data Streams
(ii) Iterated LOF: Re-apply the static LOF algorithm every time a new data record is inserted into the dataset.
• ThisstaticLOFalgorithmdoesnotsufferfromthepreviousproblems,butis extremely computationally expensive.
• Increases LOF’s time complexity to 𝑂(𝑛+ log 𝑛), where 𝑛 is the current number of data records in the data set
COMP90073 Security Analytics © University of Melbourne 2021

Incremental LOF (iLOF) [2]
Objectives:
• Theresultoftheincrementalalgorithmmustbeequivalenttotheresultofthe “batch”.
• TimecomplexityofincrementalLOFalgorithmhastobecomparabletothe static LOF algorithm 𝑂(𝑛 log 𝑛).
Step 1 – Insertion:
• Insertionofnewrecord,computek-dist,reachdist,IrdandLOFvaluesofa
new point
• Maintenance,updatek-dist,reachdist,IrdandLOFvaluesforaffectedexisting
points.
Step 2 – Deletion: Delete certain data records (e.g., due to their obsoleteness).
• Maintenance,updatek-dist,reachdist,IrdandLOFvaluesforaffectedexisting points.
COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 1 (Maintenance)
• Updatingk-dist:Insertionofthepointnmaydecreasethek-distanceofcertain neighbouring points, and it can happen only to those points that have the new point n in their k-neighbourhood (e.g., 2-neighbourhood).
• ReverseNearestNeighbour(RNN):Findalltheobjectsforwhichthenew point n is their (k-)nearest neighbour.
kNN
RNN
COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 1 (Maintenance)
• Updatingreachdistk:Whenk-distance(p)changesforapointp, reachdistk(o,p) will be affected only for points o that are in k-neighbourhood of the point p.
COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 1 (Maintenance)
• Updatinglrd:Irdvalueofapointpisaffectedif:
– Thek-neighbourhoodofthepointpchanges,
– Reachdist from point p to one of its k-neighbours changes.
COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 1 (Maintenance)
• UpdatingLOFValues:LOFvaluesofanexistingpointpshouldbeupdatedif – Ird(p)isupdated,or
– Ird(p)ofoneofitsk-neighbourspchanges
COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 2 (Deletion)
• Updating k-dist: The deletion of each record pc from the dataset influences the k-distances of its RNN.
– k-neighbourhood increases for each data record pj that is in reverse k- nearest neighbourhood of pc. The new k-distance for pj becomes equal to its distance to its new k-th nearest neighbour.
pz
py
px
COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 2 (Deletion)
• Updating k-dist: The deletion of each record pc from the dataset influences the k-distances of its RNN.
– k-neighbourhood increases for each data record pj that is in reverse k- nearest neighbourhood of pc. The new k-distance for pj becomes equal to its distance to its new k-th nearest neighbour.
pz
COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 2 (Deletion)
• Updatingreachdist:Thereachabilitydistancesfrompj’snearestneighbours need to be updated.
• Updatinglrd:Irdvalueneedstobeupdatedfor
– Allpointspj,whichk-distanceisupdated.
– Allpointspi, whichisink-NNofpj andpj isink-NNofpi.
• UpdatingLOFValues:LOFvalueisupdatedfor – Allpointspj,whichIrdvalueisupdated
– All points pi, which is in RNN of pj
COMP90073 Security Analytics © University of Melbourne 2021

Shortcomings of iLOF
Deleting past data points due to memory limitations causes two problems: • Differentiationbetweennewandoldevents
• Theaccuracywilldropbydeletingthehistory
COMP90073 Security Analytics © University of Melbourne 2021

Shortcomings of iLOF
Deleting past data points due to memory limitations causes two problems: • Differentiationbetweennewandoldevents
• Theaccuracywilldropbydeletingthehistory
COMP90073 Security Analytics © University of Melbourne 2021

Shortcomings of iLOF
Deleting past data points due to memory limitations causes two problems: • Differentiationbetweennewandoldevents
• Theaccuracywilldropbydeletingthehistory
COMP90073 Security Analytics © University of Melbourne 2021

Shortcomings of iLOF
Deleting past data points due to memory limitations causes two problems: • Differentiationbetweennewandoldevents
• Theaccuracywilldropbydeletingthehistory
COMP90073 Security Analytics © University of Melbourne 2021

Memory Efficient Incremental Local Outlier (MiLOF) Detection [3]
Objective: Assign an LOF value to a point 𝑝”, under the constraint that the available memory stores only a fraction 𝑚 ≪ 𝑛 of the 𝑛 points that have been observed up to time 𝑇.
• Needtochooseastrategytosummarizethepreviousdatapointssothatthe LOF values of new points can be calculated.
MiLOF Phases:
• Summarization
• Merging
• RevisedInsertion
COMP90073 Security Analytics © University of Melbourne 2021

Three Phases of MiLOF – Framework
Compute outlier score of new point using the latest data points and latest merged clustering model
COMP90073 Security Analytics © University of Melbourne 2021

Three Phases of MiLOF
Phase 1 – Summarization:
Build a summary over the past data points along with their corresponding values (k-dist, lrd and LOF), and deleting them from memory.
• Everybucketdatapointsaresummarizedandclustercentresaregenerated using k-means clustering
Notations:
• 𝐶:pointsarrivingattime𝑇
• Partition𝐶into𝑚clusters𝐶={𝐶,∪𝐶+∪⋯∪𝐶-},withclustercentres𝑉= {𝑣,,𝑣+,…,𝑣-}
COMP90073 Security Analytics © University of Melbourne 2021

MiLOF Measures
• k-dist of a cluster centre 𝑣% ∈ 𝑉
𝑘𝑑𝑖𝑠𝑡(𝑣%) = ∑.∈/” 𝑘𝑑𝑖𝑠𝑡 (𝑝)
|𝐶%|
• lrd of a cluster centre 𝑣% ∈ 𝑉
𝑙𝑟𝑑0(𝑣%) = ∑.∈/” 𝑙𝑟𝑑0(𝑝)
|𝐶%|
• LOF of a cluster centre 𝑣% ∈ 𝑉
𝐿𝑂𝐹0(𝑣%) = ∑.∈/” 𝐿𝑂𝐹0(𝑝) |𝐶%|
Number of points in 𝐶#
COMP90073 Security Analytics © University of Melbourne 2021

Three Phases of MiLOF
Phase 2 – Merging:
Merge the clusters with existing clusters to maintain a single set of cluster centres by the anomaly detection framework after each step.
• Usingaweightedclusteringalgorithm(weightedk-means)andclusterthe cluster centres
• Clustercentre’sweightisequaltothenumberofdatapointsinthatcluster Phase 3 – Revised Insertion:
• ComputeLOFvalueofthenewincomingdata point p, w.r.t. both the recent data points and cluster centres.
– If a cluster centre is the ith NN of p, we stop searching for the rest of the nearest neighbours.
• Updatethe𝑘𝑑𝑖𝑠𝑡,reachdist,lrdandLOFvaluesfor the existing data points
COMP90073 Security Analytics © University of Melbourne 2021

Summary
• Whataredifferentwindowingtechniquesfordatastreams?
• Howtoapplytreebasedanomalydetectionmethodstodatastreams?
• HowtoextendLOFforincrementallearningwhilemaintainingitsperformance?
Next: Anomaly Detection Using Support Vector Machine
COMP90073 Security Analytics © University of Melbourne 2021

References
1. Tan, Ting, Liu, “Fast Anomaly Detection for Streaming Data”, International Joint Conference on Artificial Intelligence (IJCAI), 2011
– https://github.com/yli96/HSTree
2. , , Latecki, “Incremental Local Outlier Detection for Data Streams”, IEEE Symposium on Computational Intelligence and Data Mining, 2007
3. , , . Bezdek, , , “Fast Memory Efficient Local Outlier Detection in Data Streams”, IEEE Transactions on Knowledge and Data Engineering (TKDE), 2016
COMP90073 Security Analytics © University of Melbourne 2021

Leave a Reply

Your email address will not be published. Required fields are marked *