# IT代写 data mining algorithm Anomaly Detection in Evolving Data Streams – cscodehelp代写

Anomaly Detection in Evolving Data Streams

COMP90073 Security Analytics

, CIS Semester 2, 2021

Outline

• Introductiontodatastreams • Windowingtechniques

• HS-Trees

• IncrementalLOF(iLOF)

– Memory-efficientiLOF(MiLOF)

COMP90073 Security Analytics © University of Melbourne 2021

Data Streams

Data stream is a sequence of data points, which is continues, unbounded, and nonstationary.

• StreamliningAnalysis

– Largevolumeofdata

– Short/real-timeresponse

– Limitedmemory

– Energy/communicationconstraints

COMP90073 Security Analytics © University of Melbourne 2021

Batch Learning vs. Incremental Learning

Batch Leaning: Data points are stored until they can be analysed at the end of a monitoring period. Batch learning methods

• Canbecomputationallyefficient

• Theiraccuracyisheavilydependentonagoodchoiceforthetrainingperiod

and the quality of the training data

• Cannotbeappliedinstreamingenvironments,wherethemeasurementsarrive as a continuous stream of data

• Cannotbeusedinresourceconstraintdevicestobufferallthemeasurements

• Cannotidentifyanomalouspointsastheyoccur

• Cannotadapttochangesintheenvironment(e.g.,drift)

Incremental Learning: Data points are (usually) analysed once and there is no need to buffer the data. Incremental methods

• Startwithasetofinitialparametersfortheselectedmodelandtheybecomes more accurate as more data arrives

COMP90073 Security Analytics © University of Melbourne 2021

Different Windowing Techniques for Data Streams

• Landmarkwindows:Afixedpointinthedata stream is defined as a landmark and processing is done over data points between the landmark and the present data point.

• Dampedwindows:Aweightisassignedtoeach data point in such a way that the old data points are given smaller weights. Therefore, the most recent data points are used with higher weights.

COMP90073 Security Analytics © University of Melbourne 2021

Different Windowing Techniques for Data Streams

• Slidingwindows:Aslidingwindowsizewis considered in this technique. It processes the last w data points in the stream, while older data points are discarded.

• Adaptivewindows:Thewindowsizewwould change as the data stream evolves. In this technique, the more the data points evolve, the smaller w becomes. In contrast, if data points remain constant, the value of w increases.

COMP90073 Security Analytics © University of Melbourne 2021

Streaming Half Space (HS)-Trees [1]

A fast one-class anomaly detector for evolving data streams. • Arandomtreemodel

• Buildstreestructurewithoutdata

• Detectsanomaliesinonepass

• Adaptstodistributionchangesbyregularmodelupdates • Updates model in constant time 𝑂 𝑡(h + 𝜓)

• Requires constant amount of memory 𝑂 𝑡2!

𝑡: number of trees, h: depth of tree, 𝜓: window size

COMP90073 Security Analytics © University of Melbourne 2021

Half Space-Tree

An HS-Tree is a full binary tree, which all leaves are at the same depth h.

• Randomlyselectanattributed

• Bisectsthespaceintotwohalf-spaces,usingthemid-pointofd(assumethat attributes’ ranges are normalised to [0, 1])

• Continueexpansionuntilthemaximumdepthhofallnodesisreached. • Employs mass as a measure to rank anomalies.

COMP90073 Security Analytics © University of Melbourne 2021

Separating the Regions: HS-Trees

COMP90073 Security Analytics © University of Melbourne 2021

Ranking by Mass

Anomaly

Normal

COMP90073 Security Analytics © University of Melbourne 2021

Stream Processing

• Dividedatastreamintofixed-sizewindows:W1,W2,…,Wn

• Eachwindowisafixednumberofsequenceddatainstances • Initial Learning: Train model M1 using instances in W1

• SubsequentLearningandAnomalyScoring

For each window Wk (where k >1)

Train model Mk using instances in Wk Test instances in Wk using model Mk-1

Next window

COMP90073 Security Analytics © University of Melbourne 2021

Count-based Windows

• Letwindowsize=3

• Initialstage

• W1:referencewindow

– TrainHS-Treesandupdatemassr

COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example

A

Mass profile of reference window

Mass profile of latest window

C

DE

FG

B

COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example

Train using x1

B

C

DE

FG

A

COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example

Train using x2

B

C

DE

FG

A

COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example

Train using x3

B

C

DE

FG

A

COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example

Training is complete

B

C

DE

FG

A

COMP90073 Security Analytics © University of Melbourne 2021

Count-based Windows

• Letwindowsize=3

• Initialstage

• W1:referencewindow

– TrainHS-Treesandupdatemassr • W2:latestwindow

– InstancesinW2fortrainingHS-Trees(massl) – InstancesinW2fortestingHS-Trees(massr)

COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example

Train/Test using x4

B

C

DE

FG

A

COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example

Train/Test using x5

B

C

DE

FG

A

COMP90073 Security Analytics © University of Melbourne 2021

Streaming HS-Trees – Example

Train/Test using x6

B

C

A

DE

X6

FG

COMP90073 Security Analytics © University of Melbourne 2021

Count-based Windows

When all instances in W2 are processed

• Modelupdateoccurs

• W2becomesthenewreferencewindow

– Transferallmasslvaluestomassrvalues

– Resetallmasslvaluestozero • W3becomesthelatestwindow

– InstancesinW3fortrainingHS-Trees(massl)

– InstancesinW3fortestingHS-Trees(massr) And so on…

COMP90073 Security Analytics © University of Melbourne 2021

Model Update

COMP90073 Security Analytics © University of Melbourne 2021

Model Update

COMP90073 Security Analytics © University of Melbourne 2021

Anomaly Score in HS-Tree

• The final score for 𝑥 is the sum of scores obtained from each HS-Tree in the ensemble

𝑎𝑛𝑜𝑚𝑎𝑙𝑦 𝑠𝑐𝑜𝑟𝑒 𝑥 = 5 𝑆𝑐𝑜𝑟𝑒(𝑥, 𝑡%) “∈$

𝑆𝑐𝑜𝑟𝑒 𝑥, 𝑡% = 𝑁𝑜𝑑𝑒&×2′()*!

r value Depth of at node node

COMP90073 Security Analytics © University of Melbourne 2021

LOF – Revision

Advantages of LOF for anomaly detection:

• Detectsanomaliesregardlessthedatadistributionofnormalbehaviour,sinceit does not make any assumptions about the distributions of data records.

• Detectsanomalieswithrespecttodensityoftheirneighbouringdatarecords; not to the global model.

• DirectlyapplyingLOFtodatastreamswouldbeextremelycomputationally inefficient and/or very often may lead to incorrect prediction results.

COMP90073 Security Analytics © University of Melbourne 2021

Extending LOF to Data Streams

(i) Periodic LOF. Apply LOF algorithm on the entire data set periodically (e.g., after every data block of 1000 data records) or after all the data records are inserted.

• Themajorproblemofthisapproachisinabilitytodetectanomaliesrelatedto the beginning of new behaviour that initially appear within the inserted block.

COMP90073 Security Analytics © University of Melbourne 2021

Extending LOF to Data Streams

(ii) Iterated LOF: Re-apply the static LOF algorithm every time a new data record is inserted into the dataset.

• ThisstaticLOFalgorithmdoesnotsufferfromthepreviousproblems,butis extremely computationally expensive.

• Increases LOF’s time complexity to 𝑂(𝑛+ log 𝑛), where 𝑛 is the current number of data records in the data set

COMP90073 Security Analytics © University of Melbourne 2021

Incremental LOF (iLOF) [2]

Objectives:

• Theresultoftheincrementalalgorithmmustbeequivalenttotheresultofthe “batch”.

• TimecomplexityofincrementalLOFalgorithmhastobecomparabletothe static LOF algorithm 𝑂(𝑛 log 𝑛).

Step 1 – Insertion:

• Insertionofnewrecord,computek-dist,reachdist,IrdandLOFvaluesofa

new point

• Maintenance,updatek-dist,reachdist,IrdandLOFvaluesforaffectedexisting

points.

Step 2 – Deletion: Delete certain data records (e.g., due to their obsoleteness).

• Maintenance,updatek-dist,reachdist,IrdandLOFvaluesforaffectedexisting points.

COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 1 (Maintenance)

• Updatingk-dist:Insertionofthepointnmaydecreasethek-distanceofcertain neighbouring points, and it can happen only to those points that have the new point n in their k-neighbourhood (e.g., 2-neighbourhood).

• ReverseNearestNeighbour(RNN):Findalltheobjectsforwhichthenew point n is their (k-)nearest neighbour.

kNN

RNN

COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 1 (Maintenance)

• Updatingreachdistk:Whenk-distance(p)changesforapointp, reachdistk(o,p) will be affected only for points o that are in k-neighbourhood of the point p.

COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 1 (Maintenance)

• Updatinglrd:Irdvalueofapointpisaffectedif:

– Thek-neighbourhoodofthepointpchanges,

– Reachdist from point p to one of its k-neighbours changes.

COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 1 (Maintenance)

• UpdatingLOFValues:LOFvaluesofanexistingpointpshouldbeupdatedif – Ird(p)isupdated,or

– Ird(p)ofoneofitsk-neighbourspchanges

COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 2 (Deletion)

• Updating k-dist: The deletion of each record pc from the dataset influences the k-distances of its RNN.

– k-neighbourhood increases for each data record pj that is in reverse k- nearest neighbourhood of pc. The new k-distance for pj becomes equal to its distance to its new k-th nearest neighbour.

pz

py

px

COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 2 (Deletion)

• Updating k-dist: The deletion of each record pc from the dataset influences the k-distances of its RNN.

– k-neighbourhood increases for each data record pj that is in reverse k- nearest neighbourhood of pc. The new k-distance for pj becomes equal to its distance to its new k-th nearest neighbour.

pz

COMP90073 Security Analytics © University of Melbourne 2021

iLOF – Step 2 (Deletion)

• Updatingreachdist:Thereachabilitydistancesfrompj’snearestneighbours need to be updated.

• Updatinglrd:Irdvalueneedstobeupdatedfor

– Allpointspj,whichk-distanceisupdated.

– Allpointspi, whichisink-NNofpj andpj isink-NNofpi.

• UpdatingLOFValues:LOFvalueisupdatedfor – Allpointspj,whichIrdvalueisupdated

– All points pi, which is in RNN of pj

COMP90073 Security Analytics © University of Melbourne 2021

Shortcomings of iLOF

Deleting past data points due to memory limitations causes two problems: • Differentiationbetweennewandoldevents

• Theaccuracywilldropbydeletingthehistory

COMP90073 Security Analytics © University of Melbourne 2021

Shortcomings of iLOF

Deleting past data points due to memory limitations causes two problems: • Differentiationbetweennewandoldevents

• Theaccuracywilldropbydeletingthehistory

COMP90073 Security Analytics © University of Melbourne 2021

Shortcomings of iLOF

Deleting past data points due to memory limitations causes two problems: • Differentiationbetweennewandoldevents

• Theaccuracywilldropbydeletingthehistory

COMP90073 Security Analytics © University of Melbourne 2021

Deleting past data points due to memory limitations causes two problems: • Differentiationbetweennewandoldevents

• Theaccuracywilldropbydeletingthehistory

COMP90073 Security Analytics © University of Melbourne 2021

Memory Efficient Incremental Local Outlier (MiLOF) Detection [3]

Objective: Assign an LOF value to a point 𝑝”, under the constraint that the available memory stores only a fraction 𝑚 ≪ 𝑛 of the 𝑛 points that have been observed up to time 𝑇.

• Needtochooseastrategytosummarizethepreviousdatapointssothatthe LOF values of new points can be calculated.

MiLOF Phases:

• Summarization

• Merging

• RevisedInsertion

COMP90073 Security Analytics © University of Melbourne 2021

Three Phases of MiLOF – Framework

Compute outlier score of new point using the latest data points and latest merged clustering model

COMP90073 Security Analytics © University of Melbourne 2021

Three Phases of MiLOF

Phase 1 – Summarization:

Build a summary over the past data points along with their corresponding values (k-dist, lrd and LOF), and deleting them from memory.

• Everybucketdatapointsaresummarizedandclustercentresaregenerated using k-means clustering

Notations:

• 𝐶:pointsarrivingattime𝑇

• Partition𝐶into𝑚clusters𝐶={𝐶,∪𝐶+∪⋯∪𝐶-},withclustercentres𝑉= {𝑣,,𝑣+,…,𝑣-}

COMP90073 Security Analytics © University of Melbourne 2021

MiLOF Measures

• k-dist of a cluster centre 𝑣% ∈ 𝑉

𝑘𝑑𝑖𝑠𝑡(𝑣%) = ∑.∈/” 𝑘𝑑𝑖𝑠𝑡 (𝑝)

|𝐶%|

• lrd of a cluster centre 𝑣% ∈ 𝑉

𝑙𝑟𝑑0(𝑣%) = ∑.∈/” 𝑙𝑟𝑑0(𝑝)

|𝐶%|

• LOF of a cluster centre 𝑣% ∈ 𝑉

𝐿𝑂𝐹0(𝑣%) = ∑.∈/” 𝐿𝑂𝐹0(𝑝) |𝐶%|

Number of points in 𝐶#

COMP90073 Security Analytics © University of Melbourne 2021

Three Phases of MiLOF

Phase 2 – Merging:

Merge the clusters with existing clusters to maintain a single set of cluster centres by the anomaly detection framework after each step.

• Usingaweightedclusteringalgorithm(weightedk-means)andclusterthe cluster centres

• Clustercentre’sweightisequaltothenumberofdatapointsinthatcluster Phase 3 – Revised Insertion:

• ComputeLOFvalueofthenewincomingdata point p, w.r.t. both the recent data points and cluster centres.

– If a cluster centre is the ith NN of p, we stop searching for the rest of the nearest neighbours.

• Updatethe𝑘𝑑𝑖𝑠𝑡,reachdist,lrdandLOFvaluesfor the existing data points

COMP90073 Security Analytics © University of Melbourne 2021

Summary

• Whataredifferentwindowingtechniquesfordatastreams?

• Howtoapplytreebasedanomalydetectionmethodstodatastreams?

• HowtoextendLOFforincrementallearningwhilemaintainingitsperformance?

Next: Anomaly Detection Using Support Vector Machine

COMP90073 Security Analytics © University of Melbourne 2021

References

1. Tan, Ting, Liu, “Fast Anomaly Detection for Streaming Data”, International Joint Conference on Artificial Intelligence (IJCAI), 2011

– https://github.com/yli96/HSTree

2. , , Latecki, “Incremental Local Outlier Detection for Data Streams”, IEEE Symposium on Computational Intelligence and Data Mining, 2007

3. , , . Bezdek, , , “Fast Memory Efficient Local Outlier Detection in Data Streams”, IEEE Transactions on Knowledge and Data Engineering (TKDE), 2016

COMP90073 Security Analytics © University of Melbourne 2021