Suitability of Clustering Algorithms for Crime Hotspot Analysis
Divya G
SCT College of Engineering Pappanamcode
Trivandrum
Rejimol Robinson R R, Asst. Professor
SCT College of Engineering Pappanamcode, Trivandrum
Kalai Selvan, Joint Director, BCG, CDAC Trivandrum
Abstract- Crime analysis is a field that needs immediate attention due to the drastic increase in the number of crimes. Past experience of the police suggests that most crimes are concentrated in certain areas, called hotspots, and keep recurring there. Clustering algorithms are well suited to crime analysis, but the suitability of the broad spectrum of clustering algorithms for a given application is an issue that must be addressed. In this paper we evaluate three clustering algorithms, namely hierarchical clustering, k-means clustering and DBSCAN, with the intent of finding the one best suited to crime hotspot analysis. Each of the algorithms evaluated here requires inputs such as the number of clusters, the neighbour distance or the minimum number of points. The cluster similarity measure is the Euclidean distance. The results suggest that DBSCAN is the most suitable for crime hotspot analysis due to its inherently density-driven nature. Keywords: Hotspot Analysis, Clustering, Data Mining, Average Link Algorithm, K-means Algorithm, DBSCAN, Davies-Bouldin index.
I. INTRODUCTION
Crime analysis is gaining importance day by day as the number of crimes increases drastically. This is especially true of crimes against women, which are a prime focus for the police nowadays. Crime analytics is the process of gathering crime data and analysing recurring patterns in it. It aids the police in making decisions intelligently and planning defence mechanisms proactively. The biggest challenge in crime analytics is processing the large amount of historical data, and police personnel may have too busy a schedule to process the data themselves. If crime analytics is performed so that the data is processed, patterns are generated from it, and the patterns are displayed on a GIS-based map, police personnel can make decisions faster. They can take proactive actions such as assigning more force to places where the probability of crime occurrence is high. Hotspots are areas with a high concentration of some activity. Crime hotspots are concentrations of crime in a specific geographical area, recurring over time. Police experience shows that certain geographical areas, called crime generators, have a high probability of crime occurrence. Though there is no theoretical basis for this, the police can use knowledge of hotspots to reduce crime rates. Hotspots are areas with an imaginary boundary within which crime incidents recur.
Clustering is the process of grouping the elements of a dataset in such a way that the elements within a cluster are more similar to one another than to the elements of other clusters. Grouping happens by evaluating the elements of the dataset against some similarity metric. Clustering is a generic term for a class of algorithms used to extract similarity from a dataset. This data mining methodology may require a set of parameters to be specified, such as the number of clusters, the minimum radius of a cluster, or the minimum number of points around a data element needed to qualify as a cluster. Clustering is unsupervised learning: no training set is provided, and the algorithm learns the patterns by itself. The typical steps of clustering are pattern representation, definition of a data proximity measure appropriate to the data domain, grouping, abstraction, and assessment of output [1]. The objective of this paper is to evaluate the different clustering algorithms that can be used effectively for crime hotspot analysis.
This paper is structured as follows. Sect. II provides an overview of the types of clustering algorithms. Sect. III briefly describes hierarchical clustering and provides the algorithm for average-link clustering. Sect. IV presents the partitioning algorithm K-means, with the capability of K-means++. Sect. V describes DBSCAN, a density-based clustering algorithm, and assesses its suitability for clustering geographical data. An experiment on geographical data to detect crime hotspots is presented in Sect. VI. The evaluation of the three algorithms is provided in Sect. VII. In Sect. VIII, we summarize the evaluation and discuss future work.
II. CLUSTERING ALGORITHMS
A particular clustering algorithm cannot be correct or wrong; its suitability depends on the kind of application and dataset. As has been said, "clustering is in the eye of the beholder" [2]. The specific requirements of an application may dictate the use of a particular algorithm, and the suitability of algorithms for an application should be found experimentally. The range of available clustering algorithms is vast; references [3] and [4] survey the broad spectrum of them. Clustering can be hard, where a point belongs to one cluster or another but not both, or soft, where a point can belong to several clusters. Clustering algorithms can be broadly classified into hierarchical and partitioning. Hierarchical clustering builds a tree structure from the dataset, starting by grouping individual points and ending with a parent node combining all the points in the dataset. In effect, levels of elements of the dataset are formed, each
Divya G et al | International Journal of Computer Science Engineering and Technology( IJCSET) | July 2014 | Vol 4, Issue 7,231-234
www.ijcset.net 231
level grouping a subset of the elements. Partitioning clustering provides a distinct grouping of the elements of the dataset, where the elements are grouped into disjoint clusters. The simplest partitioning method is nearest-neighbour clustering, which aims to find the nearest neighbours of the given elements of the dataset [5]. K-nearest-neighbour clustering aims to form K clusters by taking each element and assigning it to the cluster at the shortest distance; the first cluster is formed by taking the first point. Though its simplicity gives KNN an edge over other methods, the decision about the value of K is critical, and the final clusters may not be as close to the actual ones. The K-means algorithm assumes K clusters and moves each element to the cluster with the closer mean. This iterative algorithm produces K clusters partitioning the elements of the dataset so that the points in a cluster are closer to its mean than to any other. Density-based clustering techniques look for areas of high density; sparse areas are ignored. DBSCAN [7] is a popular density-based clustering algorithm. It takes two parameters (a minimum distance and a minimum number of points) and generates the clusters automatically. OPTICS [8] is another density-based clustering algorithm; it does not require the minimum-distance parameter to be specified and produces a structure similar to hierarchical clustering.
III. HIERARCHICAL CLUSTERING
Hierarchical clustering considers each point a single cluster and merges clusters together until all the points are merged into one cluster. The merging of clusters is often represented as a dendrogram, a tree-like structure used to represent the hierarchy of cluster merges. Let the dataset consist of <x1, x2, ..., xn>, where n is the number of elements in the dataset. Each element xi can have many attributes, represented by <a1, a2, ..., am>, where m is the number of attributes of an element.
A. Algorithm
Input: <x1, x2, ..., xn>
Output: <Li>, <Cij>, where i is the level number and j is the number of clusters in each level.
1. Assign Ci = xi for all the elements of the dataset.
2. Calculate the adjacency matrix for all the clusters, Aij = d(xi, xj), where A is an n x n matrix and d(xi, xj) is the Euclidean distance.
3. Let the level l = 1.
4. Let p = n be the number of clusters at the current level.
5. Find the most similar clusters in the adjacency matrix, say Cu and Ct, using a similarity criterion.
6. Merge the clusters and set p = p - 1.
7. Perform steps 5 and 6 for all clusters at level l.
8. If the total number of clusters is 1, then stop.
9. If there are no more clusters to be merged at this level, then set l = l + 1, update Aij, and go to step 5.
The algorithm above provides the details of hierarchical clustering. Its result is the set of levels and the clusters at each level. The adjacency matrix is calculated by finding the distance between every two elements; the Euclidean distance, defined as the square root of the sum of squared differences of each attribute of the two elements, is used. Initially the individual elements of the dataset are all treated as clusters, so at level 1 there are as many clusters as there are elements in the dataset. Subsequently, clusters are merged using the similarity criterion and the adjacency matrix is updated. The similarity criterion for the distance between clusters can be based on three metrics: single link, complete link and average link. Single link takes the distance between two clusters to be the distance between their closest elements. Complete link takes the distance between the farthest elements of the two clusters. Average link calculates the criterion by averaging the distances between all pairs of elements drawn from the two clusters. The equation for average-link hierarchical clustering is given below:
D(M, N) = (1 / (m * n)) * Σi Σj d(x(i), y(j))    (1)
where M and N are two clusters, d(x(i), y(j)) is the distance between element i of M and element j of N, and m and n are the number of points in the two clusters.
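To make Eq. (1) concrete, here is a minimal pure-Python sketch of average-link agglomeration. The function and variable names are ours, not from the paper, and the loop is the naive formulation rather than an optimised one:

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared attribute differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_link(M, N):
    # Eq. (1): mean pairwise distance between elements of clusters M and N.
    return sum(euclidean(x, y) for x in M for y in N) / (len(M) * len(N))

def agglomerate(points):
    # Start with every point as its own cluster; record each level.
    clusters = [[p] for p in points]
    levels = [list(clusters)]
    while len(clusters) > 1:
        # Find the most similar pair of clusters under average link.
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: average_link(clusters[ab[0]], clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        levels.append(list(clusters))
    return levels

levels = agglomerate([(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)])
print(len(levels[-1]))  # the final level is a single cluster
```

With four points forming two tight pairs, the two pairs merge first, giving four levels with 4, 3, 2 and 1 clusters respectively.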
IV. PARTITIONAL CLUSTERING
Partitioning clustering groups the elements of the dataset into different clusters in such a way that each element belongs to only one cluster. The most popular and simplest partitioning algorithm is K-means. K-means assumes two parameters: k, the number of clusters, and the initial values of the centroids.
A. Algorithm
Input: <x1, x2, ..., xn>
Output: <Cj>, where Cj is the jth cluster.
1. Assume a value of k, the number of clusters.
2. For all k clusters, assume initial values for the centroids.
3. For each element xi, calculate the Euclidean distance from the element to the centroid of every cluster.
4. Move the element to the closest cluster.
5. Recalculate the centroids.
6. If there are no more changes, stop; else repeat steps 3, 4 and 5.
The output is k clusters, each containing a subset of the elements of the dataset. Starting from the first element, the closest cluster is found by calculating the Euclidean distance and the element is moved to it. The centroid of the cluster is then recomputed. This process continues and halts when there are no further changes to the clusters. K-means is a very efficient algorithm, but its efficiency is limited by two factors: the number of clusters and
the initial values of the centroids. The initial centroids can be calculated using the k-means++ algorithm [9], which chooses centroids using a weighted probability distribution. Another method for choosing the initial centroids is randomly selecting k values from the dataset. Yet another limiting factor is that all the elements are clustered: noise elements are not considered, and even a stand-alone point is grouped into some cluster.
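A compact pure-Python sketch of the iteration above with k-means++ seeding follows. This is our own illustrative code, not the authors' implementation; the dataset and seed are invented:

```python
import math
import random

def dist2(a, b):
    # Squared Euclidean distance between two points.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_pp_seeds(points, k, rng):
    # k-means++: first centroid uniform at random; each subsequent centroid
    # is drawn with probability proportional to the squared distance from
    # its nearest already-chosen centroid.
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(dist2(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights)[0])
    return centroids

def kmeans(points, k, seed=0):
    rng = random.Random(seed)
    centroids = kmeans_pp_seeds(points, k, rng)
    while True:
        # Assignment step: each point joins the cluster with the closest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist2(p, centroids[i]))].append(p)
        # Update step: recompute centroids; stop when nothing changes.
        new = [
            tuple(sum(col) / len(c) for col in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new == centroids:
            return clusters, centroids
        centroids = new

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
clusters, cents = kmeans(pts, k=2)
```

On this toy dataset the two well-separated pairs end up in separate clusters regardless of which point the seeding picks first.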
V. DENSITY BASED CLUSTERING
Density-based clustering methods aim to build clusters in areas of higher density than the remainder of the dataset; sparse areas are considered noise and are ignored. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the most popular density-based algorithm. It needs two parameters: eps, the maximum distance at which a point is considered a neighbour of a cluster, and minPoints, the minimum number of neighbouring points required to form a cluster. The concept of reaching a point within this distance from a cluster is called density reachability.
A. Algorithm
Input: <x1, x2, ..., xn>, eps, minPoints
Output: <Cj>, where Cj is the jth cluster.
1. Mark each point xi UNCLASSIFIED.
2. Calculate the adjacency matrix of the input.
3. Let i = 1.
4. If xi is UNCLASSIFIED:
5. Get the neighbours of xi, i.e. the points whose distance to xi is not more than eps.
5.1 If the number of such neighbours is at least minPoints, create a new cluster, add xi and all its neighbours to it, and mark all of them CLASSIFIED.
5.2 Otherwise mark xi as NOISE.
6. Set i = i + 1.
7. If i > n, stop; else go to step 4.
In this algorithm, points having enough nearby neighbours form clusters, and points without sufficient density are ignored as noise, so only the high-density clusters remain. All points are initially marked UNCLASSIFIED. Each point is then visited and checked to see whether it has enough neighbours within the distance eps. If so, a cluster is formed and all the neighbours are added to it; if not, the point is declared noise. The algorithm thus successively recognizes all the clusters of high density.
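The listing above forms a cluster from a point and its immediate neighbours; full DBSCAN additionally expands each cluster through density-reachable points. A minimal pure-Python sketch with that expansion (helper names and test data are ours) is:

```python
import math

def region_query(points, i, eps):
    # Indices of all points within eps of points[i] (including i itself).
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_points):
    UNCLASSIFIED, NOISE = -2, -1
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != UNCLASSIFIED:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_points:
            labels[i] = NOISE          # may become a border point of a later cluster
            continue
        # Core point: start a new cluster and expand it through
        # density-reachable neighbours (seed-list expansion).
        labels[i] = cluster_id
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: absorbed, not expanded
            if labels[j] != UNCLASSIFIED:
                continue
            labels[j] = cluster_id
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_points:
                seeds.extend(j_neighbours)
        cluster_id += 1
    return labels

pts = [(0, 0), (0.1, 0), (0.2, 0), (5, 5), (5.1, 5), (5.2, 5), (99, 99)]
labels = dbscan(pts, eps=0.5, min_points=3)
print(labels)  # two dense runs become clusters 0 and 1; the isolated point stays noise (-1)
```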
VI. EXPERIMENT
We have evaluated the suitability of the clustering algorithms over a dataset of around 300 elements. Each element represents a crime along with its latitude and longitude. All three algorithms were evaluated over this dataset. The evaluation criteria were:
• Number of clusters (except for K-means, where it is an input).
• Average number of elements per cluster.
• Running time of the algorithm in milliseconds.
• Davies-Bouldin index.
The number of clusters is the count of clusters identified by each algorithm; in the case of K-means, we must provide it as input. Once the clusters and their elements are identified, the average number of elements per cluster lets us judge whether the clusters are large enough; using the average also lets us discount clusters with relatively few elements. The running time of the algorithm is the time elapsed between providing the input and producing the output; it was computed programmatically from the start of the code to the end of the algorithm. The Davies-Bouldin index is an evaluation index for clustering, calculated from the formula below:
DB = (1/k) Σi max(j ≠ i) [ (avg(i) + avg(j)) / distance(c(i), c(j)) ]    (2)
where k is the number of clusters, avg(i) is the average distance of the elements of cluster i to its centroid, c(i) is the centroid of cluster i, and distance(c(i), c(j)) is the Euclidean distance between the centroids of clusters i and j. For the Davies-Bouldin index, a smaller value is preferred.
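Eq. (2) translates directly into code. The sketch below (our own helper names and toy clusters) computes the index for pre-formed clusters and checks that compact, well-separated clusters score lower:

```python
import math

def centroid(cluster):
    # Component-wise mean of the points in a cluster.
    return tuple(sum(col) / len(cluster) for col in zip(*cluster))

def davies_bouldin(clusters):
    # Eq. (2): for each cluster take the worst ratio of combined intra-cluster
    # scatter to centroid separation, then average over all clusters.
    cents = [centroid(c) for c in clusters]
    scatter = [
        sum(math.dist(p, cents[i]) for p in c) / len(c)
        for i, c in enumerate(clusters)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k) if j != i
        )
    return total / k

tight = [[(0, 0), (0, 1)], [(10, 10), (10, 11)]]
loose = [[(0, 0), (0, 6)], [(10, 10), (4, 11)]]
print(davies_bouldin(tight) < davies_bouldin(loose))  # smaller index for the tighter clustering
```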
VII. RESULTS AND DISCUSSION
The dataset of around 300 crime locations, each containing a latitude and a longitude, was provided as input to the algorithms. The results of the evaluation over the different criteria are presented in the tables below.
The results for hierarchical clustering are shown in Table 1. Its total running time is very high compared to the other algorithms. From the input dataset 11 levels of clusters were formed, starting from each element being a cluster and ending with all points merged into one cluster. The Davies-Bouldin index is very high; its lowest value is produced at the 10th level, where the number of clusters is two.
For K-means clustering, the input parameter is the number of clusters. The algorithm was evaluated by providing different values of k, and each criterion was recorded for each input value. Results are provided in Table 2. The lowest Davies-Bouldin value is achieved when k is five.
The DBSCAN algorithm achieves the best performance. The input values to be provided are the neighbourhood distance and the minimum number of points required for elements of the dataset to be declared a cluster; results are provided in Table 3. The most attractive feature of DBSCAN is that the number of clusters is found automatically and stand-alone points are labelled noise; elements designated as noise are not included in the clusters. The smallest Davies-Bouldin index is attained when the neighbourhood distance is 0.5 and the minimum number of points ranges from 10 to 20.
Table 1: Evaluation of hierarchical clustering (total running time: 28757 ms)

Level | No. of clusters | Avg. elements per cluster | Davies-Bouldin index
1     | 321             | 1                         | 0
2     | 161             | 1                         | Infinity
3     | 81              | 3                         | Infinity
4     | 41              | 7                         | 374.7319
5     | 21              | 15                        | 25.1641
6     | 11              | 29                        | 41.1954
7     | 6               | 53                        | 56.5601
8     | 4               | 80                        | 30.0025
9     | 3               | 107                       | 10.5658
10    | 2               | 160                       | 0.0007498
11    | 1               | 321                       | 0
Table 2: Evaluation of the K-means clustering algorithm

k (number of clusters) | Running time (ms) | Avg. elements per cluster | Davies-Bouldin index
30                     | 1301              | 10                        | 0.8146526
25                     | 386               | 12                        | 0.8060029
20                     | 391               | 16                        | 0.7425921
15                     | 358               | 21                        | 0.7179025
10                     | 293               | 32                        | 0.6043710
5                      | 303               | 64                        | 0.340549
Table 3: Evaluation of the DBSCAN algorithm

Input parameters (neighbour threshold, minPoints) | No. of clusters | Running time (ms) | Avg. elements per cluster | Davies-Bouldin index
0.5, {10, 15, 20}                                 | 2               | 583               | 160                       | 0.0007498
0.05, {10}                                        | 2               | 432               | 158                       | 4.7728222
0.05, {15, 20}                                    | 3               | 469               | 103                       | 1.6291838
0.005, {10, 15, 20}                               | 4               | 419               | 74                        | 0.9802176
0.0005, {10, 15, 20}                              | 5               | 437               | 52                        | 1.2913722
VIII. CONCLUSION
Crime analysis is an important area, as the number of crimes is increasing these days. Data mining provides immense tools for crime analysis, and research on the application of data mining algorithms to crime analysis is an area that needs much more contribution. Crime hotspot analysis can provide police personnel with knowledge of the crime generators, so that they can take proactive action to control crime in hotspot areas. Of the algorithms evaluated, DBSCAN is the most effective for crime hotspot analysis.
REFERENCES
[1] A. K. Jain, M. N. Murty and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, September 1999.
[2] Vladimir Estivill-Castro, "Why So Many Clustering Algorithms: A Position Paper", ACM SIGKDD Explorations Newsletter, Vol. 4, Issue 1, pp. 65-75, June 2002.
[3] Xindong Wu et al., "Top 10 Algorithms in Data Mining", Springer Knowledge and Information Systems, Vol. 14, Issue 1, pp. 1-37, January 2008.
[4] Rui Xu and Donald Wunsch II, "Survey of Clustering Algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, May 2005.
[5] Sebastien Bubeck and Ulrike von Luxburg, "Nearest Neighbor Clustering: A Baseline Method for Consistent Clustering with Arbitrary Objective Functions", Journal of Machine Learning Research, Vol. 10, pp. 657-698, 2009.
[6] Nitin Bhatia and Vandana, "Survey of Nearest Neighbor Techniques", International Journal of Computer Science and Information Security (IJCSIS), Vol. 8, No. 2, pp. 302-305, 2010.
[7] Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), pp. 226-231, AAAI Press, 1996.
[8] Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander, "OPTICS: Ordering Points To Identify the Clustering Structure", ACM SIGMOD International Conference on Management of Data, ACM Press, pp. 49-60, 1999.
[9] D. Arthur and S. Vassilvitskii, "k-means++: The Advantages of Careful Seeding", Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, pp. 1027-1035, 2007.
[10] Hans-Peter Kriegel, Peer Kröger, Jörg Sander and Arthur Zimek, "Density-Based Clustering", WIREs Data Mining and Knowledge Discovery, Vol. 1, Issue 3, pp. 231-240, 2011.
http://wjst.wu.ac.th Information Technology
Walailak J Sci & Tech 2017; 14(10): 769-781.
Unify Framework for Crime Data Summarization using RSS Feed Service*

Tichakorn NETSUWAN and Kraisak KESORN*

Department of Computer Science and Information Technology, Faculty of Science, Naresuan University, Phitsanulok 65000, Thailand

(*Corresponding author's e-mail: kraisakk@nu.ac.th)

Received: 2 September 2016, Revised: 1 April 2017, Accepted: 1 May 2017

Abstract
This research presents online crime news analysis using text mining, a Natural Language Processing framework (GATE: General Architecture for Text Engineering), and data warehouse (DW) technologies. The proposed framework aims at extracting key features of the crime data available on a newspaper website and classifying them into crime categories, which are later transformed into a star schema for fast retrieval and online analytical processing (OLAP). The system can present data in a multidimensional structure to perform data analytics, supporting police officers in determining security policies to protect the locals and tourists who live in the risk areas. The main novelty of this framework is the demonstration of using information available through an RSS feed service to generate reports that support decision making. The experimental results show that the data extracted from the Internet can effectively represent the actual crime data that occurred in the study areas (a low error rate) and allow data analysts to gain insight into the information represented through OLAP.
Keywords: Crime news, data analytics, OLAP, text mining, data warehouse

Introduction
Crime reflects social problems and should be prevented or reduced for the good of the people and tourists living in an area. Typically, crime information is presented daily on websites and read by millions of readers worldwide. Unfortunately, this information is rarely used for the benefit of people, e.g. in safety and security aspects. According to a survey covering the first 6 months of 2016 by the Numbeo website [1], the crime index in Thailand is 52.16, the 4th highest among South-East Asian countries; Malaysia has the highest crime index at 65.56, followed by Vietnam at 53.45 and Cambodia at 52.72. Within Thailand, crime statistics show that Pattaya has the highest crime index at 58.55, followed by Phuket (57.38), Bangkok (47.70), and Chiang Mai (34.94). It is noticeable that these are the country's major tourist cities.
Typically, crime information is collected manually, only when someone reports to the officers, and is usually stored in a database that allows police officers to retrieve, analyze, and manually generate crime reports, which is very labor intensive. The main limitations of this method are 1) the data is not up to date, as the manual collection process is time consuming, and 2) a scalability problem can occur because the number of tables and the amount of data will grow in the future, and consequently querying and reporting speed suffer due to the integrity checking of the Database Management System (DBMS). To solve these limitations, we propose to exploit crime news available on the Internet to
*Presented at 1st International Conference on Information Technology: October 27th – 28th, 2016
support the decision making of police officers by constructing a crime news extraction and analytics system which includes the following functions:
1) Extract crime data from online news websites for later use in data analytics.
2) Store the crime data using a star schema in a DW, which efficiently resolves the scalability problem and effectively represents the data in multidimensional form, so-called online analytical processing (OLAP).
3) Generate interactive reports on this data via a website on which users can perform drill-down or roll-up operations.
The proposed system can aid police officers in forming policies to prevent crime that may occur in the risk areas.

Related works
News usually influences our lifestyles, and thus several researchers have exploited news to automate analytics in various respects. Shojaee et al. [2] studied classification learning algorithms to predict crime status using a crime dataset from the University of California, Irvine (UCI) Machine Learning Repository. However, the UCI crime dataset is not updated, which makes it practically useless. Yang et al. [3] studied learning approaches for detecting and tracking news events of interest using Reuters and CNN news stories; however, the presented method is only useful in some circumstances and needs users in the loop of event identification. Seo et al. [4] presented financial news analysis for intelligent portfolio management, a text classification agent that takes advantage of information retrieval techniques to complement quantitative financial information. These news articles were gathered from various electronic news providers, e.g. CNN Financial Network, Forbes, Reuters, NewsFactors, Motley Fool, CNet, ZDNet, Morningstar.com, Associated Press (AP), AP Financial, and Business Wire. Nonetheless, it is unclear how they store data for later use and how they support scalability. Online news allows users to easily access news anytime and anywhere via mobile devices and personal computers. Really Simple Syndication (RSS) technology enables publishers to syndicate data automatically, and its standard XML file format ensures compatibility with many different machines and programs. RSS feeds also benefit users who want to receive timely updates from favorite websites or to aggregate data from many sites. Wanglee et al. [5] developed an automatic news aggregator system on which users can read all news. However, crime news information has never been used in the literature to support decision making for police officers. Sudhahar et al. [6] presented a system for large-scale quantitative narrative analysis (QNA) of news corpora.
Its task is to identify the key actors (criminals and victims) in news and the actions they performed; the system demonstrated that men were most commonly responsible for crimes against the person, while women and children were most often the victims of those crimes. Wang et al. [7] analyzed online news and classified it into categories by means of adaptive clustering. Moreover, the news comments were classified into categories such as negative and positive, and were also grouped into clusters helping experts to get a view of what the common people think of the news; the main purpose of this work is to help experts find which news the people are most concerned about. However, its main drawback is that it does not store data for offline processing and cannot represent data in multiple dimensions, so it does not efficiently support decision making by users. Most research in the literature deploys a database system to store the extracted data, which usually suffers from integrity checking, resulting in poor querying speed when the number of tables and the volume of data become large. To the best of our knowledge, no crime news extraction and analytics system exists in the literature, and none uses a data warehouse as the architecture for data storage. Hence, we present this idea as the main contribution of this paper.

Materials and methods
To develop online crime news extraction and analysis framework, 4 major components are introduced as illustrated in Figure 1 and can be described as follows.
(1) Collection: the crime data in this work is focused on English news only and was automatically collected from http://www.pattayapeople.com exploiting its RSS feed service. The final collection in this research contains 1,596 crime news articles.
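As an illustration of this collection step, the sketch below parses a hand-written RSS 2.0 snippet with Python's standard library. The feed content and URL are invented stand-ins, not the actual feed of the news site:

```python
import xml.etree.ElementTree as ET

# A minimal RSS 2.0 snippet standing in for the live feed;
# real items from the news site would have the same element shape.
RSS = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Crime News</title>
  <item>
    <title>Tourist robbed on Thepprasit Road</title>
    <link>http://example.com/news/1</link>
    <pubDate>Mon, 13 Oct 2014 08:00:00 +0700</pubDate>
    <description>A motorcycle taxi rider reported being robbed ...</description>
  </item>
</channel></rss>"""

def collect_items(rss_text):
    # Each <item> element becomes one record for the extraction stage.
    root = ET.fromstring(rss_text)
    return [
        {child.tag: (child.text or "") for child in item}
        for item in root.iter("item")
    ]

items = collect_items(RSS)
print(items[0]["title"])
```

Each collected record carries the title, link, publication date, and description fields that the extraction stage then analyses.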
(2) Extraction and analysis: the collected data is analyzed and key information is extracted using the GATE framework [8]. GATE is preferred because it is an open-source framework and is well known among researchers in this area. This step includes 3 sub-processes: 1) named entity extraction to detect names, locations, organizations, addresses, dates, etc.; 2) crime term detection to find key terms related to crimes, using the Oxford Learner's Dictionaries [9] and MyVocabulary [10] to detect crime keywords (examples of crime terms are shown in Table 1); 3) processing all crime terms in order to extend them to other related data. Figure 2 demonstrates examples of indirectly relevant data that can be extended from the crime terms, e.g. gender, criminal, and victim. To find this other relevant information, several related pieces of information are used, e.g. the system date, name prefixes, and structural grammatical formalisms. Table 2 shows examples of structural grammatical rules for criminal and victim identification. At this stage of the experiment, the rules are fixed with respect to the grammatical formalism; if a user wants to add more rules, he can do so manually.
(3) Classification: news classification uses text mining to classify crime news into 5 categories. Table 3 depicts the crime categories [11] used in this work. This task aims at showing the crime types occurring most frequently in Pattaya. To classify crime news, the frequency of every extracted keyword is counted and put into a matrix (Table 4).
Figure 1 Online crime news extraction and analysis framework.
Finally, the importance of keywords to news articles is computed using TF-IDF (Eqs. (1) and (2)) [12], and an Artificial Neural Network (ANN) is applied to the document classification task.
IDF_k = log(n / DF_k)    (1)

TF-IDF_jk = TF_jk × IDF_k    (2)

where n is the number of news articles in the collection, TF_jk is the frequency of term k in document j, and DF_k is the number of documents containing term k.
(4) Representation: the extracted crime data is represented in multidimensional form using OLAP, a process that integrates data based upon a star schema. Microsoft SQL Server 2008 R2 is used as a
tool in this work for DW construction. It helps to perform data analytics in multidimensional tables, pre-summarized across dimensions to drastically improve query speed over relational databases. In addition, designing data at multiple levels of aggregation allows users to perform drill-down or roll-up operations: drill-down presents data at a level of increased detail, while roll-up is the reverse operation, decreasing the detail of the data. Figure 3 shows an example of a star schema, which comprises a fact table and dimension tables.
Figure 2 Example of other indirectly relevant information extended from crime data.
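The roll-up and drill-down operations described above can be illustrated with plain aggregation over a toy fact table; the column names and rows below are invented stand-ins, not the paper's actual schema:

```python
from collections import Counter

# Toy fact rows: (year, month, region, crime_category), standing in for
# the star schema's fact table joined to its date/region/category dimensions.
facts = [
    (2014, 10, "South Pattaya", "Bodily harm case"),
    (2014, 10, "South Pattaya", "Serious offense"),
    (2014, 11, "North Pattaya", "Serious offense"),
    (2015, 1,  "South Pattaya", "Serious offense"),
]

# Drill-down: counts at the finer (year, month) level of detail.
by_month = Counter((y, m) for y, m, _, _ in facts)

# Roll-up: aggregate the month level away, leaving yearly totals.
by_year = Counter(y for y, _, _, _ in facts)

print(by_month[(2014, 10)], by_year[2014])  # 2 3
```

In the real system these aggregations are pre-summarized by the OLAP engine across each dimension, which is what gives the speed advantage over querying the relational tables directly.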
Performance evaluations and discussion

To evaluate the presented system, experiments have been conducted using a customized dataset containing 1,596 crime news articles. Since Pattaya has the highest number of crimes, we scope the crime news to this area only. Several measures are used to validate the performance of the proposed system, e.g. lift chart, ROC curve, precision, recall, and F-measure. The experiments are divided into several sections in order to investigate all aspects of the system's performance, such as classification performance, data extraction efficiency, and the error rate of the extracted data compared to the actual data. Finally, an example OLAP report is demonstrated. The evaluation results are described in the following sections.

Table 1 Example crime terms.
Crime terms: killing, murder, slash, jump, stab, shoot, fire, smash, rape, steal, hit, fight, punch, skid, threaten, rob, snatch, extort, ransack, pickpocket, theft, impound, cheat, pretend, bribe, bribing, attack, blow, kidnapping, abduct, electrocuted, arm, terrorist, prostitutes, YABA, kick, assault, violate, thieving, drug, gamble, addict
Example sentence

<Sentence> Dongtan sub police station was called out by the Sawang Boriboon Rescue Volunteers in the early hours of <Date> Monday, 13th October </Date> to investigate an attempted suicide by a Swiss male expat. <Person> Mr. Hans Maurer </Person>, aged around 60, was found unconscious in a parked Chevrolet in a secluded area of the Tesco Lotus car park off <Location> Thepprasit Road </Location>. </Sentence>
Crime data extraction: Date: Monday, 13th October; Person: Mr. Hans Maurer; Location: Thepprasit Road.
Other indirectly relevant information: Year: 2014; Gender: Male; Identified person: Victim; Region: South Pattaya.
Table 2 Criminal identification rules based on sentence structures.

Criminal identification rules:
Rule 1: the sentence contains one of the keywords "arrest", "caught", "suspect", "detained".
  Example: Sriracha police announced the arrest of Mr. Watid on Friday.
Rule 2: <Person> + active voice, or passive voice + <Person>.
  Example: Mr. Vladimir Ragoziw who was hit by Mr. Paing riding along this busy highway.

Victim identification rules:
Rule 1: the sentence contains the keyword "victim".
  Example: The victim was a 26 year old Russian tourist, Mr. Chore Dnichonko, identified from a rental motorcycle paper found on the body.
Rule 2: <Person> + passive voice.
  Example: Pattaya police received a call from Mr. Rewat, a motorcycle taxi rider, to report that he had just been threatened and robbed by a male customer.
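The rules in Table 2 can be approximated in code; the sketch below is a rough, illustrative heuristic that assumes <Person> spans are already tagged, and the function name, keyword list, and regular expression are ours, not the authors':

```python
import re

# Rule 1 (criminal): arrest-related keywords, per Table 2.
CRIMINAL_KEYWORDS = ("arrest", "caught", "suspect", "detained")

def classify_person(sentence: str) -> str:
    """Very rough approximation of the Table 2 identification rules."""
    lower = sentence.lower()
    # Criminal rule 1: the sentence mentions an arrest-related keyword.
    if any(k in lower for k in CRIMINAL_KEYWORDS):
        return "criminal"
    # Victim rule 1: an explicit "victim" keyword.
    if "victim" in lower:
        return "victim"
    # Victim rule 2: a tagged person followed by a passive construction
    # (crudely detected as "was/were + past participle in -ed").
    if re.search(r"</Person>[^.]*?\b(was|were)\s+\w+ed\b", sentence):
        return "victim"
    return "unknown"

print(classify_person("Sriracha police announced the arrest of Mr. Watid."))
```

A real system would use a parser to detect voice rather than a regex, but the control flow mirrors the rule ordering in Table 2.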
Table 3 Crime categories.

1. Serious offense: Murder, Kidnapping, Arson
2. Bodily harm case: Manslaughter, Death by negligence, Attempted murder, Assault, Rape
3. Crimes against property: Theft, Snatching, Blackmail, Extortion, Receiving stolen property, Vandalism
4. Interesting case: Motorcycle theft, Car theft, Bus robbery, Taxi robbery, Cheating and fraud, Misappropriation
5. Corrupt state case: Weapon act, Gambling act, Narcotics act, Prostitution act, Materials act
Table 4 News-Terms frequency matrix.

News                     TF (term frequency)
          slash  stab  attack  shoot  rob  steal  jump  kidnapping  kick
News1       1     0      0       0     0     0      0       0         1
News2       0     2      1       0     0     0      0       0         0
News3       0     0      1       5     0     0      0       0         0
...        ...   ...    ...     ...   ...   ...    ...     ...       ...
News1596    0     0      1       0     0     0      0       1         0
DF (document
frequency) 20    30     35      39    31   172      8      28         5
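A matrix like Table 4 is typically combined with the document frequencies into TF-IDF weights (the paper cites Joachims [12] for this weighting). A minimal sketch using the standard tf × log(N/df) formula, with toy counts shaped like Table 4:

```python
import math

# Term frequencies per news item (rows of Table 4, toy values).
tf = {
    "News1": {"slash": 1, "kick": 1},
    "News2": {"stab": 2, "attack": 1},
    "News3": {"attack": 1, "shoot": 5},
}
N = 1596  # total number of news items in the dataset
# Document frequencies (bottom row of Table 4).
df = {"slash": 20, "stab": 30, "attack": 35, "shoot": 39, "kick": 5}

def tfidf(doc: str, term: str) -> float:
    # Classic weighting: tf * log(N / df); 0 when the term is absent.
    return tf[doc].get(term, 0) * math.log(N / df[term])

print(tfidf("News3", "shoot"))
```

Rare terms (low df) thus weigh more than common ones, which is what makes the frequency matrix useful as classifier input.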
Figure 3 Star schema of crime news in data warehouse.
News classification performance

Two state-of-the-art data mining techniques are deployed and compared with the presented technique in order to evaluate its classification power. If the extracted data has high quality, it will enhance the classification performance. We also want to study which data mining technique is the best classification model for crime news information. In this experiment, Artificial Neural Network (ANN), Decision Tree (DT), and Naïve Bayes (NB) are compared. The lift chart in Figure 4 illustrates the classification performance of the 3 models, where the x-axis represents the percentage of the data population and the y-axis the percentage of correctly classified data. The ideal line, drawn as a diagonal (dark bold) line, represents a model with 100 % correct classification and is used to measure classification performance. The lift chart shows that ANN is superior to the other 2 models, as its performance is closer to the ideal model. Its average correct classification is about 84.16 %, while NB and DT obtain 81.86 and 75.12 %, respectively. To confirm this result, the Receiver Operating Characteristic (ROC) curve is used to re-validate the classification power of the 3 models. A ROC curve is a line chart that plots the true positive (TP) rate of a classifier against its false positive (FP) rate. The best possible prediction technique would yield a point in the upper left corner of the ROC chart, representing 100 % true positives and 0 % false positives, whereas a random model would give points along the diagonal line from the bottom left to the top right corner, which divides the chart. Areas above the diagonal represent good classification results, while areas below the diagonal indicate poor results [13,14]. Figure 5 demonstrates that ANN obtains the highest performance (area under the curve = 0.86), followed by NB and DT (area under the curve = 0.83 and 0.64, respectively). This result is consistent with the result in Figure 4 and thus confirms that ANN is the best model for this task. However, some ANN parameters need to be adjusted to increase its prediction accuracy; the next section describes ANN parameter tuning.
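The ROC comparison above can be reproduced in miniature with scikit-learn; the sketch below computes an ROC curve and its area under the curve (AUC) on synthetic scores, not the paper's data:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic ground truth (1 = positive class) and classifier scores.
y_true = [0, 0, 1, 1, 0, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.3, 0.9]

# AUC = probability a random positive is ranked above a random negative;
# here 15 of the 16 positive-negative pairs are ordered correctly.
auc = roc_auc_score(y_true, scores)

# The curve itself: FP rate vs. TP rate at each score threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)

print(auc)  # 0.9375
```

An AUC of 0.5 corresponds to the diagonal (random) line, which is why values like the paper's 0.86 for ANN sit well above it.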
Artificial Neural Network parameter tuning

ANN is investigated more deeply for parameter adjustment in order to obtain the highest performance of the model. Some ANN parameters need to be tuned to optimize the classification power. This section investigates the effect of 2 parameters (Hidden_node_ratio and Holdout_seed) on the performance of the ANN prediction model by varying one parameter at a time [15]. The important parameters are adjusted as shown in Table 5. Various pairs of (β, δ) values were tried, and the one with the best accuracy was selected. These parameter adjustments result in better classification performance, obtaining 87.40, 88.10, and 87.44 % for precision, recall, and F-measure, respectively, as illustrated in Figure 6.
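The one-parameter-at-a-time tuning described above is, in effect, a small grid search. A hedged sketch with scikit-learn's MLPClassifier follows; SQL Server's Hidden_node_ratio and Holdout_seed have no direct scikit-learn equivalents, so hidden_layer_sizes stands in as the varied parameter, and the data is synthetic:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the news feature matrix (shape is illustrative).
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Vary one hyper-parameter dimension, keeping everything else fixed,
# and keep the value with the best cross-validated F-measure.
grid = GridSearchCV(
    MLPClassifier(max_iter=500, random_state=0),
    param_grid={"hidden_layer_sizes": [(10,), (20,), (40,)]},
    scoring="f1",
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_)
```

The same loop can then be repeated over a second parameter (as the paper does with the holdout seed), holding the first at its best value.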
[Figure 4 data: lift chart; x-axis "Overall Population %" (0-100), y-axis "Population Correct %" (0-100); series: Ideal Model, Artificial Neural Network, Decision Tree, Naïve Bayes.]
Figure 4 Lift chart comparing the classification performances of the 3 models.

Figure 5 ROC curve comparing the 3 classification models.
Table 5 Artificial Neural Network parameters.

Parameter                  Default  Range/Adjusted value
Hidden_node_ratio (β)      4        [0, 4]
Holdout_seed (δ)           0        [0, 4]
Maximum_input_attributes   255      107
Maximum_output_attributes  255      1
Sample_size                1,000    1,569
Figure 6 Classification performance after ANN parameters adjustment.
Data extraction efficiency

This experiment evaluates the GATE framework exploited in this research for the Natural Language Processing (NLP) task. We study the performance of GATE in identifying person, location, and date, measured by precision, recall, and F-measure. The result is shown in Figure 7. The lowest accuracy obtained from GATE is for person identification, at about 77.85 % precision. We analyzed this result and found that GATE poorly identifies persons with Thai names. This is because GATE is mainly designed for English named entity processing and, consequently, it is not able to detect Thai names well, especially when the name prefix is missing. This lowers the named entity recognition (NER) power of GATE. The same applies to location recognition, because locations are also Thai names. In contrast, dates in crime news are written in a standard format; as a result, GATE recognizes them effectively and obtains the highest performance, at 90.29, 94.57, and 86.48 % for precision, recall, and F-measure, respectively.
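Entity-level precision, recall, and F-measure as reported above can be computed by comparing predicted entity spans against gold spans; a minimal sketch with made-up spans (each tuple is illustrative: entity type, start offset, end offset):

```python
# Gold-standard and predicted entity spans (toy values).
gold      = {("Person", 10, 25), ("Date", 40, 58), ("Location", 70, 85)}
predicted = {("Person", 10, 25), ("Date", 40, 58), ("Location", 60, 85)}

# A prediction counts as correct only on an exact span match.
tp = len(gold & predicted)

precision = tp / len(predicted)          # correct / all predicted
recall    = tp / len(gold)               # correct / all gold entities
f_measure = 2 * precision * recall / (precision + recall)

print(precision, recall, f_measure)
```

Here the mis-bounded Location span costs both precision and recall, which is exactly how partial NER errors show up in scores like Figure 7's.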
Grammatical formalism rules evaluation

Since we extended the result of the GATE framework to identify other relevant data, e.g. criminal, victim, and gender (Table 2), we need to evaluate the performance of the presented grammatical formalism rules. Figure 8 demonstrates their performance in identifying criminal and victim. Victims are identified more correctly than criminals, at 83.22 % precision, because it is easier to identify victims based on passive voice, whereas criminals are described using various grammatical formalisms. Hence, criminal identification performs more poorly than victim identification.
[Figure 6 data: Default ANN parameters — Precision 84.16 %, Recall 83.40 %, F-measure 83.49 %; Adjusted ANN parameters — Precision 87.40 %, Recall 88.10 %, F-measure 87.44 %.]
Figure 7 Effectiveness of crime data extraction using GATE framework.
Figure 8 Criminal and victim identification using structure grammatical formalism rules.
[Figure 7 data: radar chart of precision, recall, and F-measure for person, location, and date extraction.]
[Figure 8 data: Criminal — precision 80.14 %, recall 82.48 %, F-measure 81.29 %; Victim — precision 83.22 %, recall 80.95 %, F-measure 82.07 %.]
Figure 9 Normalized data comparison between crime data extracted from the Internet and actual data collected by the police.

Figure 10 Root Mean Squared Error (RMSE) value of each crime category.
[Figure 9 data: five panels (Serious offense, Bodily harm case, Crimes against property, Interesting case, Corrupt state case), each plotting normalized actual vs. online crime counts for months 1-12.]
[Figure 10 data: RMSE per category — Serious offense 24.21; Bodily harm case 18.87; Crimes against property 14.09; Interesting case 23.60; Corrupt state case 876.57.]
Extracted data and actual crime data comparison

To investigate whether crime news available on the Internet through RSS feeds can represent the actual crime counts collected by police officers, the 2 sources of data are compared and the error rate between them computed. In this work, the Root Mean Squared Error (RMSE) is deployed to assess the quality of the extracted crime data. Figure 9 depicts the comparison of crime counts from the 2 sources in 2015. It is clearly seen that some categories are consistent with the actual data, such as serious offense, bodily harm case, crimes against property, and interesting case (fraud, deceit, cheating, etc.), whereas the corrupt state case diverges from the actual data. After deeper investigation of this category, we found that the studied website contains very few such news items, the main reason being that readers are not interested in this kind of news, so the website tends not to publish it. As a result, the number of these crimes reported online differs greatly from the actual cases, leading to the highest RMSE, up to 876.57, while the other types of crimes obtain error rates below 25, as shown in Figure 10.
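The RMSE used above can be computed directly from the two monthly series; a minimal sketch on illustrative counts (not the paper's data):

```python
import math

# Toy monthly counts for one crime category (12 months, illustrative).
actual = [10, 12, 9, 14, 11, 13, 10, 12, 15, 11, 9, 13]
online = [ 9, 13, 9, 12, 11, 14, 11, 12, 14, 10, 9, 12]

def rmse(a, b):
    """Root Mean Squared Error between two equal-length series."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

print(rmse(actual, online))
```

A close online/actual pairing yields a small RMSE, while a systematically under-reported category (like the corrupt state case) produces a large one.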
Example of an OLAP report

OLAP operates on the multidimensional data model, or star schema. It allows data analysts to gain insight into the information through fast, consistent, and interactive access. Figure 11 shows an example OLAP report of crime in each area, together with the numbers of criminals and victims in Pattaya in 2015, on which users can perform drill-down and roll-up operations. This report can support police officers in deciding on policies to decrease the number of crimes in risk areas, for example by arranging for more police patrols at high-crime locations.
Figure 11 Generated crime report grouped by areas of Pattaya, Thailand, in 2015.
Department of Computer Science and Information Technology, Faculty of Science, Naresuan University, Phitsanulok, Thailand
Conclusions

This research proposed a prototype system that exploits online crime news to support the decision making of police officers in determining security and safety policy for locals and visitors. This unified framework automatically extracts crime data from the RSS feeds of a newspaper website, restructures it into a star schema, and stores it in a data warehouse, allowing users to perform data analysis effectively through drill-down and roll-up operations. Several experiments have been conducted to validate the performance of the presented framework, with satisfactory results. Different from the state of the art, we focus on the crime domain and exploit the GATE framework to process and analyze news, which has not been done in the previous literature. Although the presented system works effectively on this task, it cannot replace the existing police systems: it can only estimate the numbers of crimes and cannot replace the official data collected by the police. However, it can serve as a guideline for the police to get an overview of crimes and their trends, which benefits resolving crime problems in risk areas and improving the quality of life of natives and tourists. Another limitation of this work is that the grammatical formalism rules are fixed; thus, they might not cover all sentence structures, and as a consequence the system cannot detect key features in some cases. Our future work could make these rules more dynamic and add GIS to the framework to plot crime density on a map, allowing officers to monitor the crime situation in the desired areas more easily.

Acknowledgements
This work has been supported by Naresuan University Research Funding, contract number R2559C073.

References
[1] South-Eastern Asia: Crime Index by Country 2016 Mid-Year, Available at: http://www.numbeo.com/crime/rankings_by_country.jsp?title=2016-mid&region=035, accessed September 2016.
[2] S Shojaee, A Mustapha, F Sidi, and A J Marzanah. A study on classification learning algorithms to predict crime status. Int. J. Digit. Content Tech. Its Appl. 2013; 7, 361-9.
[3] Y Yang, JG Carbonell, RD Brown, T Pierce, BT Archibald and X Liu. Learning approaches for detecting and tracking news events. IEEE Intell. Syst. 1999; 14, 32-43.
[4] YW Seo, J Giampapa and K Sycara. Financial News Analysis for Intelligent Portfolio Management. Robotics Institute, Carnegie Mellon University, Pittsburgh, PA, Technical Report CMU-RI-TR-04-04, 2004.
[5] J Wanglee, C Thaina, S Yodkaew and L Preechaveerakul. Automatic news aggregator system based on users’ preference. In: Proceedings of the Conference of Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology Association of Thailand. Bangkok, Thailand, 2009, p. 155-60.
[6] S Sudhahar, R Franzosi and N Cristianini. Automating quantitative narrative analysis of news data. In: Proceedings of the 2nd Workshop on Applications of Pattern Analysis. Castro Urdiales, UK, 2011, p. 63-71.
[7] W Wang, X Cui and A Wang. News analysis based on meta-synthesis approach. In: Proceedings of the 32nd Annual IEEE International Computer Software and Applications Conference. Turku, Finland, 2008, p. 923-8.
[8] H Cunningham, D Maynard, K Bontcheva and V Tablan. GATE: An architecture for development of robust HLT applications. In: Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics. Philadelphia, 2002, p. 168-75.
[9] Committing Crime Topic from the Oxford Advanced Learner’s Dictionary, Available at: http://www.oxfordlearnersdictionaries.com/topic/committing_crime, accessed September 2016.
[10] Crime Vocabulary, Crime Word List, Available at: https://myvocabulary.com/word-list/crime-vocabulary, accessed September 2016.
[11] P Krongyuth, K Pattanagul, Y Tongrasit, W Chaisiwamongkol, S Ungpansattawong, R Naimsanit and A Maneesriwongul. A multivariate statistical analysis of crime in provincial level of Thailand. KKU Res. J. 2013; 18, 642-50.
[12] T Joachims. A probabilistic analysis of the rocchio algorithm with TFIDF for text categorization. In: Proceedings of the 14th International Conference on Machine Learning. San Francisco, USA, 1997, p. 143-51.
[13] JA Swets. Signal Detection Theory and ROC Analysis in Psychology and Diagnostics: Collected Papers, Available at: https://www.questia.com/library/91082318/signal-detection-theory-and-roc-analysis-in-psychology, accessed September 2016.
[14] J Fogarty, RS Baker and SE Hudson. Case studies in the use of ROC curve analysis for sensor-based estimates in human computer interaction. In: Proceedings of Graphics Interface. Victoria, British Columbia, 2005, p. 129-36.
[15] K Kesorn, P Ongruk, J Chompoosri, U Thavara, A Tawatsin and P Siriyasatien. Morbidity rate prediction of Dengue Hemorrhagic Fever (DHF) using the Support Vector Machine and the Aedes aegypti infection rate in similar climates and geographical areas. Plos One 2015; 10, e0125049.
JUS-636 Topic 4 Crime Analysis Data
Complete each of the prompts below. NOTE: Be sure to support your responses with scholarly research and information from government websites. Additionally, please note the word counts for each prompt.
1. Utilize the GCU Library and government crime analysis websites to research CAD (computer-aided dispatch) systems and RMS (records management systems), the two major crime analysis systems used in law enforcement.
A. What kind of data can crime analysts retrieve from these systems? (200-250 words)
B. How do crime analysts use this data to create a crime analysis product? (400-500 words)
2. Access the “Sample products Page” on the IACA website: https://iaca.net/sample-crime-analysis-products/
Choose a sample crime analysis product. Identify what types of data sources you would utilize to create the product you have chosen.
A. Imagine you are a crime analyst tasked with creating a bulletin for a "beat" officer near downtown San Jose to help the officer better understand the types of crimes that occur in the area. Complete the crime analysis template (below) by doing the following:
Access the Shawnee Police “All Points Bulletin” to help see what a final product looks like. https://www.iaca.net/CAUDC/Products/ShawneeCigSmashGrabRedacted.pdf
Access the San Jose Crime map. http://www.sjpd.org/crimestats/CrimeReports.html
In the search bar on the map, type in “Downtown San Jose California.”
Complete the template by using the crime data from the San Jose downtown area. List a total of 30 crimes, and include four different types of crimes.
Known Incidents Chart
| Date | Day | Time | Name | Address | City(SJ,CA) |
| 1. | |||||
| 2. | |||||
| 3. | |||||
| 4. | |||||
| 5. | |||||
| 6. | |||||
| 7. | |||||
| 8. | |||||
| 9. | |||||
| 10. | |||||
| 11. | |||||
| 12. | |||||
| 13. | |||||
| 14. | |||||
| 15. | |||||
| 16. | |||||
| 17. | |||||
| 18. | |||||
| 19. | |||||
| 20. | |||||
| 21. | |||||
| 22. | |||||
| 23. | |||||
| 24. | |||||
| 25. | |||||
| 26. | |||||
| 27. | |||||
| 28. | |||||
| 29. | |||||
| 30. |
Incidents Map – Paste in a screenshot of the map of the downtown area of San Jose, CA, that you are using for the crime analysis.
(Paste Map here)
3. From the data you have delineated above, apply Routine Activity Theory to each of the four crime types to determine the crime incidents and the behaviors of the environment, probable suspects, and probable victims likely to be involved in these types of crimes. (700-800 words)
© 2020. Grand Canyon University. All Rights Reserved.
