K-means clustering and its real use case in the security domain

AYUSH BAJPAI
7 min read · Jul 20, 2021

Hello friends, and welcome to my new article. In this article, I am going to discuss K-means clustering in detail. So fasten your seat belt and get ready to learn about K-means clustering and its use case in the security domain.

🧐What is K-means Clustering?

It is one of the unsupervised learning algorithms that is used to solve clustering problems in the world of machine learning.

Clustering is a major task in exploratory data mining and a common technique for statistical data analysis. It is used in various fields, including machine learning, pattern recognition, image analysis, and information retrieval.

🚗Key terms used in the rest of the article:

Clustering:

Clustering is the task of dividing the data points into a number of groups such that data points in the same group are more similar to each other than to those in other groups. In simple words, the aim is to separate the data into groups with similar traits and assign them to clusters.

Cluster: a group of similar data points. In a machine learning algorithm, "group" and "cluster" mean the same thing.

📖How does K-means Clustering work?

The flowchart below shows how this algorithm works:

K-means clustering is used to find clusters in the given data. To run it, we first need to choose the number of clusters K, which we can do by trial and error or with the elbow method.

So let's see how the elbow method helps us find the number of clusters in the given unlabeled data:

🎯Step 1:

The elbow method is a popular way to choose the number of clusters. It consists of running K-Means clustering on the dataset for a range of K values.

Next, we use the within-cluster sum of squares (WSS) as a measure to find the optimum number of clusters for a given data set. WSS is defined as the sum of the squared distances between each member of a cluster and its centroid.
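To make the definition concrete, here is a minimal sketch of WSS in plain Python. The function name `wss` and the toy points are made up for illustration; they are not taken from any dataset in this article.

```python
def wss(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for point, label in zip(points, labels):
        centroid = centroids[label]
        total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total


# Toy data (hypothetical): two clusters with two points each.
points = [(1.0, 1.0), (1.0, 3.0), (8.0, 8.0), (10.0, 8.0)]
centroids = [(1.0, 2.0), (9.0, 8.0)]  # mean of each pair of points
labels = [0, 0, 1, 1]                 # index of the centroid each point belongs to

print(wss(points, centroids, labels))  # 4.0
```

Every point here sits exactly one unit away from its centroid, so the four squared distances add up to 4.0.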

It is measured for each value of K. WSS always decreases as K grows, so we do not simply pick the K with the lowest WSS. Instead, we look for the value of K after which the decrease levels off.

Now, we draw a curve between WSS and the number of clusters.

Here, WSS is shown on the y-axis and number of clusters on the x-axis.

You can see that there is a very gradual change in the value of WSS as the K value increases from 2.

So, you can take the elbow point as the optimal value of K. Here it should be two, three, or at most four. Beyond that, increasing the number of clusters does not dramatically change the value of WSS; it stabilizes.
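The elbow procedure can be sketched in a few lines of plain Python. This is only an illustration under some assumptions: the helper names (`sq_dist`, `kmeans`, `wss`), the fixed random seed, and the nine toy points are all made up here, and the K-means loop is a bare-bones version rather than a production implementation.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Bare-bones K-means: random start, then assign/update until labels stop changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k random data points
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


def wss(points, centroids, labels):
    """Within-cluster sum of squares for a finished clustering."""
    return sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))


# Hypothetical toy data with three visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]

wss_by_k = {}
for k in range(1, 7):
    centroids, labels = kmeans(points, k)
    wss_by_k[k] = wss(points, centroids, labels)

for k, value in wss_by_k.items():
    print(k, round(value, 2))
```

Plotting `wss_by_k` with K on the x-axis and WSS on the y-axis gives the elbow curve. Because the start is random, a single run per K can land in a poor local optimum, so in practice it is common to repeat each K with several seeds and keep the lowest WSS.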

⚙Step 2:

Let’s assume that these are our delivery points:

We can now randomly initialize two points called the cluster centroids.

Here, C1 and C2 are the centroids assigned randomly.
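One simple way to do this random initialization is to pick two of the data points themselves as starting centroids. That choice is an assumption for this sketch (the article does not say how the random points are drawn), and `init_centroids` and the sample coordinates are hypothetical.

```python
import random


def init_centroids(points, k, seed=None):
    """Pick k distinct data points at random to serve as the initial centroids."""
    rng = random.Random(seed)
    return rng.sample(points, k)


# Hypothetical delivery points as (x, y) coordinates.
points = [(2, 3), (3, 3), (6, 8), (7, 9), (8, 8)]

c1, c2 = init_centroids(points, k=2, seed=42)
print("C1:", c1, "C2:", c2)
```

Passing a `seed` makes the run repeatable; leaving it as `None` gives a different start each time.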

📖Step 3:

Now the distance of each location from each centroid is measured, and each data point is assigned to the centroid closest to it.

This is how the initial grouping is done:
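In code, this assignment step might look like the sketch below. The function name `assign`, the points, and the two starting centroids are made up for illustration.

```python
def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    labels = []
    for p in points:
        # Squared Euclidean distance is enough for finding the nearest centroid.
        dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        labels.append(dists.index(min(dists)))
    return labels


# Hypothetical delivery points and two starting centroids.
points = [(1, 1), (2, 1), (8, 9), (9, 8)]
c1, c2 = (0, 0), (10, 10)

print(assign(points, [c1, c2]))  # [0, 0, 1, 1]
```

The first two points are closer to C1 (index 0) and the last two are closer to C2 (index 1).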

🤷‍♂️Step 4:

Compute the actual centroid of data points for the first group.

👉Step 5:

Reposition the random centroid to the actual centroid.

📡Step 6:

Compute the actual centroid of data points for the second group.

🛠Step 7:

Reposition the random centroid to the actual centroid.
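Steps 4 to 7 boil down to one operation per group: take the mean of the group's points and move the centroid there. Here is a minimal sketch with made-up points; `update_centroids` is a hypothetical helper name, and it assumes every group has at least one point.

```python
def update_centroids(points, labels, k):
    """Move each centroid to the mean of the points currently assigned to it."""
    centroids = []
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        # Assumes the group is not empty; average each coordinate separately.
        centroids.append(tuple(sum(coords) / len(members) for coords in zip(*members)))
    return centroids


# Hypothetical points with their current group labels.
points = [(1, 1), (3, 1), (8, 9), (10, 9)]
labels = [0, 0, 1, 1]

print(update_centroids(points, labels, k=2))  # [(2.0, 1.0), (9.0, 9.0)]
```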

🎯Step 8:

Once the cluster assignments stop changing, the k-means algorithm is said to have converged.

The final clusters with centroids C1 and C2 are shown below:
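Putting the steps together, the whole loop can be sketched as follows. This is a simplified illustration, not a reference implementation: the starting centroids are passed in by hand so the run is repeatable, and the points are made up.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, max_iter=100):
    """Repeat assign and update until no point changes its cluster."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # clusters are static, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels


# Hypothetical points and deliberately poor starting centroids.
points = [(1, 1), (2, 1), (1, 2), (8, 8), (9, 8), (8, 9)]
centroids, labels = kmeans(points, [(1, 1), (2, 1)])

print(labels)  # [0, 0, 0, 1, 1, 1]
print(centroids)
```

Even though both starting centroids sit inside the same group, the second one is pulled toward the far points after the first update, and the labels settle after a couple of passes.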

⚙ Use case of K-means clustering in the Security Domain

Log Classification using K-Means Clustering for Identify Internet User Behavior

The Internet has become a necessity in today’s society; any information is accessible on the internet via a web browser. However, these activities can have an impact on users, one of which is a change in behavior. This study focuses on the activities of Internet users based on the network log data of an educational institution.

📙K-Means Clustering Based on the Number of Visitors:

In this category, the data is clustered based on the number of visits to each website. Clustering was performed using the K-Means algorithm in SPSS and RapidMiner. Both tools were used to check whether the cluster results are suitable for proceeding to the cyber-profiling analysis stage.
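Before any clustering can happen, the log has to be turned into one number per website. The study did this with its own tools; the sketch below only illustrates the idea in Python, assuming the log has already been reduced to a list of requested URLs. The URLs are placeholder example domains, not data from the study.

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical log entries, already reduced to the requested URL.
log_urls = [
    "http://example.com/index.html",
    "http://example.org/news",
    "http://example.com/about",
    "https://example.net/",
    "http://example.com/contact",
]

# Count how many times each website (hostname) was visited.
visits = Counter(urlparse(url).hostname for url in log_urls)

for site, count in visits.most_common():
    print(site, count)
```

The resulting visit counts, one per website, are the one-dimensional values that K-means then groups.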

The K-Means implementation in SPSS and RapidMiner produced three clusters: low, medium, and high. The first cluster, with low traffic levels, has 1479 member websites; the second cluster, with moderate traffic levels, has 126 member websites; and the third cluster, with high traffic levels, has 33 member websites. The initial cluster centers used in the clustering process can be seen in Table 2.

The initial cluster centers are set from the highest value, the average value, and the smallest value in the data. In this study, eight iterations were needed to reach a stable result. The initialization was performed by SPSS and RapidMiner. The iteration history of the clustering process can be seen in Table 3.
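To get a feel for this kind of initialization, here is a rough Python sketch that starts three centers at the smallest, average, and highest value and then runs K-means on one-dimensional visit counts. It is not what SPSS or RapidMiner do internally, and the visit counts are made up for illustration; they are not the study's data.

```python
def initial_centers(values):
    """Start the low, medium, and high centers at the min, mean, and max value."""
    return [min(values), sum(values) / len(values), max(values)]


def kmeans_1d(values, centers, max_iter=100):
    """K-means on plain numbers, using absolute difference as the distance."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda j: abs(v - centers[j])) for v in values
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return centers, labels


# Hypothetical visits per website.
visits = [1, 2, 3, 4, 20, 22, 24, 60, 62, 64]

centers, labels = kmeans_1d(visits, initial_centers(visits))
names = ["low", "medium", "high"]

print(centers)  # [2.5, 22.0, 62.0]
print([names[lab] for lab in labels])
```

With this toy data the centers settle after a single update; the real log data in the study needed eight iterations.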

Table 3 shows that 8 (eight) iterations were needed to obtain the final clusters. SPSS reports that the minimum distance between initial centers is 34. The result of the iteration process for determining the cluster centers can be seen in Table 4.

The results of the clustering can be seen in Figure 3.

The results of the clustering are explained as follows:

🤷‍📡Cluster 1: this cluster has the highest number of members, namely 1479 websites. It is the cluster with the lowest level of user traffic, ranging from 1–10 visits per website. Its members are mostly advertising websites.

📕Cluster 2: 126 websites were included in this cluster. It falls into the medium category because its values are higher than the average produced by the clustering process, ranging from 11–13 visits per website. This cluster contains mostly information and news sites.

📡 Cluster 3: this cluster has the fewest members, with only 33 websites. However, it has the highest traffic levels compared to the other clusters. Values in this cluster range from 34–64 visits per website. This cluster contains mostly search engine and social media websites.

Categories Based on Website Content:

This section explains the categories based on website content found in the research data. The categorization of websites is taken from various sources on the Internet. Across the 1638 websites obtained, 22 types of websites could be categorized.

Figure 5 shows the types of websites that were successfully categorized.

These results are used to determine the categories of websites that are frequently accessed by Internet users in the educational institution, which helps in drawing conclusions during the cyber-profiling process.

Analysis Result:

In this study, the network log data was obtained from an educational institution.

The data is divided into three categories: low, medium, and high. The categorization was performed using the K-Means algorithm as implemented in SPSS and RapidMiner. The clustering results showed that the Internet at the educational institution is used mainly to access search engines, information websites, and social media websites. This differs slightly from the results of a survey [2], which states that the Internet is used in this order: networks (social media), information search, chat (messaging), news search, video, and email.

With that, we have come to the end of our article. 🚚🚚 Hope you enjoyed reading it.

See you in the next article on another topic.
