import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest
from sklearn.datasets import make_classification
# Generate synthetic data with outliers
X, _ = make_classification(n_samples=1000, n_features=2, n_informative=2, n_redundant=0, n_clusters_per_class=1, flip_y=0, random_state=1)
outliers = np.random.uniform(low=-4, high=4, size=(50, 2))
X = np.vstack([X, outliers])

# Fit Isolation Forest model
clf = IsolationForest(contamination=0.05, random_state=42)
clf.fit(X)

# Predict outliers (+1 = inlier, -1 = outlier)
y_pred = clf.predict(X)

# Visualize the results
plt.figure(figsize=(10, 6))

# Plot the inliers
plt.scatter(X[y_pred == 1][:, 0], X[y_pred == 1][:, 1], c='green', label='Inliers', alpha=0.8, edgecolors='k')

# Plot the outliers
plt.scatter(X[y_pred == -1][:, 0], X[y_pred == -1][:, 1], c='red', label='Outliers', alpha=0.8, edgecolors='k')

# Plot decision boundary
xx, yy = np.meshgrid(np.linspace(X[:, 0].min(), X[:, 0].max(), 100), np.linspace(X[:, 1].min(), X[:, 1].max(), 100))
Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
plt.contour(xx, yy, Z, levels=[0], linewidths=2, colors='black')

plt.title('Isolation Forest for Outlier Detection')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
Anomalies, by definition, are data points that deviate significantly from the majority of the dataset. Detecting these outliers is crucial in fields such as fraud detection, network security, healthcare, and more. In this blog, we will explore the concept of anomaly detection, its significance, common techniques, and real-world applications. The goal of outlier detection is to separate a core of regular observations from those that are irregular.
Why Detect Outliers?
In machine learning, data cleaning and preprocessing are essential steps in understanding your data. Running ML algorithms on data that still contains outliers can produce less effective and less useful models. At the same time, it is essential to understand the context of your dataset in order to distinguish true outliers from genuine shifts in the underlying trend.
Common Techniques for Anomaly Detection:
Statistical Methods: Statistical approaches involve setting thresholds based on mean, median, standard deviation, or other statistical measures. Data points deviating beyond these thresholds are considered anomalies.
Machine Learning Models: Models such as isolation forests, one-class SVMs, and autoencoders can be trained, typically in an unsupervised fashion, to distinguish normal from anomalous data points.
Clustering Algorithms: Clustering techniques, like k-means, can flag outliers as points that lie far from every cluster centroid or that fall into very small clusters.
Density-Based Methods: Algorithms like DBSCAN (Density-Based Spatial Clustering of Applications with Noise) identify outliers based on low-density regions in the data space.
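As a concrete illustration of the statistical approach listed first above, here is a minimal sketch of z-score thresholding. The helper name `zscore_outliers` and the threshold of 3 standard deviations are illustrative choices, not a fixed standard; in practice the threshold depends on your data and tolerance for false positives.

```python
import numpy as np

def zscore_outliers(data, threshold=3.0):
    # Flag points whose absolute z-score exceeds the threshold.
    # (Illustrative helper; threshold=3.0 is a common but arbitrary choice.)
    z = np.abs((data - data.mean()) / data.std())
    return z > threshold

# 1,000 well-behaved samples plus one injected, obvious outlier
rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(loc=0, scale=1, size=1000), [10.0]])

mask = zscore_outliers(data)
print(data[mask])  # the injected outlier (and any chance extremes) are flagged
```

Note that the mean and standard deviation are themselves sensitive to outliers, which is why robust variants (e.g. using the median and median absolute deviation) are often preferred on heavily contaminated data.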