K-Compress: Simplifying Images through Color Clustering (Python)

Overview

In this project, I developed an image compression technique using K-means clustering to group similar RGB values together. The process involves analyzing the image’s pixel data and clustering colors based on their similarity. Each cluster is then replaced with its centroid, effectively reducing the number of unique colors in the image.

Github

Implementation

K-means clustering is a popular machine-learning algorithm used for partitioning a dataset into distinct groups, or clusters. The algorithm works by initializing a set number of cluster centroids (k), then iteratively assigning each data point to the nearest centroid based on a distance metric like Euclidean distance. The centroids are then recalculated as the mean of all data points assigned to that cluster. This process repeats until the centroids stabilize, resulting in clusters where data points within each cluster are more similar to each other than to those in other clusters.

K-Means Clustering: Applied the K-means algorithm to group similar RGB values, creating clusters of pixels that share similar color properties.
Color Averaging: For each cluster, calculated the centroid (average RGB value) and replaced all pixels in the cluster with this average value.
Compression: By reducing the color palette, the image’s storage size was significantly reduced without a substantial loss in visual quality. Most image formats, like JPEG and PNG, use Huffman coding to compress files by assigning shorter codes to more frequent colors. By using K-means clustering to reduce the number of unique colors in an image, you increase the frequency of these colors. This, in turn, makes Huffman coding more efficient, as it can assign even shorter codes to the now more common colors. As a result, the overall file size decreases, achieving compression without a noticeable loss in visual quality.

Results

The resulting images maintained a high level of visual fidelity while achieving substantial compression rates. The technique proved particularly effective for images with a large number of similar colors, where the reduction in color diversity did not noticeably impact the overall appearance.

Sample Images

To effectively compare the performance and visual impact of the techniques, I intentionally selected three different types of images: a cartoon, an abstract background, and a portrait of a person. Each image presents unique characteristics in terms of color distribution and detail, allowing for a thorough evaluation of how the algorithm performs across diverse scenarios. This selection helps demonstrate the versatility and robustness of the techniques under varying conditions.

After Compression

This series of images showcases the results of applying K-means clustering with different values of k, ranging from 3 to 10. The value of k represents the number of clusters, meaning each image is expressed using k distinct colors. As k increases, the images retain more detail and color variation, illustrating the trade-off between compression and visual fidelity. This progression demonstrates how varying the number of clusters impacts the overall appearance and complexity of the image.

Size of Files

Image	Original Size (KB)	Reduced Size (KB)	Size Reduction (KB)	Percentage Reduction (%)
First Image	73.1	39.4	33.7	46.10
Second Image	163.0	90.6	72.4	44.41
Third Image	90.2	35.4	54.8	60.75

The results demonstrate that K-means clustering effectively compresses images by significantly reducing their file sizes. The first image was compressed by 46.1%, the second by 44.4%, and the third by 60.8%. These substantial reductions indicate that K-means clustering can effectively decrease the storage requirements while maintaining a visually acceptable level of detail, making it a powerful technique for efficient image compression.

Conclusion

This project demonstrates the power of K-means clustering in optimizing image storage. By intelligently reducing the color space, this method offers a balance between image quality and file size, making it ideal for scenarios where storage efficiency is critical.