Abstract: In cloud environments, traditional physical servers are gradually being replaced by virtual machines, and the storage space occupied by virtual machine images hosted in cloud data centers has increased dramatically. How to manage these image files efficiently has become a research hotspot in cloud computing. First, a virtual machine image contains a large number of blank duplicate blocks, which leads to a high degree of redundancy within the image. Second, different virtual machine images may run the same operating system and applications, so there is further duplicate data across images. For a large number of virtual machine images, traditional deduplication strategies incur a large time overhead and consume substantial memory and CPU resources, which degrades data center performance. This paper proposes a multilevel deduplication method for massive virtual machine images based on an improved Simhash algorithm. The method divides a complete image file into an operating system image segment and an application data image segment, extracts feature values from each part, and uses the DBSCAN clustering algorithm to group the image segments. In this way, image segments with high similarity are grouped into one class, which decomposes global deduplication into smaller within-group deduplication tasks with higher duplication rates and allows the fingerprint index data to be held entirely in memory. The proposed algorithm greatly reduces the number of disk I/Os and shortens deduplication time.
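The pipeline described above (fingerprint each image segment, cluster similar segments, then deduplicate inside each cluster with an in-memory index) can be illustrated with a minimal Python sketch. Several parts are assumptions and not taken from the paper: it uses the standard Simhash rather than the improved variant, a small hand-written DBSCAN over Hamming distance, fixed-size 4096-byte blocks, SHA-1 block fingerprints, and hypothetical function names (`simhash`, `dbscan`, `dedup_group`, `multilevel_dedup`) and parameter values (`eps`, `min_pts`).

```python
import hashlib
from collections import deque

BLOCK_SIZE = 4096  # assumption: fixed-size chunking; the abstract gives no block size


def blocks(data: bytes, size: int = BLOCK_SIZE):
    """Split a segment into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def simhash(data: bytes, bits: int = 64) -> int:
    """Standard Simhash over the distinct blocks of a segment (not the improved variant)."""
    counts = [0] * bits
    for block in set(blocks(data)):
        h = int.from_bytes(hashlib.sha1(block).digest()[:bits // 8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def dbscan(fingerprints, eps: int, min_pts: int):
    """Minimal DBSCAN over Hamming distance; returns one label per item, -1 for noise."""
    n = len(fingerprints)
    labels = [None] * n

    def neighbors(i):
        return [j for j in range(n) if hamming(fingerprints[i], fingerprints[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(nb)
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # former noise point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:  # j is a core point, so expand through it
                queue.extend(nb_j)
    return labels


def dedup_group(segments):
    """Deduplicate blocks within one group using an in-memory fingerprint index."""
    index = {}    # block fingerprint -> stored block
    recipes = []  # per segment: ordered fingerprints needed to rebuild it
    for seg in segments:
        recipe = []
        for block in blocks(seg):
            fp = hashlib.sha1(block).digest()
            index.setdefault(fp, block)
            recipe.append(fp)
        recipes.append(recipe)
    return index, recipes


def multilevel_dedup(segments, eps: int = 3, min_pts: int = 2):
    """Cluster segments by Simhash similarity, then deduplicate each cluster separately."""
    labels = dbscan([simhash(s) for s in segments], eps, min_pts)
    groups = {}
    for idx, label in enumerate(labels):
        key = label if label != -1 else ("noise", idx)  # noise segments stand alone
        groups.setdefault(key, []).append(segments[idx])
    return {key: dedup_group(members) for key, members in groups.items()}
```

Because each group keeps its own small index, a lookup never has to consult a global on-disk fingerprint table, which is the intuition behind the reduced disk I/O claimed above. In practice the operating system segments and application data segments would be passed through `multilevel_dedup` separately, and `eps` and `min_pts` would need tuning on real images.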