SharePoint Online: Find Duplicate Files using PowerShell

Requirement: Find Duplicate Documents in SharePoint Online.

When multiple people from different teams work together, duplicate content in SharePoint is almost inevitable. People may upload the same document to several libraries, or to different folders within the same document library. So, how do you find duplicate documents in SharePoint Online? In this blog post, I will show you how to locate duplicate files in SharePoint Online.


SharePoint Online: Find Duplicate Documents using PowerShell – File Hash Method

How do you find duplicate files in SharePoint Online? Let's find duplicate files in a SharePoint Online document library by comparing their file hashes:

#Load SharePoint CSOM Assemblies
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"
 
#Parameters
$SiteURL = "https://Crescent.sharepoint.com/sites/marketing"
$ListName ="Branding"
 
#Array to Store Results Data
$DataCollection = @() 
 
#Get credentials to connect
$Cred = Get-Credential
 
Try {
    #Setup the Context
    $Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)
    $Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)
 
    #Get the Web and List
    $Web = $Ctx.Web
    $Ctx.Load($Web)
    $List = $Ctx.Web.Lists.GetByTitle($ListName)
    $Ctx.Load($List)
    $Ctx.ExecuteQuery()

    #Define Query to get List Items in batch
    $BatchSize = 2000
    $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
    $Query.ViewXml = @"
    <View Scope='RecursiveAll'>
        <Query>
            <OrderBy><FieldRef Name='ID' Ascending='TRUE'/></OrderBy>
        </Query>
        <RowLimit Paged="TRUE">$BatchSize</RowLimit>
    </View>
"@

    #Get List Items in Batch
    $Count=1
    Do
    {
        $ListItems = $List.GetItems($Query)
        $Ctx.Load($ListItems)
        $Ctx.ExecuteQuery()
        
        #Process all items in the batch    
        ForEach($Item in $ListItems)
        {
            #Filter Files
            If($Item.FileSystemObjectType -eq "File")
            {
                #Get the File from Item
                $File = $Item.File
                $Ctx.Load($File)
                $Ctx.ExecuteQuery()
                Write-Progress -PercentComplete ($Count / $List.ItemCount * 100) -Activity "Processing File $count of $($List.ItemCount)" -Status "Scanning File '$($File.Name)'"
 
                #Get The File Hash
                $Bytes = $File.OpenBinaryStream()
                $Ctx.ExecuteQuery()
                $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider 
                $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value)) 
 
                #Collect data        
                $Data = New-Object PSObject 
                $Data | Add-Member -MemberType NoteProperty -name "File Name" -value $File.Name
                $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode
                $Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl
                $DataCollection += $Data
            }
            $Count++
        }
        $Query.ListItemCollectionPosition = $ListItems.ListItemCollectionPosition
    }While($Query.ListItemCollectionPosition -ne $null)
    
    #Get Duplicate Files
    $Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
    If($Duplicates.Count -gt 0)
    {
        $Duplicates | Out-GridView
    }
    Else
    {
        Write-host -f Yellow "No Duplicates Found!"
    }
}
Catch {
    write-host -f Red "Error:" $_.Exception.Message
}

However, this method may not work reliably for Office documents like .docx, .pptx, .xlsx, etc., because SharePoint stores the metadata for Office documents within the document itself, whereas for other file types, the metadata is stored in the SharePoint content database. So, when you upload the same Office document twice, embedded metadata like “Created On” differs between the copies, and they produce different hashes!
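To see why hashing detects duplicates (and why even a one-character difference defeats the comparison), here is a small, self-contained sketch that applies the same MD5 comparison used in the script to in-memory strings. The sample strings are made up for illustration:

```powershell
#Create an MD5 hasher, as in the script above
$MD5 = [System.Security.Cryptography.MD5]::Create()

#Two byte sequences with identical content, and one with a one-character difference
$BytesA = [System.Text.Encoding]::UTF8.GetBytes("Hello SharePoint")
$BytesB = [System.Text.Encoding]::UTF8.GetBytes("Hello SharePoint")
$BytesC = [System.Text.Encoding]::UTF8.GetBytes("Hello SharePoint!")

#Identical content yields identical hash strings; any content change yields a different hash
$HashA = [System.BitConverter]::ToString($MD5.ComputeHash($BytesA))
$HashB = [System.BitConverter]::ToString($MD5.ComputeHash($BytesB))
$HashC = [System.BitConverter]::ToString($MD5.ComputeHash($BytesC))
```

This is exactly why Office documents trip up the hash method: SharePoint rewrites their embedded metadata on upload, so the stored bytes (and therefore the hashes) differ even for "the same" file.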

PowerShell to Find All Duplicate Files in a Site (Compare Hash, File Name, and File Size)

This PowerShell script scans all files in all document libraries of a site, extracts the File Name, File Hash, and File Size for comparison, and outputs a CSV report with all the data.

#Load SharePoint CSOM Assemblies
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"

#Parameters
$SiteURL = "https://Crescent.sharepoint.com/sites/marketing"
$CSVPath = "C:\Temp\Duplicates.csv"
$BatchSize = 2000
#Array for Result Data
$DataCollection = @()

#Get credentials to connect
$Cred = Get-Credential

Try {
    #Setup the Context
    $Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)
    $Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)

    #Get the Web
    $Web = $Ctx.Web
    $Lists = $Web.Lists
    $Ctx.Load($Web)
    $Ctx.Load($Lists)
    $Ctx.ExecuteQuery()

    #Iterate through Each List on the web
    ForEach($List in $Lists)
    {
        #Filter Lists
        If($List.BaseType -eq "DocumentLibrary" -and $List.Hidden -eq $False  -and $List.ItemCount -gt 0 -and $List.Title -Notin("Site Pages","Style Library", "Preservation Hold Library"))
        {
            #Define CAML Query to get Files from the list in batches
            $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
            $Query.ViewXml = @"
                <View Scope='RecursiveAll'>
                    <Query>
                        <OrderBy><FieldRef Name='ID' Ascending='TRUE'/></OrderBy>
                    </Query>
                    <RowLimit Paged='TRUE'>$BatchSize</RowLimit>
                </View>
"@

            $Counter = 1
            #Get Files from the Library in Batches
            Do {
                $ListItems = $List.GetItems($Query)
                $Ctx.Load($ListItems)
                $Ctx.ExecuteQuery()

                ForEach($Item in $ListItems)
                {
                    #Filter Files
                    If($Item.FileSystemObjectType -eq "File")
                    {
                        #Get the File from Item
                        $File = $Item.File
                        $Ctx.Load($File)
                        $Ctx.ExecuteQuery()
                        Write-Progress -PercentComplete ($Counter / $List.ItemCount * 100) -Activity "Processing File $Counter of $($List.ItemCount) in $($List.Title) of $($Web.URL)" -Status "Scanning File '$($File.Name)'" 

                        #Get The File Hash
                        $Bytes = $File.OpenBinaryStream()
                        $Ctx.ExecuteQuery()
                        $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider 
                        $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value)) 

                        #Collect data        
                        $Data = New-Object PSObject 
                        $Data | Add-Member -MemberType NoteProperty -name "FileName" -value $File.Name
                        $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode
                        $Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl
                        $Data | Add-Member -MemberType NoteProperty -Name "FileSize" -value $File.Length        
                        $DataCollection += $Data
                    }
                    $Counter++
                }
                #Update the ListItemCollectionPosition for the next batch
                $Query.ListItemCollectionPosition = $ListItems.ListItemCollectionPosition
            }While($Query.ListItemCollectionPosition -ne $null)
        }
    }
    #Export All Data to CSV
    $DataCollection | Export-Csv -Path $CSVPath -NoTypeInformation
    Write-host -f Green "Files Inventory has been Exported to $CSVPath"

    #Get Duplicate Files by Grouping Hash code
    $Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
    Write-host "Duplicate Files Based on File Hashcode:"
    $Duplicates | Format-table -AutoSize

    #Group Based on File Name
    $FileNameDuplicates = $DataCollection | Group-Object -Property FileName | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
    Write-host "Potential Duplicate Based on File Name:"
    $FileNameDuplicates| Format-table -AutoSize

    #Group Based on File Size
    $FileSizeDuplicates = $DataCollection | Group-Object -Property FileSize | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
    Write-host "Potential Duplicates Based on File Size:"
    $FileSizeDuplicates| Format-table -AutoSize
}
Catch {
    write-host -f Red "Error:" $_.Exception.Message
}

This can be a helpful tool when you are trying to clean up your SharePoint environment and free up some storage space.
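Once the CSV report is exported, you can re-group the inventory offline on a combination of columns. Here is a minimal, self-contained sketch (the file names, sizes, and URLs are made-up sample rows) that flags files matching on both File Name and File Size; combining keys reduces false positives compared to grouping on a single column:

```powershell
#Sample inventory rows, shaped like the objects collected by the script above (hypothetical data)
$DataCollection = @(
    [PSCustomObject]@{FileName="Logo.png";  FileSize=1024; URL="/sites/marketing/Branding/Logo.png"}
    [PSCustomObject]@{FileName="Logo.png";  FileSize=1024; URL="/sites/marketing/Shared Documents/Logo.png"}
    [PSCustomObject]@{FileName="Plan.docx"; FileSize=2048; URL="/sites/marketing/Shared Documents/Plan.docx"}
)

#Group on the combination of File Name AND File Size; only groups with more than one member are duplicates
$Duplicates = $DataCollection | Group-Object -Property FileName, FileSize |
    Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
$Duplicates | Format-Table -AutoSize
```

The same pattern works against the real report: replace the sample array with `Import-Csv -Path $CSVPath`.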

PnP PowerShell to Find Duplicate Files in a SharePoint Online Site

This time, let’s use PnP PowerShell to scan and find duplicate files from all document libraries in a site and export the results to a CSV file!

#Parameters
$SiteURL = "https://Crescent.sharepoint.com/sites/Purchase"
$Pagesize = 2000
$ReportOutput = "C:\Temp\Duplicates.csv"

#Connect to SharePoint Online site
Connect-PnPOnline $SiteURL -Interactive
 
#Array to store results
$DataCollection = @()

#Get all Document libraries
$DocumentLibraries = Get-PnPList | Where-Object {$_.BaseType -eq "DocumentLibrary" -and $_.Hidden -eq $false -and $_.ItemCount -gt 0 -and $_.Title -Notin("Site Pages","Style Library", "Preservation Hold Library")}

#Iterate through each document library
ForEach($Library in $DocumentLibraries)
{    
    #Get All documents from the library
    $global:counter = 0;
    $Documents = Get-PnPListItem -List $Library -PageSize $Pagesize -Fields ID, File_x0020_Type -ScriptBlock `
        { Param($items) $global:counter += $items.Count; Write-Progress -PercentComplete ($global:Counter / ($Library.ItemCount) * 100) -Activity `
             "Getting Documents from Library '$($Library.Title)'" -Status "Getting Documents data $global:Counter of $($Library.ItemCount)";} | Where {$_.FileSystemObjectType -eq "File"}
  
    $ItemCounter = 0
    #Iterate through each document
    Foreach($Document in $Documents)
    {
        #Get the File from Item
        $File = Get-PnPProperty -ClientObject $Document -Property File

        #Get The File Hash
        $Bytes = $File.OpenBinaryStream()
        Invoke-PnPQuery
        $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
        $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))
 
        #Collect data        
        $Data = New-Object PSObject 
        $Data | Add-Member -MemberType NoteProperty -name "FileName" -value $File.Name
        $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -value $HashCode
        $Data | Add-Member -MemberType NoteProperty -Name "URL" -value $File.ServerRelativeUrl
        $Data | Add-Member -MemberType NoteProperty -Name "FileSize" -value $File.Length        
        $DataCollection += $Data
        $ItemCounter++
        Write-Progress -PercentComplete ($ItemCounter / ($Library.ItemCount) * 100) -Activity "Collecting data from Documents $ItemCounter of $($Library.ItemCount) from $($Library.Title)" `
                     -Status "Reading Data from Document '$($Document['FileLeafRef'])' at '$($Document['FileRef'])'"
    }
}
#Get Duplicate Files by Grouping Hash code
$Duplicates = $DataCollection | Group-Object -Property HashCode | Where {$_.Count -gt 1}  | Select -ExpandProperty Group
Write-host "Duplicate Files Based on File Hashcode:"
$Duplicates | Format-table -AutoSize

#Export the duplicates results to CSV
$Duplicates | Export-Csv -Path $ReportOutput -NoTypeInformation
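After identifying duplicates, you may want to decide which copies to keep. Here is a minimal, self-contained sketch (with made-up sample data) that keeps the first file in each hash group and flags the rest as removal candidates. Review the candidates carefully before deleting anything, e.g., with Remove-PnPFile:

```powershell
#Hypothetical inventory with two files sharing the same hash code
$DataCollection = @(
    [PSCustomObject]@{FileName="Logo.png";      HashCode="AA-BB"; URL="/sites/purchase/Docs/Logo.png"}
    [PSCustomObject]@{FileName="Logo-Copy.png"; HashCode="AA-BB"; URL="/sites/purchase/Docs/Logo-Copy.png"}
    [PSCustomObject]@{FileName="Plan.docx";     HashCode="CC-DD"; URL="/sites/purchase/Docs/Plan.docx"}
)

#For each hash group with more than one file, keep the first and flag the rest as candidates for removal
$RemovalCandidates = $DataCollection | Group-Object -Property HashCode |
    Where-Object {$_.Count -gt 1} |
    ForEach-Object { $_.Group | Select-Object -Skip 1 }
$RemovalCandidates | Format-Table -AutoSize
```

Note that "first file in the group" is an arbitrary keep rule; you could instead sort each group by created date or URL before skipping.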

Salaudeen Rajack

Salaudeen Rajack is a SharePoint Architect with two decades of SharePoint experience. He loves sharing his knowledge and experiences with the SharePoint community through his real-world articles!

11 thoughts on “SharePoint Online: Find Duplicate Files using PowerShell”

  • Hi Sir,

    Can you publish another script to verify duplicate files in all sites (Tenant level)?

    Thanks,
    S

  • Any chance anyone already edited this script to run in batches?

  • Hi Salaudeen,
    First, your website is a gold mine.

    Second, Have You this script with pnp ?

    Thanks

  • How would you pull a report for a site with more than 5,000 items?

      • First at all, thanks for your job, it’s amazing. Second, how can we change the code above to break the 5000 items limitation? I tried with your post (SharePoint Online: How to Get All List Items from Large Lists ( >5000 Items)), but I’m not able.
        Thank you!

    • Hi

      Thanks for this useful article!
      For more than 5000 objects, can also be used a quick Pnp commandlet:

      Get-PnPListItem -List $List -PageSize 5000

    • Right, PnP Supports batch processing List Items natively! But in CSOM you have to change your script a bit as given in the above comment.
