SharePoint Online: Find Duplicate Files using PowerShell
Requirement: Find Duplicate Documents in SharePoint Online.
When multiple people from different teams work together, duplicate content in SharePoint becomes almost inevitable. People may upload the same document to various libraries, or even to different folders within the same document library. Duplicate files take up valuable storage space and make it difficult to find the correct version of a file. So, how do you find duplicate documents in SharePoint Online? In this blog post, I will show you how to locate duplicate files in SharePoint Online.
SharePoint Online: Find Duplicate Documents using PowerShell – File Hash Method
Let's find duplicate files in a SharePoint Online document library by comparing file hashes:
#Load SharePoint CSOM Assemblies
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"
#Parameters
$SiteURL = "https://Crescent.sharepoint.com/sites/marketing"
$ListName ="Branding"
#Array to store Results Data
$DataCollection = @()
#Get credentials to connect
$Cred = Get-Credential
Try {
    #Setup the Context
    $Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)
    $Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)

    #Get the Web and List
    $Web = $Ctx.Web
    $Ctx.Load($Web)
    $List = $Ctx.Web.Lists.GetByTitle($ListName)
    $Ctx.Load($List)
    $Ctx.ExecuteQuery()

    #Define Query to get List Items in batches
    $BatchSize = 2000
    $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
    $Query.ViewXml = @"
<View Scope='RecursiveAll'>
    <Query>
        <OrderBy><FieldRef Name='ID' Ascending='TRUE'/></OrderBy>
    </Query>
    <RowLimit Paged="TRUE">$BatchSize</RowLimit>
</View>
"@

    #Get List Items in Batches
    $Count = 1
    Do
    {
        $ListItems = $List.GetItems($Query)
        $Ctx.Load($ListItems)
        $Ctx.ExecuteQuery()

        #Process all items in the batch
        ForEach($Item in $ListItems)
        {
            #Filter Files
            If($Item.FileSystemObjectType -eq "File")
            {
                #Get the File from the Item
                $File = $Item.File
                $Ctx.Load($File)
                $Ctx.ExecuteQuery()
                Write-Progress -PercentComplete ($Count / $List.ItemCount * 100) -Activity "Processing File $Count of $($List.ItemCount)" -Status "Scanning File '$($File.Name)'"

                #Get the File Hash
                $Bytes = $File.OpenBinaryStream()
                $Ctx.ExecuteQuery()
                $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
                $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))

                #Collect data
                $Data = New-Object PSObject
                $Data | Add-Member -MemberType NoteProperty -Name "File Name" -Value $File.Name
                $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -Value $HashCode
                $Data | Add-Member -MemberType NoteProperty -Name "URL" -Value $File.ServerRelativeUrl
                $DataCollection += $Data
            }
            $Count++
        }
        #Update the position for the next batch
        $Query.ListItemCollectionPosition = $ListItems.ListItemCollectionPosition
    } While($Query.ListItemCollectionPosition -ne $null)

    #Get Duplicate Files by grouping on the Hash
    $Duplicates = $DataCollection | Group-Object -Property HashCode | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    If($Duplicates.Count -gt 0)
    {
        $Duplicates | Out-GridView
    }
    Else
    {
        Write-host -f Yellow "No Duplicates Found!"
    }
}
Catch {
    Write-host -f Red "Error:" $_.Exception.Message
}
However, this method does not work reliably for Office documents like .docx, .pptx, .xlsx, etc. The metadata for Office documents in SharePoint is stored within the document itself, whereas for other file types, the metadata is stored in the SharePoint content database. So, when you upload the same Office document twice, metadata like "Created On" differs between the copies, and so do their file hashes!
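As a workaround for Office formats, you can fall back to grouping on file name and size instead of the binary hash. Here is a minimal sketch, assuming you also capture a "FileSize" property for each file (e.g., from $File.Length, as the next script does):

#Fallback for Office formats: group by file name + size instead of the binary hash
#Assumes each record in $DataCollection carries "File Name" and "FileSize" properties
$DataCollection | Group-Object -Property "File Name", "FileSize" | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group

This won't catch renamed copies, but it is immune to the metadata stamping that changes the hash of Office files.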
PowerShell to Find All Duplicate Files in a Site (Compare Hash, File Name, and File Size)
This PowerShell script scans all files in all document libraries in a site, extracts the File Name, File Hash, and File Size of each for comparison, and outputs a CSV report with all the data.
#Load SharePoint CSOM Assemblies
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"
#Parameters
$SiteURL = "https://Crescent.sharepoint.com/sites/marketing"
$CSVPath = "C:\Temp\Duplicates.csv"
$BatchSize = 2000
#Array for Result Data
$DataCollection = @()
#Get credentials to connect
$Cred = Get-Credential
Try {
    #Setup the Context
    $Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)
    $Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)

    #Get the Web and its Lists
    $Web = $Ctx.Web
    $Lists = $Web.Lists
    $Ctx.Load($Web)
    $Ctx.Load($Lists)
    $Ctx.ExecuteQuery()

    #Iterate through each List on the web
    ForEach($List in $Lists)
    {
        #Filter Lists: visible, non-empty document libraries, excluding system libraries
        If($List.BaseType -eq "DocumentLibrary" -and $List.Hidden -eq $False -and $List.ItemCount -gt 0 -and $List.Title -Notin("Site Pages","Style Library", "Preservation Hold Library"))
        {
            #Define CAML Query to get Files from the list in batches
            $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
            $Query.ViewXml = @"
<View Scope='RecursiveAll'>
    <Query>
        <OrderBy><FieldRef Name='ID' Ascending='TRUE'/></OrderBy>
    </Query>
    <RowLimit Paged='TRUE'>$BatchSize</RowLimit>
</View>
"@
            $Counter = 1
            #Get Files from the Library in Batches
            Do {
                $ListItems = $List.GetItems($Query)
                $Ctx.Load($ListItems)
                $Ctx.ExecuteQuery()

                ForEach($Item in $ListItems)
                {
                    #Filter Files
                    If($Item.FileSystemObjectType -eq "File")
                    {
                        #Get the File from the Item
                        $File = $Item.File
                        $Ctx.Load($File)
                        $Ctx.ExecuteQuery()
                        Write-Progress -PercentComplete ($Counter / $List.ItemCount * 100) -Activity "Processing File $Counter of $($List.ItemCount) in $($List.Title) of $($Web.URL)" -Status "Scanning File '$($File.Name)'"

                        #Get the File Hash
                        $Bytes = $File.OpenBinaryStream()
                        $Ctx.ExecuteQuery()
                        $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
                        $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))

                        #Collect data
                        $Data = New-Object PSObject
                        $Data | Add-Member -MemberType NoteProperty -Name "FileName" -Value $File.Name
                        $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -Value $HashCode
                        $Data | Add-Member -MemberType NoteProperty -Name "URL" -Value $File.ServerRelativeUrl
                        $Data | Add-Member -MemberType NoteProperty -Name "FileSize" -Value $File.Length
                        $DataCollection += $Data
                    }
                    $Counter++
                }
                #Update the Position of the ListItemCollectionPosition for the next batch
                $Query.ListItemCollectionPosition = $ListItems.ListItemCollectionPosition
            } While($Query.ListItemCollectionPosition -ne $null)
        }
    }

    #Export All Data to CSV
    $DataCollection | Export-Csv -Path $CSVPath -NoTypeInformation
    Write-host -f Green "Files Inventory has been Exported to $CSVPath"

    #Get Duplicate Files by Grouping Hash code
    $Duplicates = $DataCollection | Group-Object -Property HashCode | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    Write-host "Duplicate Files Based on File Hashcode:"
    $Duplicates | Format-Table -AutoSize

    #Group Based on File Name
    $FileNameDuplicates = $DataCollection | Group-Object -Property FileName | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    Write-host "Potential Duplicates Based on File Name:"
    $FileNameDuplicates | Format-Table -AutoSize

    #Group Based on File Size
    $FileSizeDuplicates = $DataCollection | Group-Object -Property FileSize | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    Write-host "Potential Duplicates Based on File Size:"
    $FileSizeDuplicates | Format-Table -AutoSize
}
Catch {
    Write-host -f Red "Error:" $_.Exception.Message
}
This can be a helpful tool when you are trying to clean up your SharePoint environment and free up some storage space.
PnP PowerShell to Find Duplicate Files in a SharePoint Online Site
This time, let’s use PnP PowerShell to scan and find duplicate files from all document libraries in a site and export the results to a CSV file!
#Parameters
$SiteURL = "https://Crescent.sharepoint.com/sites/Purchase"
$Pagesize = 2000
$ReportOutput = "C:\Temp\Duplicates.csv"
#Connect to SharePoint Online site
Connect-PnPOnline $SiteURL -Interactive
#Array to store results
$DataCollection = @()
#Get all Document libraries
$DocumentLibraries = Get-PnPList | Where-Object {$_.BaseType -eq "DocumentLibrary" -and $_.Hidden -eq $false -and $_.ItemCount -gt 0 -and $_.Title -Notin("Site Pages","Style Library", "Preservation Hold Library")}
#Iterate through each document library
ForEach($Library in $DocumentLibraries)
{
    #Get All documents from the library
    $global:counter = 0
    $Documents = Get-PnPListItem -List $Library -PageSize $Pagesize -Fields ID, File_x0020_Type -ScriptBlock `
        { Param($items) $global:counter += $items.Count; Write-Progress -PercentComplete ($global:Counter / ($Library.ItemCount) * 100) -Activity `
        "Getting Documents from Library '$($Library.Title)'" -Status "Getting Documents data $global:Counter of $($Library.ItemCount)";} | Where-Object {$_.FileSystemObjectType -eq "File"}
    $ItemCounter = 0

    #Iterate through each document
    ForEach($Document in $Documents)
    {
        #Get the File from the Item
        $File = Get-PnPProperty -ClientObject $Document -Property File

        #Get the File Hash
        $Bytes = $File.OpenBinaryStream()
        Invoke-PnPQuery
        $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
        $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))

        #Collect data
        $Data = New-Object PSObject
        $Data | Add-Member -MemberType NoteProperty -Name "FileName" -Value $File.Name
        $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -Value $HashCode
        $Data | Add-Member -MemberType NoteProperty -Name "URL" -Value $File.ServerRelativeUrl
        $Data | Add-Member -MemberType NoteProperty -Name "FileSize" -Value $File.Length
        $DataCollection += $Data
        $ItemCounter++
        Write-Progress -PercentComplete ($ItemCounter / ($Library.ItemCount) * 100) -Activity "Collecting data from Documents $ItemCounter of $($Library.ItemCount) from $($Library.Title)" `
            -Status "Reading Data from Document '$($Document['FileLeafRef'])' at '$($Document['FileRef'])'"
    }
}

#Get Duplicate Files by Grouping Hash code
$Duplicates = $DataCollection | Group-Object -Property HashCode | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
Write-host "Duplicate Files Based on File Hashcode:"
$Duplicates | Format-Table -AutoSize

#Export the duplicates results to CSV
$Duplicates | Export-Csv -Path $ReportOutput -NoTypeInformation
In summary, finding duplicate files in SharePoint Online can be done using PowerShell scripts, as explained above. It's worth noting that you need permission to access the site and its files before you start, and, depending on the volume of files, the process may take a long time.
So, I had this script running perfectly and left it running overnight, as there are approx. 80,000+ files to scan through. Sadly, Windows updated and rebooted the machine.
Now when I try and run the script, I get the following errors:
Connect-PnPOnline : The ‘Connect-PnPOnline’ command was found in the module ‘PnP.PowerShell’, but the module could not be loaded. For more information, run
‘Import-Module PnP.PowerShell’.
At line:7 char:1
+ Connect-PnPOnline $SiteURL -Interactive
+ ~~~~~~~~~~~~~~~~~
+ CategoryInfo : ObjectNotFound: (Connect-PnPOnline:String) [], CommandNotFoundException
+ FullyQualifiedErrorId : CouldNotAutoloadMatchingModule
And when I run the Import-Module command, I get the following:-
Import-Module : Could not load file or assembly ‘System.Runtime, Version=6.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a’ or one of its dependencies. The
system cannot find the file specified.
At line:1 char:1
+ Import-Module PnP.PowerShell
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [Import-Module], FileNotFoundException
+ FullyQualifiedErrorId : System.IO.FileNotFoundException,Microsoft.PowerShell.Commands.ImportModuleCommand
I’m no expert with PowerShell, but I can’t understand why the script no longer works 🙁
Can you please help?
Conrad
That is a great script, but how do we ensure we delete these via a PowerShell script without deleting all of the copies? I have run it, and it has found 300K duplicate files.
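One cautious approach: keep the first file in each duplicate group and send the rest to the recycle bin. A rough, untested sketch against the $DataCollection built by the PnP script above (review the CSV carefully before running anything destructive):

#Sketch: keep the first file in each duplicate group and recycle the others
$DataCollection | Group-Object -Property HashCode | Where-Object {$_.Count -gt 1} | ForEach-Object {
    #Skip the first entry so that one copy of each file survives
    $_.Group | Select-Object -Skip 1 | ForEach-Object {
        #-Recycle sends the file to the recycle bin instead of deleting it permanently
        Remove-PnPFile -ServerRelativeUrl $_.URL -Recycle -Force
    }
}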
I expect you’ve resolved this by now, but if not….
This sounds like you have the module installed for one version of PowerShell, but you are trying to run it in another version, e.g., you've got the PnP.PowerShell module installed for 7.2, but you're running in 5.1.
That’s certainly been my experience when presented with that message
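You can confirm by comparing the PowerShell version you are running against where the module is installed:

#Check the running PowerShell version
$PSVersionTable.PSVersion

#List installed copies of the PnP.PowerShell module and their install paths
Get-Module -Name PnP.PowerShell -ListAvailable | Select-Object Name, Version, Path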
The tenant I am connecting to has MFA enabled. So, when I run this script, it gives the error:
Error: Exception calling “ExecuteQuery” with “0” argument(s): “The sign-in name or password does not match one in the Microsoft account system.”
Can you please give a modified version of the script that prompts me for a web login
Use the PnP PowerShell script! The “Connect-PnPOnline $SiteURL -Interactive” can handle MFA enabled accounts.
That worked! Thanks.
However, it didn't detect the Office duplicate files like .pptx and .xls. Is there a way we could find those as well using the PnP script?
Hi There,
Great script. How can I add a column for the modified date of files?
Use: $File.TimeLastModified
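For example, alongside the other Add-Member lines in the data-collection section:

#Add a "Last Modified" column to the results (TimeLastModified comes from the loaded File object)
$Data | Add-Member -MemberType NoteProperty -Name "LastModified" -Value $File.TimeLastModified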
Hi,
When trying to use this script, it runs through the first 8-10k files or so before bombing out with the error message “Error: Exception calling “ExecuteQuery” with “0” argument(s): “The remote server returned an error: (503) Server Unavailable.”
I wonder if this is some sort of throttling. I’ve been experimenting with adding a sleep cycle to the loop. Alas, my PowerShell scripting is not the best. Any thoughts on adding a sleep cycle in the loop?
Note: I accidentally added this comment to another article of yours. I am referring to your v2 script for finding duplicates in SharePoint.
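That does look like throttling: SharePoint Online returns 503 or 429 responses when a client sends too many requests. A simple sketch, slowing the loop down and retrying once on failure:

#Inside the ForEach loop, pause briefly between files to stay under the throttling limits
Start-Sleep -Milliseconds 500

#Or wrap each query in a basic retry:
Try {
    $Ctx.ExecuteQuery()
}
Catch {
    Write-Warning "Request failed: $($_.Exception.Message). Waiting 60 seconds before retrying..."
    Start-Sleep -Seconds 60
    $Ctx.ExecuteQuery()
}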
Hello,
I tested the PnP version of the script. It works fine with pictures, .txt, .xml, and .pdf files. But with Office files (.pptx, .docx, …), the hash code is not the same. I used the “Copy to” feature to copy a file from one document library to another. I also uploaded Office documents; it works for two of them, but the site contains more copies of this file.
Can you explain what could be the reason?
Hello, I am getting this error:
Error: Exception calling “ExecuteQuery” with “0” argument(s): “Identity Client Runtime Library (IDCRL) did not get a response from the Login server.”
Do you know what could be causing this?
I’m running the PnP dupe check right now… I hope that people’s talk about PageSize and item limits doesn’t curse me. I’m about an hour in on the >50k items, looking for duplicates.
Thanks for the awesome PowerShell scripts! I am running the “PnP PowerShell to Find Duplicate Files” script against a large document library, and it has been running fine for over 5 days! However, I tried the “PowerShell to Find All Duplicate Files in a Site (Compare Hash, File Name, and File Size)” script to see how that would fare against the same document library, and it failed a couple of times after getting through about 400 files with this error: Error: Exception calling “ExecuteQuery” with “0” argument(s): “Invalid MIME content-length header encountered on read.”
Hi! I am pretty new to SharePoint Online and PowerShell. I needed to run this script to find duplicate files, so I got your script and executed it on a very large SharePoint site. I got the result, but most of the file names were blank.
I got the following errors multiple times:
At line:33 char:9
+ Invoke-PnPQuery
+ ~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [Invoke-PnPQuery], TargetInvocationException
+ FullyQualifiedErrorId : System.Reflection.TargetInvocationException,PnP.PowerShell.Commands.Base.InvokeQuery

Multiple ambiguous overloads found for “ComputeHash” and the argument count: “1”.
At line:35 char:9
+ $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash( …
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : NotSpecified: (:) [], MethodException
+ FullyQualifiedErrorId : MethodCountCouldNotFindBest

Get-PnPProperty : Målet forårsaket et unntak under aktivering. [Norwegian: “The target caused an exception during activation.”]
At line:29 char:17
+ … $File = Get-PnPProperty -ClientObject $Document -Property File
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : WriteError: (:) [Get-PnPProperty], TargetInvocationException
+ FullyQualifiedErrorId : EXCEPTION,PnP.PowerShell.Commands.Base.EnsureProperty

You cannot call a method on a null-valued expression.
At line:32 char:9
+ $Bytes = $File.OpenBinaryStream()
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (:) [], RuntimeException
+ FullyQualifiedErrorId : InvokeMethodOnNull
Could you please guide me further? Thanks for writing such an amazing script. Thanks a lot in advance.
Hi.
Thank you for always helping me with your articles.
Most scripts use a PageSize value of 2000. What does that mean?
I’m curious about the maximum value.
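PageSize is simply the number of items fetched per request while paging through the library; the maximum is 5000, which matches the SharePoint list view threshold:

#Fetch items in pages of up to 5000 at a time (the list view threshold caps the page size)
$Documents = Get-PnPListItem -List $Library -PageSize 5000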
How could I modify the PnP script to search specific libraries where I know duplicates exist?
Change:
#Get all Document libraries
$DocumentLibraries = Get-PnPList | Where-Object {$_.BaseType -eq "DocumentLibrary" -and $_.Hidden -eq $false -and $_.ItemCount -gt 0 -and $_.Title -Notin("Site Pages","Style Library", "Preservation Hold Library")}
To: $DocumentLibraries = Get-PnPList “Document Library Name”
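Or, for several known libraries (the library titles here are placeholders):

#Scan only specific libraries: replace the titles with your own
$DocumentLibraries = "Library A", "Library B" | ForEach-Object { Get-PnPList -Identity $_ }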
Hi Sir,
Can you publish another script to find duplicate files across all sites (tenant level)?
Thanks,
S
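Until such a script is published, here is a rough outline: connect to the tenant admin site, enumerate all site collections, and run the site-level scan from the PnP script above against each one (the admin URL below is an assumption based on the tenant in this post):

#Rough outline: iterate all site collections in the tenant (requires SharePoint Administrator rights)
Connect-PnPOnline -Url "https://crescent-admin.sharepoint.com" -Interactive
$Sites = Get-PnPTenantSite
ForEach($Site in $Sites)
{
    Connect-PnPOnline -Url $Site.Url -Interactive
    #...run the duplicate-file scan from the PnP script above against this site...
}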
Any chance anyone already edited this script to run in batches?
Script is updated to run in batches to handle larger libraries!
Hi Salaudeen,
First, your website is a gold mine.
Second, do you have this script in PnP PowerShell?
Thanks
Post is updated with PnP PowerShell version of the script.
How would you pull a report for a site with more than 5,000 items?
You've got to batch process with CAML! Refer here: SharePoint Online: How to Get All List Items from Large Lists ( >5000 Items)
First of all, thanks for your work; it's amazing. Second, how can we change the code above to get past the 5,000 item limitation? I tried with your post (SharePoint Online: How to Get All List Items from Large Lists ( >5000 Items)), but I was not able to get it working.
Thank you!
Post is now updated with the scripts to run in batches – to handle larger libraries!
Hi
Thanks for this useful article!
For more than 5000 objects, a quick PnP cmdlet can also be used:
Get-PnPListItem -List $List -PageSize 5000
Right, PnP supports batch processing list items natively! But in CSOM, you have to change your script a bit, as given in the above comment.