SharePoint Online: Find Duplicate Files using PowerShell
Requirement: Find Duplicate Documents in SharePoint Online.
When people from different teams work together, duplicate content in SharePoint becomes very likely. The same document may have been uploaded to different libraries, or even to different folders within the same document library. So, how do you find duplicate documents in SharePoint Online?
SharePoint Online: Find Duplicate Documents using PowerShell - File Hash Method
How do you find duplicate files in SharePoint Online? Let's find duplicate files in a SharePoint Online document library by comparing file hashes:
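Before diving into the script, here is the idea in miniature: two files with identical content always produce the same hash value, regardless of their names or locations, so grouping files by hash reveals duplicates. A minimal local sketch using the built-in Get-FileHash cmdlet (the file paths here are hypothetical):

#Hash two local copies of a document (hypothetical paths)
$Hash1 = Get-FileHash -Path "C:\Temp\Report.docx" -Algorithm MD5
$Hash2 = Get-FileHash -Path "C:\Temp\Copy of Report.docx" -Algorithm MD5

#Identical content produces identical hashes, regardless of file name or location
If($Hash1.Hash -eq $Hash2.Hash) { Write-Host "Duplicate content!" }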
#Load SharePoint CSOM Assemblies
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"

#Parameters
$SiteURL = "https://crescenttech.sharepoint.com"
$ListName = "Documents"

#Array to hold results data
$DataCollection = @()

#Get credentials to connect
$Cred = Get-Credential

Try {
    #Setup the context
    $Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)
    $Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)

    #Get the Web
    $Web = $Ctx.Web
    $Ctx.Load($Web)

    #Get all list items from the library - exclude "Folder" objects
    $List = $Ctx.Web.Lists.GetByTitle($ListName)
    $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
    $Query.ViewXml = "<View Scope='RecursiveAll'><Query><Where><Eq><FieldRef Name='FSObjType'/><Value Type='Integer'>0</Value></Eq></Where></Query></View>"
    $ListItems = $List.GetItems($Query)
    $Ctx.Load($ListItems)
    $Ctx.ExecuteQuery()

    $Count = 1
    ForEach($Item in $ListItems)
    {
        #Get the file from the item
        $File = $Item.File
        $Ctx.Load($File)
        $Ctx.ExecuteQuery()
        Write-Progress -PercentComplete ($Count / $ListItems.Count * 100) -Activity "Processing File $Count of $($ListItems.Count)" -Status "Scanning File '$($File.Name)'"

        #Get the file hash
        $Bytes = $Item.File.OpenBinaryStream()
        $Ctx.ExecuteQuery()
        $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
        $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))

        #Collect data
        $Data = New-Object PSObject
        $Data | Add-Member -MemberType NoteProperty -Name "File Name" -Value $File.Name
        $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -Value $HashCode
        $Data | Add-Member -MemberType NoteProperty -Name "URL" -Value $File.ServerRelativeUrl
        $DataCollection += $Data
        $Count++
    }
    #$DataCollection

    #Get duplicate files by grouping the hash code
    $Duplicates = $DataCollection | Group-Object -Property HashCode | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    If($Duplicates.Count -gt 1)
    {
        $Duplicates | Out-GridView
    }
    Else
    {
        Write-Host -f Yellow "No Duplicates Found!"
    }
}
Catch {
    Write-Host -f Red "Error:" $_.Exception.Message
}

However, this method does not work for Office documents like .docx, .pptx, .xlsx, etc., because the metadata for Office documents in SharePoint is stored within the document itself, whereas for other document types the metadata is stored in the SharePoint content database. So, when you upload the same Office document twice, embedded metadata such as "Created On" differs, and so do the file hashes!
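As a quick workaround for that Office document caveat, you can group the inventory by file name instead of hash. Here is a minimal sketch that reuses the $DataCollection variable built by the script above; the fuller script in the next section adds file size as a third signal:

#Fallback: group by file name when hashes differ for re-uploaded Office documents
$NameMatches = $DataCollection | Group-Object -Property "File Name" | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
$NameMatches | Format-Table -AutoSize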
PowerShell to Find All Duplicate Files in a Site (Compare Hash, File Name and File Size)
This PowerShell script scans all files in all document libraries of a site, extracts the File Name, File Hash, and File Size for comparison, and outputs a CSV report with all the data.
#Load SharePoint CSOM Assemblies
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.dll"
Add-Type -Path "C:\Program Files\Common Files\Microsoft Shared\Web Server Extensions\16\ISAPI\Microsoft.SharePoint.Client.Runtime.dll"

#Parameters
$SiteURL = "https://crescenttech.sharepoint.com"
$CSVPath = "C:\Temp\FilesInventory.csv"

#Array for result data
$DataCollection = @()

#Get credentials to connect
$Cred = Get-Credential

Try {
    #Setup the context
    $Ctx = New-Object Microsoft.SharePoint.Client.ClientContext($SiteURL)
    $Ctx.Credentials = New-Object Microsoft.SharePoint.Client.SharePointOnlineCredentials($Cred.UserName, $Cred.Password)

    #Get the Web and its lists
    $Web = $Ctx.Web
    $Lists = $Web.Lists
    $Ctx.Load($Web)
    $Ctx.Load($Lists)
    $Ctx.ExecuteQuery()

    #Iterate through each list on the web
    ForEach($List in $Lists)
    {
        #Filter lists: visible document libraries only, excluding "Site Pages"
        If($List.BaseType -eq "DocumentLibrary" -and $List.Hidden -eq $False -and $List.Title -ne "Site Pages")
        {
            #Get all list items from the library - exclude "Folder" objects
            $Query = New-Object Microsoft.SharePoint.Client.CamlQuery
            $Query.ViewXml = "<View Scope='RecursiveAll'><Query><Where><Eq><FieldRef Name='FSObjType'/><Value Type='Integer'>0</Value></Eq></Where></Query></View>"
            $ListItems = $List.GetItems($Query)
            $Ctx.Load($ListItems)
            $Ctx.ExecuteQuery()

            $Count = 1
            ForEach($Item in $ListItems)
            {
                #Get the file from the item
                $File = $Item.File
                $Ctx.Load($File)
                $Ctx.ExecuteQuery()
                Write-Progress -PercentComplete ($Count / $ListItems.Count * 100) -Activity "Processing File $Count of $($ListItems.Count) in $($List.Title) of $($Web.URL)" -Status "Scanning File '$($File.Name)'"

                #Get the file hash
                $Bytes = $Item.File.OpenBinaryStream()
                $Ctx.ExecuteQuery()
                $MD5 = New-Object -TypeName System.Security.Cryptography.MD5CryptoServiceProvider
                $HashCode = [System.BitConverter]::ToString($MD5.ComputeHash($Bytes.Value))

                #Collect data
                $Data = New-Object PSObject
                $Data | Add-Member -MemberType NoteProperty -Name "FileName" -Value $File.Name
                $Data | Add-Member -MemberType NoteProperty -Name "HashCode" -Value $HashCode
                $Data | Add-Member -MemberType NoteProperty -Name "URL" -Value $File.ServerRelativeUrl
                $Data | Add-Member -MemberType NoteProperty -Name "FileSize" -Value $File.Length
                $DataCollection += $Data
                $Count++
            }
        }
    }

    #Export all data to CSV
    $DataCollection | Export-Csv -Path $CSVPath -NoTypeInformation
    Write-Host -f Green "Files Inventory has been Exported to $CSVPath"

    #Get duplicate files by grouping hash code
    $Duplicates = $DataCollection | Group-Object -Property HashCode | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    Write-Host "Duplicate Files Based on File Hashcode:"
    $Duplicates | Format-Table -AutoSize

    #Group based on file name
    $FileNameDuplicates = $DataCollection | Group-Object -Property FileName | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    Write-Host "Potential Duplicates Based on File Name:"
    $FileNameDuplicates | Format-Table -AutoSize

    #Group based on file size
    $FileSizeDuplicates = $DataCollection | Group-Object -Property FileSize | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group
    Write-Host "Potential Duplicates Based on File Size:"
    $FileSizeDuplicates | Format-Table -AutoSize
}
Catch {
    Write-Host -f Red "Error:" $_.Exception.Message
}
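If you prefer PnP PowerShell over CSOM, the same inventory can be built with far less plumbing. The following is a minimal sketch, not a drop-in replacement: it assumes the PnP.PowerShell module is installed, that your module version supports interactive sign-in (newer versions may require additional authentication parameters), and that downloading each file to a temp folder for local hashing is acceptable. Adjust the site URL to your environment:

#Connect to the site (PnP.PowerShell)
Connect-PnPOnline -Url "https://crescenttech.sharepoint.com" -Interactive

$DataCollection = @()
$TempFile = Join-Path $env:TEMP "pnp-hash-temp.bin"

#Enumerate visible document libraries
ForEach($List in (Get-PnPList | Where-Object {$_.BaseType -eq "DocumentLibrary" -and -not $_.Hidden}))
{
    #Page through items to stay under the 5,000-item list view threshold; skip folders
    ForEach($Item in (Get-PnPListItem -List $List -PageSize 500 | Where-Object {$_.FileSystemObjectType -eq "File"}))
    {
        #Download the file to a temp path and hash it locally
        Get-PnPFile -Url $Item.FieldValues["FileRef"] -Path $env:TEMP -FileName "pnp-hash-temp.bin" -AsFile -Force
        $DataCollection += [PSCustomObject]@{
            FileName = $Item.FieldValues["FileLeafRef"]
            HashCode = (Get-FileHash -Path $TempFile -Algorithm MD5).Hash
            URL      = $Item.FieldValues["FileRef"]
        }
    }
}

#Group by hash to surface duplicates
$DataCollection | Group-Object -Property HashCode | Where-Object {$_.Count -gt 1} | Select-Object -ExpandProperty Group | Format-Table -AutoSize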
Comments:

How would you pull a report for a site with more than 5,000 items?
Reply: You've got to batch process with CAML! Refer here: SharePoint Online: How to Get All List Items from Large Lists (>5000 Items)
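For reference, here is a minimal sketch of that batching approach in CSOM, using CamlQuery.ListItemCollectionPosition to page through the list. It assumes the $Ctx and $List objects from the scripts above:

#Page through a large list in 2,000-item batches (assumes $Ctx and $List from the scripts above)
$Query = New-Object Microsoft.SharePoint.Client.CamlQuery
$Query.ViewXml = "<View Scope='RecursiveAll'><RowLimit Paged='TRUE'>2000</RowLimit></View>"
Do {
    $ListItems = $List.GetItems($Query)
    $Ctx.Load($ListItems)
    $Ctx.ExecuteQuery()

    #Process this batch of items here (e.g., hash each file as in the scripts above)

    #Advance to the next page; $null indicates the last page has been read
    $Query.ListItemCollectionPosition = $ListItems.ListItemCollectionPosition
} While ($Query.ListItemCollectionPosition -ne $null)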
Hi, thanks for this useful article! For more than 5,000 objects, a quick PnP cmdlet can also be used:
Get-PnPListItem -List $List -PageSize 5000
Reply: Right, PnP supports batch processing list items natively! But in CSOM, you have to change the script a bit, as sketched in the reply above.
Hi Salaudeen, first, your website is a gold mine. Second, do you have this script in PnP? Thanks!