Is there a way to find all identical files in a directory containing over a hundred thousand files? (many solutions found)
- BigSnake.NET
- Posts: 12522
- Joined: 2006-07-02 11:16
- From: Guangzhou
- Contact:
First, tally the file sizes.
Then, for each size shared by more than one file, run md5sum and sort/group the results.
Finally, diff the files with identical MD5 sums one by one.
Code: Select all
du -ab *|sort
^_^ ~~~
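The three steps above (size, then md5sum, then a byte-for-byte diff) can be sketched in Python; the function name and structure here are my own, not from the post:

```python
import hashlib
import os
import filecmp
from collections import defaultdict

def duplicate_groups(paths):
    """Group paths by size, then by MD5, then confirm byte-for-byte."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size cannot have a duplicate
        by_md5 = defaultdict(list)
        for p in same_size:
            with open(p, 'rb') as f:
                by_md5[hashlib.md5(f.read()).hexdigest()].append(p)
        for same_md5 in by_md5.values():
            # Final 'diff' step: keep only files that really match
            # the first one byte-for-byte.
            confirmed = [p for p in same_md5[1:]
                         if filecmp.cmp(same_md5[0], p, shallow=False)]
            if confirmed:
                groups.append([same_md5[0]] + confirmed)
    return groups
```

The closing `filecmp.cmp` pass mirrors the diff step: it guards against the (unlikely) case of an MD5 collision.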
To understand recursion, you must first understand recursion.
As everyone on Earth knows: in theory there is no difference between theory and practice, but in practice the difference is considerable.
-
- Posts: 1187
- Joined: 2006-04-29 14:54
- From: Shandong
- Contact:
For files with the same name you can use find. For files with identical content, the approach above still seems much better: compare sizes first, then compare only among files of the same size. If md5sum feels too slow, you could also try using od to compare the hex dump of just part of each file, say the last block.
I'm not sure whether this idea actually works; I don't know much about it.
A few tools I also saw on freshmeat:
Dupseek finds and interactively removes duplicate files. It aims at maximum efficiency by keeping file reads to a minimum and is much better than other similar programs when dealing with groups of large files of the same size.
FDUPES is a program for identifying or deleting duplicate files residing within specified directories.
DupeFinder is a simple application for locating, moving, renaming, and deleting duplicate files in a directory structure. It's perfect both for users who haven't kept their hard drives very well organized and need to do some cleaning to free space, and for users who like to keep lots of backup copies of important data "just in case" something bad should happen.
weedit is a file duplicate scanner with database support. It uses CRC32, MD5, and file size to scan for duplicates. Files that are deleted are automatically removed from the database when a duplicate is found. It will only rescan files if the creation time or last write time change. It will only delete duplicated files if the parameter for deleting is used. The default setting is to report only.
whatpix is a Perl console application which finds (and optionally moves or deletes) duplicate files.
dupliFinder is a graphical tool that searches directories on your computer for duplicate files by checking and comparing the MD5 sum of each file. This means that the contents of the file are examined, not the filename. You then have the option of reviewing the duplicate files and then deleting them. It's great for finding duplicates in your MP3, image, or movie collections.
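The od-on-part-of-the-file idea mentioned earlier can be sketched as a cheap pre-filter in Python; the helper name and the 4 KiB sample size are my own assumptions:

```python
import os

def tail_sample(path, nbytes=4096):
    # Read only the last nbytes of a file -- analogous to inspecting
    # the last block with od instead of checksumming the whole file.
    with open(path, 'rb') as f:
        f.seek(max(0, os.path.getsize(path) - nbytes))
        return f.read()
```

Two same-sized files whose tail samples differ cannot be identical, so a full md5sum is only needed when the samples match; this keeps file reads to a minimum, much as Dupseek claims to.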
- oneleaf
- Forum administrator
- Posts: 10441
- Joined: 2005-03-27 0:06
- System: Ubuntu 12.04
Sorry, I'm late seeing this.
I wrote a Python script that should be fairly fast. It skips MD5 and uses CRC32 directly: first it finds files of the same size, then computes CRC32 only within those groups, and prints the lists of identical files.
Code: Select all
#!/usr/bin/env python
# coding=utf-8
import binascii
import os

filesizes = {}   # size -> list of paths with that size
samefiles = []   # groups of byte-identical files

def filesize(path):
    # Recursively walk path, grouping files by size.
    if os.path.isdir(path):
        for name in os.listdir(path):
            filesize(os.path.join(path, name))
    else:
        filesizes.setdefault(os.path.getsize(path), []).append(path)

def filecrc(files):
    # Among same-sized files, group by CRC32 of the full contents.
    filecrcs = {}
    for file in files:
        with open(file, 'rb') as f:
            crc = binascii.crc32(f.read())
        filecrcs.setdefault(crc, []).append(file)
    for filecrclist in filecrcs.values():
        if len(filecrclist) > 1:
            samefiles.append(filecrclist)

if __name__ == "__main__":
    filesize("/home/oneleaf/test/")
    for sizesamefilelist in filesizes.values():
        if len(sizesamefilelist) > 1:
            filecrc(sizesamefilelist)
    for samefile in samefiles:
        print("******* same files group **********")
        for file in samefile:
            print(file)
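One caveat with the script above: f.read() pulls each whole file into memory at once, which hurts with very large files. A chunked CRC32 (a variant of my own, not part of the original script) avoids that:

```python
import binascii

def crc32_chunked(path, chunksize=1 << 20):
    # Feed the file through binascii.crc32 one 1 MiB chunk at a time,
    # carrying the running CRC forward instead of reading it all at once.
    crc = 0
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunksize), b''):
            crc = binascii.crc32(chunk, crc)
    return crc & 0xffffffff
```

Since CRC32 is only 32 bits, collisions are plausible over a large corpus; following up with a byte-for-byte comparison, as the first reply suggests with diff, is a sensible final check.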