Identify Duplicate Records in UNIX
I am looking for a script/command to identify duplicate records by certain columns in a given file and write them to another file.
I would use the Unix sort command with the -u option to eliminate duplicates.
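For example, assuming an input file named in.txt (a made-up name for illustration):

    sort -u in.txt > out.txt

This keeps exactly one copy of each fully identical line in out.txt.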
Can you specify columns in sort -u? Could you please let me know the syntax for the following example?
So if I want to find duplicates in column one, what would the syntax be to list those duplicates in the output file?
That would eliminate duplicates based on the entire line.
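Note that sort can compare on just a key as well. With a pipe-delimited file:

    sort -t'|' -u -k1,1 in.txt

keeps one line per distinct value of field 1, but it keeps only the first line of each group rather than listing the duplicates, so -u alone cannot produce the wanted output.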
I think he wants to output all duplicate lines, with the duplication based
not on the entire line, but only on certain columns.
They are delimited by pipe characters. The above example suits my requirements.
1. Check duplicates on field 1
Lots of ways to do this. My solution involves sorting by the desired key(s), then feeding the sorted file into awk. awk will hold each line for one cycle so that it knows whether it is dealing with a single line or multiple lines (per key), and prints the group or not. The first solution below finds duplicates based on field 3. Change all five 3's in that script to 1's to find duplicates on field 1:

#!/bin/sh
sort -t\| -k3,3 in.txt |
awk -F\| '
function flushhold() {
    # print the held group only when it contains
    # more than one line, i.e. only the duplicates
    if (held > 1)
        printf "%s", hold
}
NR == 1 { holdkey = $3 }
{
    if ($3 == holdkey) {
        hold = hold $0 "\n"
        held++
    } else {
        flushhold()
        holdkey = $3
        hold = $0 "\n"
        held = 1
    }
}
END { flushhold() }
'
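As a quick illustration (the data here is made up for the example), suppose in.txt contains:

    a|x|10
    b|y|20
    c|z|10
    d|w|30

After sorting on field 3, the two lines sharing the key 10 form the only group with more than one member, so the script prints just:

    a|x|10
    c|z|10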
Use the same logic to check for duplicates based upon a combination of fields, for example fields 3 and 1 together. Sort on both keys:

sort -t\| -k3,3 -k1,1 in.txt |

and in the awk script use the concatenated key in all three places: holdkey = $3 $1 in the NR == 1 rule, if ($3 $1 == holdkey) for the comparison, and holdkey = $3 $1 in the else branch. For awk, expressions separated by a space mean concatenation. (Putting the delimiter between the fields, as in holdkey = $3 "|" $1, avoids accidental matches between different field combinations.) You will notice from the flushhold function that this script could just as easily output only the unique lines instead of the duplicates, or both. And if outputting both, you could write the unique lines to one file and the duplicate lines to another.
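As a sketch of that last idea (the output file names dups.txt and uniq.txt are made up for the example), only flushhold needs to change, using awk's print redirection to send each group to the appropriate file:

function flushhold() {
    if (held > 1)
        printf "%s", hold > "dups.txt"
    else if (held == 1)
        printf "%s", hold > "uniq.txt"
}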