I have 2 data sets. Set1 has roughly 129k lines, has an ID column and, importantly, has the 2nd column code, which I will need later. Set2 has only has the ID column. Set2 also has about 83k lines, which should all be within Set1. Both files/datasets should be in order by ID, if that matters. Both should be as strings, but all IDs are padded to 10 digits long.
Set1=
0000000002,0
0000012345,1
0000023456,0
0000012345,2
0000056789,1
0000034567,0
0000078908,1
Set2=
0000000002
0000012345
0000056789
0000034567
I am trying to find all IDs within Set1 that are not in Set2, including that 2nd column for each ID from dataset 1. I also need all IDs within Set1 that are in Set2, also including that 2nd column for each ID.
I also need subsets for each of those:
-
All IDs within Set1 that are not in Set2, including that code column, separated by the code 0,1,2.
-
All IDs within Set1 that are in Set2, including that code column, separated by the code 0,1,2.
The main issues I am having are the fact that the datasets are mismatched in that the main dataset has 2 columns, and the comparing dataset only has 1. And the fact that the datasets are so large, that I am trying to do this efficiently so it doesn’t take a large amount of time, since this is a daily running program, and has other things to run after I get these subset lists.
-
I have tried to import as arrays and compare them to get the sets of data I need, but it takes WAY too long to be practical. I am talking like 1 hour+. that is not even getting to the point that I need to subset them based on the 2nd column.
$IDsNotIn2 = $Array1 | Where {$Array2 -NotContains $_}
-
I have tried to use hashtables, and could not get the compares to work. I imported the csv’s as hashtables. Then called Compare-Object. They returned nothing / showed nothing when called.
import-CSV -Path $FILE_LOC | ForEach-Object { $hashtable1[$_.ID] = $_.Code } import-CSV -Path $FIXED_FILE_LOC -Header ID | ForEach-Object { $hashtable2[$_.ID] } Compare-Object -ReferenceObject $hashtable1 -DifferenceObject $hashtable2 -Property ID -PassThru | ? {$_.SideIndicator -eq "<="} | Select-Object -Property * -ExcludeProperty SideIndicator Compare-Object -ReferenceObject $hashtable1 -DifferenceObject $hashtable2
-
I have tried .GetEnumerator and iterating over the first hashtable and if the key exists in the 2nd, creating a 3nd table with the new values. But for whatever reason, all the values from the first table get put into the new table.
$hashtable1.GetEnumerator() | ForEach-Object { IF (-Not ($hashtable2.ContainsKey($_.key))){ $ID = $_.key $columncode = $_.value $ID_Not_in_First_File[$ID] = @{ $Code= $columncode } } }
Zeke S is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.