I have a large dataset (400K+ rows by 4 columns). I’m currently grabbing the data into an array using the following (thanks to tanaike):
function getValues_({ spreadsheetId, sheetName, start = 1, maxRow, limit = 100000 }) {
  // Read `maxRow` rows in chunks of `limit`, concatenating the results.
  return [...Array(Math.ceil(maxRow / limit))].flatMap((_) => {
    const last = start - 1 + limit;
    // e.g. 'Sheet1'!A1:100000 — all columns from A, rows `start` through `last`.
    const range = `'${sheetName}'!A${start}:${last > maxRow ? maxRow : last}`;
    const temp = Sheets.Spreadsheets.Values.get(spreadsheetId, range).values;
    start += limit;
    return temp;
  });
}
function setValues_({ spreadsheetId, sheetName, start = 1, maxRow, limit = 100000, values }) {
  // Write `values` back in chunks of `limit` rows (splice consumes the array).
  Array.from(Array(Math.ceil(maxRow / limit))).forEach((_) => {
    const v = values.splice(0, limit);
    Sheets.Spreadsheets.Values.update({ values: v }, spreadsheetId, `'${sheetName}'!A${start}`, { valueInputOption: "USER_ENTERED" });
    start += limit;
  });
}
The first function reads data 100K rows at a time into an array; the second writes it back 100K rows at a time. Both work great. I then do some work on the array via loops.
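For reference, this is roughly how I call the two helpers (the spreadsheet ID and sheet name below are placeholders for my real ones, and both functions need the Sheets API advanced service enabled):

```javascript
// Placeholder ID and sheet name; the Sheets API advanced service is enabled.
const SPREADSHEET_ID = "###";
const SHEET_NAME = "Data";

function run() {
  const sheet = SpreadsheetApp.openById(SPREADSHEET_ID).getSheetByName(SHEET_NAME);
  // Read everything in 100K-row chunks.
  const values = getValues_({
    spreadsheetId: SPREADSHEET_ID,
    sheetName: SHEET_NAME,
    maxRow: sheet.getLastRow(),
  });
  // ... loop over `values` here ...
  // Write the (possibly shorter) array back in 100K-row chunks.
  setValues_({
    spreadsheetId: SPREADSHEET_ID,
    sheetName: SHEET_NAME,
    maxRow: values.length,
    values,
  });
}
```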
I now find myself needing to remove duplicates. Here is what my data looks like:
| ID | Content | Completion Date | Pathway |
| --- | --- | --- | --- |
| 1 | abc | 01/01/2024 | Apple |
| 1 | def | 01/01/2024 | Apple |
| 1 | ghi | 01/01/2024 | Apple |
| 1 | def | 01/11/2024 | Apple |
| 1 | abc | 01/01/2023 | Apple |
| 1 | abc | 01/01/2024 | Apple |
I would like to remove the rows that are duplicates based on ID, Content, and Pathway, keeping the row with the oldest Completion Date. The result should look like this:
| ID | Content | Completion Date | Pathway |
| --- | --- | --- | --- |
| 1 | def | 01/01/2024 | Apple |
| 1 | ghi | 01/01/2024 | Apple |
| 1 | abc | 01/01/2023 | Apple |
How would you approach this?
- Paste the data into the sheet, sort it, and THEN use `.removeDuplicates()`? (Rough sketch of what I mean just below.)
- Remove duplicates within the array with a loop and then paste into the sheet? (Rough sketch at the end of this post.)
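For the first option, here is an untested sketch of what I mean. It assumes the data sits in columns A:D with a header row, and that Completion Date holds real date values (USER_ENTERED pasting should produce them); since `removeDuplicates()` keeps the first row it encounters, sorting ascending on the date first should leave the oldest row standing:

```javascript
// Untested sketch: sort ascending on Completion Date so the oldest row of
// each group comes first, then let removeDuplicates() keep that first
// occurrence. Assumes data in A:D with a header in row 1 and real date
// values in column C (string dates would not sort chronologically).
function dedupeOnSheet_(sheet) {
  const range = sheet.getRange(2, 1, sheet.getLastRow() - 1, 4); // skip header
  range.sort({ column: 3, ascending: true });
  // Compare only ID (col 1), Content (col 2) and Pathway (col 4).
  range.removeDuplicates([1, 2, 4]);
}
```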
You should know:
- I already have the data in an array because I am looping through it to remove certain rows that have specific data in the Content column.
- With so many rows, I'm looking for the fastest solution possible.
- I have no problem figuring this out on my own if you can simply answer the question above, but a solution that provides a script would be appreciated, especially if the best method is a loop. I can manage simple datasets, but one this size presents a challenge for me.
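And here is my rough, untested attempt at the second option, in case it helps frame the question. It assumes `values` still includes the header row and that the API returned Completion Date as MM/DD/YYYY strings; output order follows each key's first appearance rather than the original row order:

```javascript
// Untested sketch of the loop approach: a single pass over the array using a
// Map keyed on ID|Content|Pathway, keeping whichever row has the oldest
// Completion Date. Assumes row 0 is the header and the dates are MM/DD/YYYY
// strings as returned by Sheets.Spreadsheets.Values.get.
function dedupeKeepOldest_(values) {
  const [header, ...rows] = values;

  // Parse an MM/DD/YYYY string into a comparable timestamp.
  const toTime = (s) => {
    const [m, d, y] = String(s).split("/").map(Number);
    return new Date(y, m - 1, d).getTime();
  };

  const seen = new Map();
  for (const row of rows) {
    const [id, content, date, pathway] = row;
    const key = `${id}|${content}|${pathway}`;
    const prev = seen.get(key);
    // Keep the row whose Completion Date (index 2) is oldest for this key.
    if (!prev || toTime(date) < toTime(prev[2])) seen.set(key, row);
  }
  return [header, ...seen.values()];
}
```

My thinking is that a single pass with a Map stays roughly linear, which seems like the best bet at 400K rows, but I'd welcome corrections.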