As a background, I have a database that I’m currently checking for duplicates. I have a file that contains a list of ids that have more than 1 row counts, but the problem is, it also counts the duplicates for the duplicate rows. For example, if Row 1 has a duplicate in Row 2 and Row 3, it would also analyze Row 2 and say that there’s a duplicate in Row 1 and 3.
To solve this, I made a script that will go through the id’s, execute a query to get the necessary fields, and execute another query using the fields to find the duplicate rows.
The ids of these rows will be saved an array, and then each subsequent id will be checked if it’s in the array. If they are, they will be skipped.
The problem is that the value in array
checking doesn’t work. Here is the code:
import psycopg2
import psycopg2.extensions
import csv
count = 0
total_duplicates = 0
duplicate_ids = []
try:
connection = psycopg2.connect(user="postgres",
password="postgres",
host="127.0.0.1",
port="5432",
database="my_database")
cursor = connection.cursor()
with open("duplicate_entries_id_dupes.csv", "r") as f:
reader = csv.reader(f, delimiter="t")
for i, line in enumerate(reader):
# Get the row data here. Get the ID.
entry = line[0]
entry_split = entry.split(',')
id = entry_split[0]
# If the id of the row is in the duplicate id, skip it already
if id in duplicate_ids:
print(f"*** id {id} of entry is in duplicate array, skipping ***")
continue
else:
print(f"id {id} is not in duplicate id array, will continue")
# Use the id, get the rest of the data here
fetch_query = f"my_query_here"
cursor.execute(fetch_query)
# Get the timestamp, vessel, hardware_id, scope, message, fields
record = cursor.fetchone()
# Get the details here.
dupe_finder_query = f"SELECT id FROM my_table_here
WHERE my_conditions_here
AND id != {id}
ORDER BY id DESC"
cursor.execute(dupe_finder_query)
dupe_finder_result = cursor.fetchall()
dupe_count = len(dupe_finder_result)
total_duplicates = total_duplicates + dupe_count
print(f"following data with id {id} has {dupe_count} duplicate(s):")
for dupe in dupe_finder_result:
dupe_id = dupe[0]
print(f" duplicate found with row id {dupe_id}")
duplicate_ids.append(dupe_id)
# print(f"Current duplicate id array: {duplicate_ids}")
print(f"Total Number of Duplicate Rows = {total_duplicates}.")
except (Exception, psycopg2.Error) as error:
print("Error while fetching data from PostgreSQL", error)
finally:
# closing database connection.
if connection:
cursor.close()
connection.close()
I printed the array just to check if it was populated correctly, and it was. I even copied the array and ran a small test:
test_array = [array with values]
input = int_in_array
if input in test_array:
print(f"input is in test array")
else:
print(f"input NOT in test array")
And it was working as intended.
Not sure why it doesn’t work once it was in the loop.