I have a more or less simple key-value database that maps a path to a Debian package name. To normalize the database and save disk space, the package names are stored in a packages table and referred to via an ID:
CREATE TABLE packages (
    id integer PRIMARY KEY NOT NULL,
    package text NOT NULL UNIQUE
);
CREATE TABLE path_package (
    path text PRIMARY KEY NOT NULL,
    package_id integer NOT NULL,
    FOREIGN KEY (package_id) REFERENCES packages(id)
);
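For illustration, several paths that belong to the same package share a single row in packages (the package name, paths, and ID below are made-up example values):

INSERT INTO packages (package) VALUES ('coreutils');
-- assuming the row above was assigned id 1:
INSERT INTO path_package VALUES ('/usr/bin/ls', 1);
INSERT INTO path_package VALUES ('/usr/bin/cat', 1);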
Adding a new key-value pair needs two to three SQL statements in the most naive implementation:
import collections.abc
import sqlite3


class Path2Package(collections.abc.MutableMapping[str, str]):
    def __init__(self, database: str) -> None:
        self.connection = sqlite3.connect(database)

    def __setitem__(self, path: str, package: str) -> None:
        found = self.connection.execute(
            "SELECT id FROM packages WHERE package = ?", (package,)
        ).fetchone()
        if found is None:
            found = self.connection.execute(
                "INSERT INTO packages (package) VALUES(?) RETURNING id", (package,)
            ).fetchone()
        assert found is not None
        package_id = found[0]
        self.connection.execute(
            "INSERT INTO path_package VALUES(?, ?) "
            "ON CONFLICT(path) DO UPDATE SET package_id=excluded.package_id",
            (path, package_id),
        )

    # __getitem__, __delitem__, __iter__, and __len__ omitted for brevity
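Usage then looks like this (the database file name and the entries are just example values):

path2package = Path2Package("path2package.db")
path2package["/usr/bin/ls"] = "coreutils"
path2package["/usr/bin/cat"] = "coreutils"
path2package.connection.commit()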
Since this table can have millions of entries, creating those entries is performance-relevant. I want to keep the amount of Python code small, since Python is much slower than the sqlite3 module's C implementation. Is there a way to combine those SQL statements into one so that it can be used with connection.executemany?
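To illustrate what I mean by combining: even a sketch like the following (untested; insert_many and pairs are just placeholder names) still needs two statements per batch, one to create missing packages rows and one to resolve the ID via a scalar subquery:

import sqlite3


def insert_many(connection: sqlite3.Connection, pairs: list[tuple[str, str]]) -> None:
    # Ensure every package name has a row; INSERT OR IGNORE skips names
    # that already exist thanks to the UNIQUE constraint on package.
    connection.executemany(
        "INSERT OR IGNORE INTO packages (package) VALUES (?)",
        [(package,) for _path, package in pairs],
    )
    # Resolve the package ID with a scalar subquery instead of a separate
    # SELECT round-trip through Python.
    connection.executemany(
        "INSERT INTO path_package VALUES (?, (SELECT id FROM packages WHERE package = ?)) "
        "ON CONFLICT(path) DO UPDATE SET package_id = excluded.package_id",
        [(path, package) for path, package in pairs],
    )

Can this be reduced to a single statement per (path, package) pair?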
Context: I want a fast implementation that also works on low-memory systems and consumes less on-disk space. This is to fix the bug "apport-retrace needs more than 1.5 GB memory (when using sandbox)". Different implementations and benchmark results can be found at https://github.com/bdrung/apport/blob/benchmark-databases/Benchmark.md.