I’m working on a PostgreSQL query that needs to extract the main domain (the second-level domain plus the top-level domain) from URLs. However, my current approach, using either REGEXP_MATCH or SUBSTRING, doesn’t handle all cases correctly. For example, it fails on values like {www.paylink.fr:443} (which should be paylink.fr) and {api.ornage.com} (which should be ornage.com). The REGEXP_MATCH version:
SELECT
    "Payments"."Id",
    REGEXP_MATCH("ClientRedirectUrl", '^https?://(?:[^/]+\.)?([^/]+)') AS "PaymentDomain",
    REGEXP_MATCH("ClientUri", '^https?://(?:[^/]+\.)?([^/]+)') AS "ClientDomain",
    REGEXP_MATCH("Site", '^https?://(?:[^/]+\.)?([^/]+)') AS "UserDomain"
FROM
    "public"."Payments"
    LEFT JOIN "public"."AppUsers" ON "Payments"."UserId" = "AppUsers"."Id"
    LEFT JOIN "public"."Clients" ON "Payments"."ClientId" = "Clients"."ClientId"
WHERE
    "PayDate" >= (NOW() + INTERVAL '-3 day');
The SUBSTRING version:
SELECT
    "Payments"."Id",
    SUBSTRING("ClientRedirectUrl" FROM '^https?://(?:[^/]+\.)?([^/]+)') AS "PaymentDomain",
    SUBSTRING("ClientUri" FROM '^https?://(?:[^/]+\.)?([^/]+)') AS "ClientDomain",
    SUBSTRING("Site" FROM '^https?://(?:[^/]+\.)?([^/]+)') AS "UserDomain"
FROM
    "public"."Payments"
    LEFT JOIN "public"."AppUsers" ON "Payments"."UserId" = "AppUsers"."Id"
    LEFT JOIN "public"."Clients" ON "Payments"."ClientId" = "Clients"."ClientId"
WHERE
    "PayDate" >= (NOW() + INTERVAL '-3 day');
How can I modify this query to correctly extract the main domains without any subdomains, ports, or query parameters?
Here is a sample of unparsed values:
ClientRedirectUrl: https://sonyco.it/wc-api/WC_paylink/?wc_order=673927
ClientUri: https://www.sonyco.it
Site: https://yzx80.paylink.net/
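For illustration, one possible direction is sketched below as a standalone query over the sample values above, so it can be tried without the real tables. It is only a sketch under assumptions, not a verified solution: the names samples, url, and main_domain are hypothetical, the scheme is treated as optional because some values appear without one, and the "main domain" is assumed to be simply the last two dot-separated labels of the host. The intent is to reduce www.paylink.fr:443 to paylink.fr and api.ornage.com to ornage.com.

```sql
-- Sketch only: "samples", "url" and "main_domain" are hypothetical names.
-- Assumption: the main domain is the last two dot-separated labels of the host.
SELECT
    url,
    SUBSTRING(
        url FROM
        -- optional scheme, optional subdomain labels, then the captured
        -- "label.label" pair, which must be followed by ':', '/', '?', '#'
        -- or the end of the string (so ports, paths and queries are dropped)
        '^(?:https?://)?(?:[^/:?#]+\.)?([^/:?#.]+\.[^/:?#.]+)(?:[/:?#]|$)'
    ) AS main_domain
FROM (VALUES
    ('https://sonyco.it/wc-api/WC_paylink/?wc_order=673927'),
    ('https://www.sonyco.it'),
    ('https://yzx80.paylink.net/'),
    ('www.paylink.fr:443'),
    ('api.ornage.com')
) AS samples(url);
```

The two-label assumption breaks for multi-part suffixes such as co.uk (example.co.uk would come back as co.uk) and for bare IP addresses; handling those correctly would need a lookup against something like the Public Suffix List rather than a regular expression alone.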