New to Python/pandas.
Within a column called "URL", I am trying to strip "http://", "https://", and "www." from any URLs that contain them and keep only what comes after.
For example,
http://www.jhu.edu
http://www.brown.edu
http://https://www.amherst.edu
http://www.usc.edu
Should look like:
jhu.edu
brown.edu
amherst.edu
usc.edu
# example
import pandas as pd
data = {'colA': ['http://www.jhu.edu', 'http://www.brown.edu', 'http://https://www.amherst.edu', 'http://www.usc.edu']}
df = pd.DataFrame(data)
Use str.replace with a regex. The dot in "www." is escaped so that it matches a literal period rather than any character:
out = df['colA'].str.replace(r'https?://|www\.', '', regex=True)
In general, one can use .replace() to replace or remove (i.e., replace with the empty string) targeted substrings in Python. But one should be more careful when reformatting URLs, since a URL might end in someTextwww.org or arbitraryTextwww.com. That is, a reckless .replace() is only appropriate if you already know it is safe to remove every instance of a substring such as "www.".
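To make the risk concrete, here is a minimal sketch on a single made-up URL (the string below is only an illustration, not from your data). It compares removing "www." everywhere with stripping a prefix only when the string actually starts with it:

```python
url = "http://someTextwww.org"  # hypothetical URL whose domain happens to contain "www."

# Reckless: removes the substrings wherever they appear, including inside the domain.
reckless = url.replace("http://", "").replace("www.", "")

# Prefix-only: strip "http://" only if the string starts with it.
prefix = "http://"
careful = url[len(prefix):] if url.startswith(prefix) else url

print(reckless)  # someTextorg
print(careful)   # someTextwww.org
```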
Here are two functions that accomplish the reformatting you are after. The first one, string_formatter, uses the more reckless .replace() method, whereas the second is more careful (and should solve your problem in general), removing only the target strings that appear as prefixes.
# Here are the functions:
def string_formatter(stringList_to_format, subStrings_to_remove):
    # Reckless: removes every occurrence of each substring, wherever it appears.
    newStringList = stringList_to_format
    for j in subStrings_to_remove:
        newStringList = [k.replace(j, '') for k in newStringList]
    return newStringList

def careful_url_formatter(stringList_to_format, prefix_strings_to_remove):
    # Careful: removes the target strings only when they appear as prefixes.
    newStringList = stringList_to_format
    loopSwitch = False
    while not loopSwitch:
        checkList = newStringList
        for j in prefix_strings_to_remove:
            newStringList = [k[len(j):] if k.startswith(j) else k for k in newStringList]
        # Stop once a full pass over the prefixes changes nothing.
        loopSwitch = checkList == newStringList
    return newStringList
# Here is how to apply them to your URL formatting problem:
# collect your URLs in a list
original_url_list = ["http://harvard.edu", "http://https://www.harvard.edu", "https://stackoverflow.com", "www.https://aTougherExampleEndingInwww.org"]
# collect the substrings to remove in a list:
substringsToRemove = ['http://', 'www.', 'https://']
# Apply the correct reformatting function. Here, the example urls show how the first reformatting function might screw up, while the second is fine:
new_url_list_wrong = string_formatter(original_url_list, substringsToRemove)
new_url_list_correct = careful_url_formatter(original_url_list, substringsToRemove)
print(new_url_list_wrong, new_url_list_correct)
Notice that careful_url_formatter has a while loop, which handles cases where several of the prefixes have to be removed in sequence, as in your example.
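If the data lives in a pandas column, the same prefix-only idea can probably be written as one anchored regex instead of a loop. This is a sketch, not a drop-in replacement: the column names "URL" and "URL_clean" and the sample rows are assumptions for illustration. The ^ anchors the match at the start of the string, and the repeated group consumes any number of stacked prefixes.

```python
import pandas as pd

# Sample rows are assumptions for illustration.
df = pd.DataFrame({
    "URL": [
        "http://www.jhu.edu",
        "http://https://www.amherst.edu",
        "www.https://aTougherExampleEndingInwww.org",
    ]
})

# ^ matches only at the start; (?:...)+ removes one or more leading prefixes in a row.
prefix_pattern = r"^(?:https?://|www\.)+"
df["URL_clean"] = df["URL"].str.replace(prefix_pattern, "", regex=True)

print(df["URL_clean"].tolist())
# ['jhu.edu', 'amherst.edu', 'aTougherExampleEndingInwww.org']
```

Because the pattern is anchored, a "www." that appears later in the string is left alone.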