I have a list of image URLs in 'images'. I am trying to extract the title from each URL so that I can display the image (using the whole URL) together with its title on the HTML page.
So far I have this:
titles = [image[149:199].strip() for image in images]
This gives me the stripped title in the following format (I provide two examples to show the pattern)
le_Art_Project.jpg/220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg
and
cene_of_the_Prodigal_Son_-_Google_Art_Project.jpg/220px-Rembrandt_-_Rembrandt_and_Saskia_in_the_Scene_of_the_Prodigal_Son_-_Google_Art_Project.jpg
From the start of each string I would like to remove everything up to and including 220px-, and from the end I would like to remove _-_Google_Art_Project.jpg.
As a newcomer to Python, I am struggling with the syntax. Because I am doing this while looping over the list of images, the string manipulation is not straightforward, and I am unsure how to approach it.
The whole code for reference is below:
webscraper.py:
@app.route('/')  # routes are the paths we type into the browser to reach a page
@app.route('/home')
def home():
    images = imagescrape()
    titles = [image[99:247].strip() for image in images]
    images_titles = zip(images, titles)
    return render_template('home.html', images=images, images_titles=images_titles)
What I've tried / am trying:
x = txt.strip("_-_Google_Art_Project.jpg")
Looking into strip - to get rid of the last part of the unwanted string.
I am unsure of how to combine this with getting rid of the leading string that I want to remove and also do so in the most elegant way given the structure/code I already have.
Visually, I am trying to remove the leading text as shown highlighted, as well as the last part of the string which is _-_Google_Art_Project.jpg.
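One thing worth checking about the strip attempt: str.strip() treats its argument as a set of characters to remove from both ends, not as a literal substring, so it can also eat trailing letters of the title. The sketch below uses a made-up sample string (txt is only an illustration, not taken from the scraped data) to show the difference between strip() and cutting off the suffix explicitly.

```python
SUFFIX = "_-_Google_Art_Project.jpg"

# Hypothetical sample value, shortened for illustration
txt = "Self-Portrait_-_Google_Art_Project.jpg"

# strip() removes any of the characters in SUFFIX from both ends,
# so the final "t" of "Portrait" is removed as well
stripped = txt.strip(SUFFIX)

# Cutting the suffix off explicitly only removes that exact ending
trimmed = txt[:-len(SUFFIX)] if txt.endswith(SUFFIX) else txt

print(stripped)  # Self-Portrai
print(trimmed)   # Self-Portrait
```

On Python 3.9 or later, txt.removesuffix(SUFFIX) should give the same result as the explicit slice.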
Visual of HTML displayed:
UPDATE:
Based on an answer below, which is very helpful but doesn't quite solve it, I am trying this approach (without the unquote import if possible, using only plain Python string manipulation):
def titleextract(url):
    # return unquote(url[58:url.rindex("/", 58) - 8].replace('_', ''))
    title = url[58:]
    return title
The above returns:
Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg
but I want:
Rembrandt_van_Rijn_-_Self-Portrait
or for the second title/image in the list:
Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg/220px-Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist_-_Google_Art_Project.jpg
I want:
Rembrandt_van_Rijn_-_Saskia_van_Uylenburgh%2C_the_Wife_of_the_Artist
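A minimal sketch of one way to get that result without fixed slice offsets, assuming every URL ends in a segment of the form 220px-<title>_-_Google_Art_Project.jpg. The sample URL uses a placeholder host and path (example.org is an assumption, not the real scraped URL), and the partition on "px-" is a choice made here so that other thumbnail widths would also work.

```python
SUFFIX = "_-_Google_Art_Project.jpg"

def titleextract(url):
    # Keep only the last path segment, e.g. "220px-..._Google_Art_Project.jpg"
    filename = url.rsplit("/", 1)[-1]
    # Drop the thumbnail width prefix ("220px-") if it is present
    _, sep, rest = filename.partition("px-")
    title = rest if sep else filename
    # Drop the fixed ending if it is present
    if title.endswith(SUFFIX):
        title = title[:-len(SUFFIX)]
    return title

# Hypothetical URL: the host and leading path are placeholders
sample = ("https://example.org/thumb/"
          "Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg/"
          "220px-Rembrandt_van_Rijn_-_Self-Portrait_-_Google_Art_Project.jpg")

print(titleextract(sample))  # Rembrandt_van_Rijn_-_Self-Portrait
```

In the route this could replace the slice, for example titles = [titleextract(image) for image in images]. Percent-escapes such as %2C are left untouched, since decoding them would need unquote.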