Implementing Random Feature at OpenGenus IQ

Internship at OpenGenus


I was tasked by OpenGenus to implement a random feature, quite literally: a page that immediately redirects the user to a random OpenGenus article. You can check it out here. In this article, we will walk through how I developed the random feature.

Feature Specification

I was tasked by the very gracious Aditya Chatterjee, as part of my internship, with implementing the random feature for OpenGenus. I was given a sitemap containing the links to all the articles published on OpenGenus, along with the directive: "implement a page that redirects users to a random article page."

Stages of implementation

  • Firstly, I needed a page that could redirect to another page
  • Secondly, I needed to get the URLs of all the published articles from the sitemap
  • Lastly, I needed to put it all together and create the page that randomly redirects to a new OpenGenus article

Stage one

I needed to implement the core redirection functionality, which turned out to be trivial. I just needed to include the JavaScript line

window.open(link_to_otherpage,"_self") 
// in a script tag in an HTML file

Stage two

I needed to extract the links from the sitemap. One option was to manually copy and paste all the links, but with more than two thousand links, that approach wouldn't be great. So I decided to scrape them.

Web scraping is the process of programmatically extracting information from a web page or group of web pages. I wrote a script that extracts all the links from the sitemap into a JSON object.

The Script

import json
import requests
from bs4 import BeautifulSoup


SITEMAP_URL = "sitemap"

def get_links():
    xml = requests.get(SITEMAP_URL).text
    bs_object = BeautifulSoup(xml, "xml")


    # <loc> is the XML element that contains a URL;
    # find all the locs and extract the URLs from them
    urls = bs_object.find_all("loc")
    urls = [url.get_text() for url in urls]
    return urls

def convert_articlelinks_to_json_string(article_links):
    articles = {
    "length":len(article_links),
    "links": article_links
    }
    return json.dumps(articles, indent=2)

 

Walking through the script

In the first lines, we do a bunch of imports: the json package, the requests package, and Beautiful Soup. requests lets us make HTTP requests to sites, while Beautiful Soup is a fully featured library for extracting data from HTML and XML files.
In get_links we first request the sitemap to get the XML that contains the links, pass the response to Beautiful Soup, use it to find all the links, and finally return them as a list. Meanwhile, convert_articlelinks_to_json_string, as the name implies, creates a JSON string with the links and the number of links as attributes.
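To see the loc extraction in isolation, here is a small, self-contained sketch that parses an inline sitemap snippet. It uses the standard library's xml.etree.ElementTree instead of Beautiful Soup, so it runs without third-party packages; the two-entry snippet is just a stand-in for the real sitemap:

```python
import xml.etree.ElementTree as ET

# a tiny inline sitemap, standing in for the real one
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://iq.opengenus.org/persistent-trie/</loc></url>
  <url><loc>https://iq.opengenus.org/exponential-linear-unit/</loc></url>
</urlset>"""

def get_links_from_string(xml_string):
    # sitemaps declare a namespace, so tags parse as "{...}loc";
    # matching on the tag suffix sidesteps the namespace entirely
    root = ET.fromstring(xml_string)
    return [el.text for el in root.iter() if el.tag.endswith("loc")]

print(get_links_from_string(SITEMAP_XML))
```

The suffix match is needed because the sitemaps.org namespace is folded into every tag name by ElementTree; Beautiful Soup's find_all("loc") hides that detail for you.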

Running the methods together, convert_articlelinks_to_json_string(get_links()) will result in this:

{
  "length": 2041,
  "links": [
    "https://iq.opengenus.org/persistent-trie/",
    "https://iq.opengenus.org/time-and-space-complexity-of-heap/",
    "https://iq.opengenus.org/project-on-reconstructing-face/",
    ...
    "https://iq.opengenus.org/ssd-model-architecture/",
    "https://iq.opengenus.org/exponential-linear-unit/"
  ]
}

 

Stage three

If we combine everything, copying the JSON object we created into our JS script and adding the mechanism for randomness, we get this:

<script>
    const articles = {
      "length": 2041,
      "links": [
        "https://iq.opengenus.org/persistent-trie/",
        "https://iq.opengenus.org/time-and-space-complexity-of-heap/",
        "https://iq.opengenus.org/project-on-reconstructing-face/",
        ...
        "https://iq.opengenus.org/ssd-model-architecture/",
        "https://iq.opengenus.org/exponential-linear-unit/"
      ]
    }

    const max_num_articles = articles.length
    // flooring random * length yields an index from 0 to length - 1,
    // so every index is within the bounds of the links array
    let random_index = Math.floor(Math.random() * max_num_articles)
    window.open(articles.links[random_index], "_self")
	</script>

Ta-da, we have satisfied the functionality: an HTML page with the JS script in it that immediately redirects you to a random OpenGenus article.
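One detail worth being careful about is the random index: Math.random() returns a value in [0, 1), so multiplying by the number of links (not the number plus one) and flooring always yields a valid index from 0 to length - 1. A quick Python sketch of the same idea (hypothetical, not part of the page) confirms the bounds:

```python
import random

NUM_ARTICLES = 2041  # length reported by the sitemap JSON

def random_index(num_articles):
    # random.random() is in [0, 1), so the floored product
    # is always in the range 0 .. num_articles - 1
    return int(random.random() * num_articles)

indices = [random_index(NUM_ARTICLES) for _ in range(10_000)]
print(min(indices) >= 0 and max(indices) < NUM_ARTICLES)  # True
```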

Refactoring / Improving the implementation

Although the above implementation satisfies the criteria for the feature, there are still avenues for improvement, as Aditya kindly pointed out to me. What happens when new articles are published or old ones are removed? With the current implementation, whenever that happens, someone would have to run the script, manually delete the old JSON object by hand, and copy and paste in the newly created one.

Since we were already generating a new JSON object every time, why not extend the script to also add the links directly into the HTML file, getting rid of the copy-and-paste step and automatically recreating the HTML file with the links updated? We could have a base template for the parts of the file that do not change, dynamically inject the JSON object into the HTML, and write out the resulting file.

We can have something like this for the template:

<!DOCTYPE html>
<html lang="en">

<head>
	
	<meta charset="UTF-8" />
	...
	<title>Random</title>
	<script>
        // articles will be dynamically inserted
        // articles is an object with keys: length (int) and links (array of URLs)
		const articles = {{  articles  }}
	// randomly select an article and redirect to there
	const max_num_articles = articles.length
	let random_index = Math.floor(Math.random() * max_num_articles)
	window.open(articles.links[random_index], "_self")
	</script>
</head>
<body></body>
</html>

 

The new script will then look like this:

# simple script to get OpenGenus article links from its sitemap
# and populate an HTML template with the data

import sys, json, re
import requests
from bs4 import BeautifulSoup


SITEMAP_URL = "sitemap"
TEMPLATE_PATH = "./random_template.html"
ACTUAL_FILE_PATH = "../random.html"

def get_links():
    xml = requests.get(SITEMAP_URL).text
    bs_object = BeautifulSoup(xml, "xml")


    # <loc> is the XML element that contains a URL;
    # find all the locs and extract the URLs from them
    urls = bs_object.find_all("loc")
    urls = [url.get_text() for url in urls]
    return urls

def convert_articlelinks_to_json_string(article_links):
    articles = {
    "length":len(article_links),
    "links": article_links
    }
    return json.dumps(articles, indent=2)

def update_html(articles, template_name=TEMPLATE_PATH, file_path=ACTUAL_FILE_PATH):
    """ replaces the {{ articles }} placeholder in the template with the real
        JSON string and writes out the actual HTML file
    """
    
    newcontent = ""
    with open(template_name) as file:
        newcontent = file.read()
    
    newcontent = re.sub(r"{{\s*articles\s*}}", articles, newcontent)
    
    with open(file_path, "w") as file:
        file.write(newcontent)

def main():
    # only 1 extra argument is allowed
    valid_args = ["--update", "--output"]
    if len(sys.argv[1:]) == 1 and sys.argv[1] in valid_args:
        links = get_links()
        articles = convert_articlelinks_to_json_string(links)
        if sys.argv[1] == "--update":
            update_html(articles)
        if sys.argv[1] == "--output":
            print(articles)
    else:
        print(f"invalid usage, accepted commands are {valid_args}")


if __name__ == "__main__":
    main()

Two new methods have been introduced in the script, update_html and main. update_html replaces the {{ articles }} placeholder in the template with the actual articles JSON, and main handles how the script is run from the command line.
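The substitution in update_html can be exercised on an in-memory string without touching any files; here is a minimal sketch (the three-line template string is a hypothetical stand-in for random_template.html):

```python
import json
import re

# stripped-down stand-in for the real template file
template = "<script>\nconst articles = {{ articles }}\n</script>"

# a tiny articles object, serialized the same way the script does
articles = json.dumps({"length": 2, "links": ["/a/", "/b/"]})

# same placeholder pattern as update_html, written as a raw string
rendered = re.sub(r"{{\s*articles\s*}}", articles, template)
print(rendered)
```

One caveat: re.sub treats backslashes in the replacement string specially, which is harmless here only because the JSON contains none; passing a function like lambda m: articles as the replacement is the fully safe form.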

So now, with the final script, we can dynamically recreate our HTML file by running random_script.py --update, and if we don't want to update the HTML file yet, the old functionality of just generating the list of links is preserved via random_script.py --output. Now there doesn't have to be a human intermediary at all; we could schedule the script to run automatically on every update of the sitemap.