Tool to convert Article to summary and audio with translation feature

This article at OpenGenus introduces a prototype of the Article Convertor, Summarizer and Translator Tool, developed by Ambarish Deb for OpenGenus IQ as part of an internship program. We shall go through the entire scope of the product, starting from the need behind it, then its initialization, and finally each of its use cases and their implementation.

Role Fulfilled by this Tool

Nowadays, there is no dearth of quality articles available online on any subject one may wish to explore. However, with a hectic daily schedule, one may not find enough time to read every day, and may be more comfortable having a voice narrate the article text so that content can be consumed on the go. Reading is also much more difficult for people with vision problems, who can likewise benefit from this product.

This product can also translate content written in other languages, thereby extending the user's reach to articles and content in foreign languages as well.

Initialization

Let's start with the basics. To run this product, you should be able to write and run Python code on your system. Optionally, you can also install Jupyter Notebook for a smoother experience while running Python code.

Before we move on to the product, there are certain requirements for this project which need to be installed. You may refer here for a tutorial on systemwide installation, and do a localised installation to jupyter notebooks, if you choose to use the latter.

Once we are done with installation of the requirements, we can proceed further.

Use Cases

We shall look at the implementation of each use case of the product. After that, we shall make these implementations accessible to a layperson through a few command-line commands, providing a no-code way to use the product.

Convert URL contents to raw audio-

First, since a user may enter either absolute or relative addresses for the input and output files, the system must be able to handle both. For this, we define a function to process file paths. If the address is relative, we prompt the user for the full path of the directory it is relative to, switch to that directory, and build the absolute path; otherwise we use the provided address as is.

import os
def process_file_path(file_path):
    # Convert the file path to an absolute path if it is relative
    if not os.path.isabs(file_path):
        linkingpath = input("Enter full path till BEFORE required location/file separated by double forward slashes(//): ")
        os.chdir(linkingpath)
        current_dir = os.getcwd()  # Get the current working directory
        full_file_path = os.path.join(current_dir, file_path)  # Create the absolute path
    else:
        full_file_path=file_path
    return full_file_path
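
As a variation, a relative path can also be resolved without prompting the user or changing the working directory. The sketch below is not part of the prototype: `resolve_path` and its `base_dir` parameter are hypothetical names, and it assumes that relative paths should be interpreted against the current working directory unless a base directory is given.

```python
import os


def resolve_path(file_path, base_dir=None):
    """Return an absolute, normalised version of file_path.

    Hypothetical helper: relative paths are resolved against base_dir
    (or the current working directory) without calling os.chdir.
    """
    if os.path.isabs(file_path):
        return os.path.normpath(file_path)
    base = base_dir if base_dir is not None else os.getcwd()
    return os.path.normpath(os.path.join(base, file_path))
```

Because nothing is read from standard input, a helper of this kind is easier to call from scripts and from the command-line wrapper.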

For this use case, the flow of tasks shall be-

1. Defining the function and parameters

We define the identifier and the input parameters for this function.

def url_to_raw_audio(input_urls_file,output_file_location):

2. Importing the required libraries

    from gtts import gTTS
    import requests
    from bs4 import BeautifulSoup
    import re
    import os

3. Reading the links

We process the addresses provided as parameters, and then open the input file. We then read the input file line by line, treating each line as one URL.

    input_urls_file=process_file_path(input_urls_file)
    output_file_location=process_file_path(output_file_location)
    with open(input_urls_file,'r') as file_in:
        urls = []
        for line in file_in:
            urls.append(line.strip())
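
The reading step can also be written so that blank lines in the input file are ignored. This is only a sketch under that assumption; `read_urls` is a hypothetical helper name, not a function of the prototype.

```python
def read_urls(path):
    """Return the non-empty lines of a text file as a list of URLs.

    Hypothetical helper: blank lines are skipped so that they are not
    passed to the HTTP client later on.
    """
    with open(path, "r", encoding="utf-8") as file_in:
        return [line.strip() for line in file_in if line.strip()]
```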

4. Web Scraping

We take each URL, and then scrape the contents of that page using BS4.

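    # Check for an empty list before looping; inside the loop the check can never trigger
    if len(urls)==0:
        print("No URLs in input.txt")
    i=1
    for url in urls:
        # Make an HTTP request to the URL and parse the returned HTML
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')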

5. Filtering out unnecessary content

The page contains a lot of unnecessary material, and we're only interested in the text content of the article. Hence, we filter out everything else.

  
        # The last two entries are CSS class names rather than tag names, so they match no tag here
        unnecessary_tags=['script','a', 'style', 'table', 'iframe', 'aside','pre','ins','li','ul','google-auto-placed ap_container','L-Affiliate-Tagged']
        for elem in soup(unnecessary_tags):
            elem.extract()

        text = soup.get_text()
        text = ''.join(text).strip()
        topic_name = soup.find('h1',class_='post-full-title')
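
To see the filtering idea in isolation, here is a dependency-free sketch that uses Python's built-in `html.parser` instead of BeautifulSoup. It is not the prototype's code: `TextExtractor` and `extract_text` are hypothetical names, and the sketch assumes every skipped tag has a matching closing tag.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect page text while ignoring the contents of selected tags."""

    def __init__(self, skip_tags):
        super().__init__()
        self.skip_tags = set(skip_tags)
        self.depth = 0      # greater than 0 while inside a skipped tag
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in self.skip_tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.skip_tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside every skipped tag
        if self.depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html, skip_tags=("script", "style", "pre")):
    extractor = TextExtractor(skip_tags)
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.parts)


sample = "<h1>Binary Trees</h1><script>var x = 1;</script><p>A tree has nodes.</p><pre>code()</pre>"
print(extract_text(sample))  # Binary Trees A tree has nodes.
```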

6. Structuring

We now provide an opening and closing statement to the text.

        intro_text = f'Today we will learn about {topic_name.text} '

        outro_text=f'\nThis was all about {topic_name.text} . For code files and additional information, please visit www.iq.opengenus.org/'

        full_text = intro_text+'\n'+text+'\n'+outro_text

7. Conversion to Text and then to Audio

We now write the full text to a .txt file, and then use the gTTS library to convert the text to speech. gTTS takes the parameters text, the text to be spoken; lang, the language of the text; and slow, whether the audio should be read at a slower speed. We repeat the procedure until all the links have been converted to audio.

        with open(topic_name.get_text()+'.txt', 'a') as f:
            # full_text is a string, so this writes it one character at a time
            for word in full_text:
                try:
                    f.write(word)
                except Exception:
                    f.write('error'+"\n")
        name=topic_name.get_text()+" raw.mp3"
        newpath2=output_file_location.split("//")
        newpath=output_file_location.split("//")
        newpath.append(name)
        output_file_location="\\".join(newpath)
        language = 'en'
        speech = gTTS(text=full_text, lang=language, slow=False)
        speech.save(output_file_location)
        # Attempts to play a file named hello.mp3 with mpg321; this line can be dropped, since no such file is created here
        os.system("mpg321 hello.mp3")
        # Restore the directory path for the next URL
        output_file_location="//".join(newpath2)
        print(f"Converted URL {i} to raw audio")
        i+=1
    print("Converted all URLs to raw audio")
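
The output path can also be assembled with `os.path.join`, which uses the correct separator for the operating system. The sketch below is an illustration rather than the prototype's code: `build_audio_path` is a hypothetical name, and the replacement of characters that Windows does not allow in file names is an added assumption, since article titles often contain `?` or `/`.

```python
import os
import re


def build_audio_path(output_dir, title, suffix=" raw.mp3"):
    """Join a directory and an article title into an output file path.

    Hypothetical helper: characters that are invalid in Windows file
    names are replaced with an underscore before joining.
    """
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return os.path.join(output_dir, safe_title + suffix)
```

With this approach the directory string is never modified, so it does not need to be restored after each URL.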

Convert URL contents to summarised audio-

For this use case, the flow of tasks shall be similar to the previous case, except in one place where we summarise the text. The flow shall be-

1. Defining the function and parameters

def url_to_summarised_audio(input_urls_file,output_file_location):

2. Importing the required libraries

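    from gtts import gTTS
    import requests
    from bs4 import BeautifulSoup
    import re
    import os
    # sumy classes used in the summarization step
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.text_rank import TextRankSummarizer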

3. Reading the links

    input_urls_file=process_file_path(input_urls_file)
    output_file_location=process_file_path(output_file_location)
    with open(input_urls_file,'r') as file_in:
        urls = []
        for line in file_in:
            urls.append(line.strip())

4. Web Scraping

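    # Check for an empty list before looping; inside the loop the check can never trigger
    if len(urls)==0:
        print("No URLs in input.txt")
    i=1
    for url in urls:
        # Make an HTTP request to the URL and parse the returned HTML
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')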

5. Filtering out unnecessary content

        unnecessary_tags=['script','a', 'style', 'table', 'iframe', 'aside','pre','ins','li','ul','google-auto-placed ap_container','L-Affiliate-Tagged']
        for elem in soup(unnecessary_tags):
            elem.extract()
        text = soup.get_text()
        text = ''.join(text).strip()
        topic_name = soup.find('h1',class_='post-full-title')

6. Summarization

We use the sumy library to summarise the text: PlaintextParser parses the string and TextRankSummarizer ranks its sentences. The second argument passed to the summarizer specifies the number of sentences to keep. E.g. if it is 10, the whole text is reduced to its 10 highest-ranked sentences.

        parser = PlaintextParser.from_string(text, Tokenizer("english"))
        summarizer = TextRankSummarizer()
        summary = summarizer(parser.document, 10)
        text_summary = ""
        for sentence in summary:
            text_summary += str(sentence)
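
To make the idea of a sentence budget concrete, here is a small standard-library sketch of extractive summarization. It is deliberately not TextRank and does not use sumy: it scores each sentence by the average frequency of its words, which is a much simpler technique chosen only for illustration. The function name `summarise` is hypothetical.

```python
import re
from collections import Counter


def summarise(text, sentence_count):
    """Pick the sentence_count highest-scoring sentences, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    frequency = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Average word frequency, so long sentences are not favoured
        return sum(frequency[t] for t in tokens) / max(len(tokens), 1)

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentence_count])  # restore document order
    return " ".join(sentences[i] for i in keep)


text = "Trees store data. Trees have nodes and trees have edges. Cats sleep."
print(summarise(text, 1))  # Trees have nodes and trees have edges.
```

Joining the selected sentences with a space also keeps them from running into each other in the output.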

7. Structuring

        intro_text = f'Today we will learn about {topic_name.text} '

        outro_text=f'\nThis was all about {topic_name.text} . For code files and additional information, please visit www.iq.opengenus.org/'

        full_text = intro_text+'\n'+text_summary+'\n'+outro_text

8. Conversion to Text and then to Audio

        name=topic_name.get_text()+" summarised.mp3"
        newpath2=output_file_location.split("//")
        newpath=output_file_location.split("//")
        newpath.append(name)
        output_file_location="\\".join(newpath)
        language = 'en'
        speech = gTTS(text=text_summary, lang=language, slow=False)
        speech.save(output_file_location)
        # Attempts to play a file named hello.mp3 with mpg321; this line can be dropped, since no such file is created here
        os.system("mpg321 hello.mp3")
        # Restore the directory path for the next URL
        output_file_location="//".join(newpath2)
        print(f"Converted URL {i} to summarised audio")
        i+=1
    print("Converted all URLs to summarised audio")

The following use cases are just partial implementations of the two specified above.

Convert URL contents to raw text-

For this use case, the flow of tasks shall be-

1. Defining the function and parameters

def url_to_raw_text(input_urls_file,output_file_location):

2. Importing the required libraries

    from gtts import gTTS
    import requests
    from bs4 import BeautifulSoup
    import re
    import os

3. Reading the links

    input_urls_file=process_file_path(input_urls_file)
    output_file_location=process_file_path(output_file_location)
    with open(input_urls_file,'r') as file_in:
        urls = []
        for line in file_in:
            urls.append(line.strip())

4. Web Scraping

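    # Check for an empty list before looping; inside the loop the check can never trigger
    if len(urls)==0:
        print("No URLs in input.txt")
    i=1
    for url in urls:
        # Make an HTTP request to the URL and parse the returned HTML
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')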

5. Filtering out unnecessary content

        unnecessary_tags=['script','a', 'style', 'table', 'iframe', 'aside','pre','ins','li','ul','google-auto-placed ap_container','L-Affiliate-Tagged']
        for elem in soup(unnecessary_tags):
            elem.extract()
        text = soup.get_text()
        text = ''.join(text).strip()
        topic_name = soup.find('h1',class_='post-full-title')

6. Structuring

        intro_text = f'Today we will learn about {topic_name.text} '

        outro_text=f'\nThis was all about {topic_name.text} . For code files and additional information, please visit www.iq.opengenus.org/'

        full_text = intro_text+'\n'+text+'\n'+outro_text

7. Conversion to Text

        with open(topic_name.get_text()+'.txt', 'a') as f:
            for word in full_text:
                try:
                    f.write(word)
                except Exception:
                    f.write('error'+"\n")
        i+=1
    print("Converted all URLs to raw text")

# Running the function
input_urls_file="ENTER ADDRESS INCLUDING FILENAME SEPARATED BY DOUBLE SLASHES(\\)"
output_file_location="ENTER ADDRESS TO THE WORKING DIRECTORY SEPARATED BY DOUBLE SLASHES(\\)"
url_to_raw_text(input_urls_file,output_file_location)

Convert URL contents to summarised text-

For this use case, the flow of tasks shall be-

1. Defining the function and parameters

def url_to_text_summary(input_urls_file,output_file_location):

2. Importing the required libraries

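    from gtts import gTTS
    import requests
    from bs4 import BeautifulSoup
    import re
    import os
    # sumy classes used in the summarization step
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.text_rank import TextRankSummarizer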

3. Reading the links

    input_urls_file=process_file_path(input_urls_file)
    output_file_location=process_file_path(output_file_location)
    with open(input_urls_file,'r') as file_in:
        urls = []
        for line in file_in:
            urls.append(line.strip())

4. Web Scraping

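    # Check for an empty list before looping; inside the loop the check can never trigger
    if len(urls)==0:
        print("No URLs in input.txt")
    i=1
    for url in urls:
        # Make an HTTP request to the URL and parse the returned HTML
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')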

5. Filtering out unnecessary content

        unnecessary_tags=['script','a', 'style', 'table', 'iframe', 'aside','pre','ins','li','ul','google-auto-placed ap_container','L-Affiliate-Tagged']
        for elem in soup(unnecessary_tags):
            elem.extract()
        text = soup.get_text()
        text = ''.join(text).strip()
        topic_name = soup.find('h1',class_='post-full-title')

6. Summarization

        parser = PlaintextParser.from_string(text, Tokenizer("english"))
        summarizer = TextRankSummarizer()
        summary = summarizer(parser.document, 10)
        text_summary = ""
        for sentence in summary:
            text_summary += str(sentence)

7. Structuring

        intro_text = f'Today we will learn about {topic_name.text} '

        outro_text=f'\nThis was all about {topic_name.text} . For code files and additional information, please visit www.iq.opengenus.org/'

        full_text = intro_text+'\n'+text_summary+'\n'+outro_text

8. Conversion to Text

        with open(topic_name.get_text()+'.txt', 'a') as f:
            for word in text_summary:
                try:
                    f.write(word)
                except Exception:
                    f.write('error'+"\n")
    print("Converted all URLs to summarised text")

# Running the function
input_urls_file="ENTER ADDRESS INCLUDING FILENAME SEPARATED BY DOUBLE SLASHES(\\)"
output_file_location="ENTER ADDRESS TO THE WORKING DIRECTORY SEPARATED BY DOUBLE SLASHES(\\)"
url_to_text_summary(input_urls_file,output_file_location)

Summarise the provided text-

For this use case, the flow of tasks shall be-

1. Define Function Name & Parameters

def text_to_summary(input_file,output_file_full_name):

2. Process Addresses and Read File

    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.text_rank import TextRankSummarizer
    input_file=process_file_path(input_file)
    output_file_full_name=process_file_path(output_file_full_name)
    with open(input_file, 'r', encoding="utf-8") as file:
        text = file.read()

3. Summarization

    output_file = output_file_full_name.split("//")[-1]
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = TextRankSummarizer()
    summary = summarizer(parser.document, 10)
    text_summary = ""
    for sentence in summary:
        text_summary += str(sentence)

4. Export as text

    with open(output_file_full_name, "w", encoding="utf-8") as text_file:
        text_file.write(text_summary)

Convert Audio to Video-

This use case is a bit different as it warrants the conversion of audio to video. The flow of tasks shall be-

1. Define function name and parameters

def convert_audio_to_video(input_file, output_file):

2. Process addresses and get audio from link

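    import cv2
    import numpy as np
    # moviepy 1.x import path; moviepy 2.x exposes AudioFileClip from the top-level moviepy package
    from moviepy.editor import AudioFileClip
    input_file=process_file_path(input_file)
    output_file=process_file_path(output_file)
    audio = AudioFileClip(input_file)
    name = input_file.split("//")[-1][:-4]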

3. Get the duration and frame rate of the audio

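    duration = audio.duration
    # Note: for an audio clip, fps is the audio sampling rate (e.g. 44100), not a typical video frame rate
    fps = audio.fps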

4. Instantiate a blank video clip with the same duration as the audio

    width, height = 640, 480
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(output_file, fourcc, fps, (width, height))

5. For each frame of the video-

    for t in np.arange(0, duration, 1/fps):

6. Create a blank frame

        frame = np.zeros((height, width, 3), dtype=np.uint8)

7. Add the text to the frame using OpenCV

        font = cv2.FONT_HERSHEY_SIMPLEX
        text_position = (int(width / 2 - len(name) * 5), int(height / 2))
        cv2.putText(frame, name, text_position, font, 1, (255, 255, 255), 2, cv2.LINE_AA)

8. Write the frame to the video

        writer.write(frame)

9. When all frames are done, release the video to the working directory.

    writer.release()
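
Since the loop writes one frame for every 1/fps seconds, the number of frames is roughly the duration multiplied by the frame rate. The short calculation below illustrates how strongly the choice of rate affects the amount of work; the value of 24 frames per second is only an assumed, conventional video rate and is not taken from the prototype, and `frame_count` is a hypothetical helper name.

```python
import math


def frame_count(duration_seconds, fps):
    """Number of frames needed to cover duration_seconds at fps frames per second."""
    return math.ceil(duration_seconds * fps)


# A 3-minute narration at an assumed video rate of 24 frames per second
print(frame_count(180, 24))       # 4320
# The same narration if a 44100 Hz audio sampling rate is used as the frame rate
print(frame_count(180, 44100))    # 7938000
```

For a static title card, a low fixed frame rate keeps both the conversion time and the file size small.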

Translate the provided text-

For this use case, the flow of tasks shall be-

1. Define Function Name & Parameters

def translate_text_file(input_file, output_file):

2. Provide options for translation and read the file

We'll use the googletrans library, an unofficial Python client for Google Translate, for this code. To make it easier for users, we include an option to list the supported language codes. Then, we read the input file.

    from googletrans import Translator
    from googletrans import LANGUAGES
    if input("Show Supported Languages and their code? (yes/no): ").lower()=="yes":
        print("Supported Languages:")
        for code, language in LANGUAGES.items():
            print(f"{code}: {language}")
    src_lang=input("enter source language code: ")
    dest_lang=input("enter destination language code: ")
    input_file=process_file_path(input_file)
    output_file=process_file_path(output_file)
    with open(input_file, 'r', encoding='utf-8') as file:
        text = file.read()

3. Translation

    translator = Translator(service_urls=['translate.google.com'])
    translated_text = translator.translate(text, src=src_lang, dest=dest_lang)

4. Export as text

    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(translated_text.text)
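
Because the language codes are typed in by the user, it can help to check them before calling the translator. The sketch below shows one way to do this. `SAMPLE_LANGUAGES` is a small stand-in for the `LANGUAGES` dictionary of codes and language names used above, and `normalise_language_code` is a hypothetical helper, not part of googletrans or the prototype.

```python
# Stand-in for the LANGUAGES dictionary (code -> language name);
# only a few entries are listed here for illustration.
SAMPLE_LANGUAGES = {"en": "english", "fr": "french", "de": "german", "hi": "hindi"}


def normalise_language_code(code, languages=SAMPLE_LANGUAGES):
    """Return a cleaned-up language code, or raise ValueError if it is unknown."""
    cleaned = code.strip().lower()
    if cleaned not in languages:
        raise ValueError(f"Unsupported language code: {code!r}")
    return cleaned


print(normalise_language_code(" FR "))  # fr
```

In the tool itself, the real `LANGUAGES` dictionary would be passed as the second argument.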

Command Line Enabling

For easy accessibility, we can integrate this code with the command line so that users can run it with simple commands. For this, we will need the argparse library.

Before we move on, we can define a few commands which can be used to invoke the respective functions.

URL to Audio - python opengenus_convertor.py --from=URL --to=audio
URL to Summarised Audio - python opengenus_convertor.py --from=URL --to="summarised audio"
URL to Text - python opengenus_convertor.py --from=URL --to=text
URL to Summarised Text - python opengenus_convertor.py --from=URL --to="summarised text"
Summarise Text - python opengenus_convertor.py --from=text --to="summarised text"
Translate Text - python opengenus_convertor.py --from=text --to="translated text"
Audio to Video - python opengenus_convertor.py --from=audio --to=video
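
An output type such as summarised audio contains a space, so it has to be quoted on the command line to reach the script as a single value. The snippet below uses Python's `shlex` module, which follows POSIX shell splitting rules (other shells may differ in detail), to show how the two spellings are split into arguments.

```python
import shlex

unquoted = 'python opengenus_convertor.py --from=URL --to=summarised audio'
quoted = 'python opengenus_convertor.py --from=URL --to="summarised audio"'

print(shlex.split(unquoted)[2:])  # ['--from=URL', '--to=summarised', 'audio']
print(shlex.split(quoted)[2:])    # ['--from=URL', '--to=summarised audio']
```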

We add the arguments --to and --from to specify the conversion type, and --input and --output to specify the input and output locations. All four are required, so each command above must also pass --input and --output. We then save them in variables and call the function corresponding to the requested conversion.

import argparse


parser = argparse.ArgumentParser(description='Data Conversion from command line')

parser.add_argument('-f')
parser.add_argument('--from', dest='input_type', required=True,
                    help='Specify the input type: "url","text" or "audio".  If converting from text or audio, ensure to write the full name of the output file. Else just mention the directory where the output should be located.')
parser.add_argument('--to', dest='output_type', required=True,
                    help='Specify the output type: "audio","summarised audio", "text","translated text", "summarised text", "video".')
parser.add_argument('--input', dest='input_file', required=True,
                    help='Specify the input data: URL file or text content')
parser.add_argument('--output', dest='output_file', required=True,
                    help='Specify the output file path for audio or summary')


args = parser.parse_args()


input_type = args.input_type.lower()
output_type = args.output_type.lower()
input_file = args.input_file
output_file = args.output_file


if input_type == 'url' and output_type == 'audio':
    url_to_raw_audio(input_file, output_file)
elif input_type == 'url' and output_type == 'summarised audio':
    url_to_summarised_audio(input_file, output_file)
elif input_type == 'url' and output_type == 'text':
    url_to_raw_text(input_file, output_file)
elif input_type == 'url' and output_type == 'summarised text':
    url_to_text_summary(input_file, output_file)
elif input_type == 'text' and output_type == 'audio':
    text_to_audio(input_file, output_file)
elif input_type == 'text' and output_type == 'summarised text':
    text_to_summary(input_file, output_file)
elif input_type == 'text' and output_type == 'translated text':
    translate_text_file(input_file, output_file)
elif input_type == 'audio' and output_type == 'video':
    convert_audio_to_video(input_file, output_file)
else:
    print('''Please enter proper arguments. Supported operations-\n1.url to raw_audio\n2.url to summarised_audio\n3.url to raw text\n4.url to summarised text\n5.text to audio\n6.text to summary\n7.audio to video\n8.Translate text''')
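
The chain of conditions can also be expressed as a lookup table that maps each pair of input and output types to a function. The following is a minimal sketch of that design, not the prototype's code: the two handlers are placeholders that stand in for the real conversion functions, and `CONVERSIONS` and `run` are hypothetical names.

```python
import argparse


# Placeholder handlers standing in for the real conversion functions
def url_to_raw_audio(input_file, output_file):
    return f"audio from {input_file} -> {output_file}"


def text_to_summary(input_file, output_file):
    return f"summary of {input_file} -> {output_file}"


# (input type, output type) -> handler
CONVERSIONS = {
    ("url", "audio"): url_to_raw_audio,
    ("text", "summarised text"): text_to_summary,
}


def run(argv):
    parser = argparse.ArgumentParser(description="Data conversion from the command line")
    parser.add_argument("--from", dest="input_type", required=True)
    parser.add_argument("--to", dest="output_type", required=True)
    parser.add_argument("--input", dest="input_file", required=True)
    parser.add_argument("--output", dest="output_file", required=True)
    args = parser.parse_args(argv)

    handler = CONVERSIONS.get((args.input_type.lower(), args.output_type.lower()))
    if handler is None:
        return "unsupported conversion"
    return handler(args.input_file, args.output_file)


print(run(["--from=URL", "--to=audio", "--input=urls.txt", "--output=out"]))
# audio from urls.txt -> out
```

Adding a new conversion then only requires a new entry in the table, and the list of supported operations can be generated from its keys.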

Running on the Command Line-

To summarise, we shall list the valid conversions, their commands, the parameters taken, as well as a few sample conversions.

1.URL to Raw Audio-

For availing this use case, run the function url_to_raw_audio.

Command Line Statement: python opengenus_convertor.py --from=URL --to=audio

Parameters: 2- input_urls_file and output_file_location, both address parameters. Output file is created from scratch, any existing file with same name is overwritten.
Sample Input - a file containing a list of URLs

Sample Output - converted audio files at the working directory

2.URL to Summarised Audio

For availing this use case, run the function url_to_summarised_audio.

Command Line Statement: python opengenus_convertor.py --from=URL --to="summarised audio"

Parameters: 2- input_urls_file and output_file_location, both address parameters. Output file is created from scratch, any existing file with same name is overwritten.
Sample Input - a file containing a list of URLs

Sample Output - converted and summarised audio files at the working directory

3.Url to Raw Text

For availing this use case, run the function url_to_raw_text.

Command Line Statement: python opengenus_convertor.py --from=URL --to=text

Parameters: 2- input_urls_file and output_file_location, both address parameters. Output file is created from scratch, any existing file with same name is overwritten.
Sample Input - a file containing a list of URLs

Sample Output - converted text files at the working directory

4.URL To Summarised Text

For availing this use case, run the function url_to_text_summary.

Command Line Statement: python opengenus_convertor.py --from=URL --to="summarised text"

Parameters: 2- input_urls_file and output_file_location, both address parameters. Output file is created from scratch, any existing file with same name is overwritten.
Sample Input - a file containing a list of URLs

Sample Output - converted and summarized text files at the working directory

5.Text to Summary

For availing this use case, run the function text_to_summary.

Parameters: 2- input_file and output_file_full_name. Here we need to specify the name of the output file as well.
Command Line Statement: python opengenus_convertor.py --from=text --to="summarised text"

Sample Input - a text file

Sample Output - summarised content at the working directory

6.Audio to Video

For availing this use case, run the function convert_audio_to_video.

Parameters: 2- input_file, output_file, both address parameters of the input mp3 and output mp4 files, respectively. Output file is created from scratch, any existing file with same name is overwritten.
Command Line Statement: python opengenus_convertor.py --from=audio --to=video

Sample Input - an audio file

Sample Output - video file at the working directory

7.Translate text

For availing this use case, run the function translate_text_file.

Parameters: 2- input_file, output_file, both address parameters of the input and translated files, respectively. Output file is created from scratch, any existing file with same name is overwritten.
Command Line Statement: python opengenus_convertor.py --from=text --to="translated text"

Sample Input - an input file in English

Expected Output - the text translated into French

If converting from text or audio, ensure to write the full name of the output file. Else just mention the directory where the output should be located

Link to the tool prototype - Github.

This concludes the documentation for the Article Convertor, Summarizer and Translator Tool Prototype.