Tool to convert Article to summary and audio with translation feature
This article at OpenGenus serves as an introduction to a prototype of the Article Convertor, Summarizer and Translator Tool developed by Ambarish Deb for OpenGenus IQ as part of an internship program. We shall go through the entire scope of the product, starting from the need behind it, its initialization, and then each of its use cases and their implementation.
Role Fulfilled by this Tool
Nowadays, there is no dearth of quality articles available online on almost any subject one may wish to learn about. However, with a hectic daily schedule, one may not find enough time every day to dedicate to reading, and may instead be more comfortable having a voice narrate the article text so that the content can be consumed on the go. Reading is also much more difficult for people with vision problems, who can reap the benefits of this product as well.
This product also brings the ability to translate content written in other languages, thereby extending the user's reach to articles and content in foreign languages.
Initialization
Let's start with the basics. To be able to run this product, you should be able to write and run Python code on your system. Optionally, you can also install Jupyter Notebook for a smoother experience while running Python code.
Before we move on to the product, there are certain requirements for this project which need to be installed. You may refer here for a tutorial on systemwide installation, and do a localised installation for Jupyter Notebook if you choose to use the latter.
Once the requirements are installed, we can proceed further.
Use Cases
We shall look at the implementation of each use case of the product. After that, we shall make these implementations accessible to a layperson through a few command line commands, providing a no-code way to use this product.
Convert URL contents to raw audio-
Firstly, since a user may enter either absolute or relative addresses for the input and output files, we need to ensure that the system is able to parse both types of addresses. For this, we will define a function to process file paths. If the address is found to be relative, we prompt the user to enter the full address up to the working directory; otherwise we move forward with the provided address.
import os

def process_file_path(file_path):
    # Convert the file path to an absolute path if it is relative
    if not os.path.isabs(file_path):
        linkingpath = input("Enter full path till BEFORE required location/file separated by double forward slashes(//): ")
        os.chdir(linkingpath)
        current_dir = os.getcwd()  # Get the current working directory
        full_file_path = os.path.join(current_dir, file_path)  # Create the absolute path
    else:
        full_file_path = file_path
    return full_file_path
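If prompting the user is not desirable (for example, when the script runs unattended), the same normalisation can be done without input(). The following is only a minimal sketch and not the tool's own code; resolve_path and its base_dir parameter are hypothetical names introduced here for illustration.

```python
import os

def resolve_path(file_path, base_dir=None):
    """Return an absolute, normalised version of file_path.

    base_dir is a hypothetical parameter: the directory that relative
    paths are resolved against (defaults to the current working directory).
    """
    if os.path.isabs(file_path):
        return os.path.normpath(file_path)
    if base_dir is None:
        base_dir = os.getcwd()
    # Join first, then normalise so that "a/../b" style segments collapse
    return os.path.normpath(os.path.join(os.path.abspath(base_dir), file_path))
```

Unlike process_file_path, this sketch never changes the working directory, so later relative paths are not affected by earlier calls.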
For this use case, the flow of tasks shall be-
1. Defining the function and parameters
We define the identifier and the input parameters for this function.
def url_to_raw_audio(input_urls_file,output_file_location):
2. Importing the required libraries
    from gtts import gTTS
    import requests
    from bs4 import BeautifulSoup
    import re
    import os
3. Reading the links
We process the addresses provided as parameters, and then open the input file. We then separate out the lines provided in the input file, one URL per line.
    input_urls_file=process_file_path(input_urls_file)
    output_file_location=process_file_path(output_file_location)
    with open(input_urls_file,'r') as file_in:
        urls = []
        for line in file_in:
            urls.append(line.strip())
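A slightly more defensive version of this reading step can be sketched as follows. It is an illustration under assumptions, not the tool's code: read_urls is a hypothetical helper name, and treating lines that start with '#' as comments is a convention chosen here.

```python
def read_urls(path):
    """Return the non-empty, non-comment lines of a text file as a list of URLs."""
    urls = []
    with open(path, "r", encoding="utf-8") as file_in:
        for line in file_in:
            line = line.strip()
            # Skip blank lines and lines starting with '#' (treating '#' as a
            # comment marker is an assumption of this sketch)
            if line and not line.startswith("#"):
                urls.append(line)
    return urls
```

Checking whether the returned list is empty right after reading lets the function stop before any HTTP request is made.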
4. Web Scraping
We take each URL, and then scrape the contents of that page using BS4.
    i=1
    # Stop early if the input file contained no URLs
    if len(urls)==0:
        print("No URLS in input.txt")
        return
    # Make an HTTP request to each URL
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
5. Filtering out unnecessary content
The page contents include a lot of unnecessary material, and we're only interested in the text content of the articles. Hence we filter the rest out.
        # Note: soup(list) matches by tag name, so the last two entries
        # (which are CSS class names) do not match any element
        unnecessary_tags=['script','a', 'style', 'table', 'iframe', 'aside','pre','ins','li','ul','google-auto-placed ap_container','L-Affiliate-Tagged']
        for elem in soup(unnecessary_tags):
            elem.extract()
        text = soup.get_text()
        text = ''.join(text).strip()
        topic_name = soup.find('h1',class_='post-full-title')
6. Structuring
We now provide an opening and closing statement to the text.
        intro_text = f'Today we will learn about {topic_name.text}'
        outro_text = f'\nThis was all about {topic_name.text} . For code files and additional information, please visit www.iq.opengenus.org/'
        full_text = intro_text+'\n'+text+'\n'+outro_text
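Since the same opening and closing statements are reused by several use cases, the structuring step could be pulled into one helper. This is a minimal sketch under assumptions: build_narration and its site parameter are hypothetical names, and the title is passed in as a plain string rather than a BeautifulSoup element.

```python
def build_narration(title, body, site="www.iq.opengenus.org/"):
    """Wrap the article body in an opening and a closing statement."""
    # Collapse newlines and repeated spaces that often surround <h1> text
    title = " ".join(title.split())
    intro = f"Today we will learn about {title}."
    outro = (f"This was all about {title}. For code files and additional "
             f"information, please visit {site}")
    return "\n".join([intro, body.strip(), outro])
```

Note that soup.find returns None when no matching h1 exists, in which case topic_name.text raises an AttributeError; passing a fallback title string to a helper like this is one way to avoid that.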
7. Conversion to Text and then to Audio
We now write the full text to a .txt file, and then use the gTTS library to convert the text to speech. gTTS takes the parameters text (the text to be spoken), lang (the language of the text) and slow (whether the audio should be read slowly). We repeat the procedure until all the links have been converted to audio.
        with open(topic_name.get_text()+'.txt', 'a') as f:
            # Write character by character so that one character which cannot
            # be encoded does not abort the whole file
            for word in full_text:
                try:
                    f.write(word)
                except Exception:
                    f.write('error'+"\n")
        name=topic_name.get_text()+" raw.mp3"
        newpath2=output_file_location.split("//")
        newpath=output_file_location.split("//")
        newpath.append(name)
        output_file_location="\\".join(newpath)
        language = 'en'
        speech = gTTS(text=full_text, lang=language, slow=False)
        speech.save(output_file_location)
        # Optional playback via mpg321; note that this plays hello.mp3,
        # not the file saved above
        os.system("mpg321 hello.mp3")
        output_file_location="//".join(newpath2)
        print(f"Converted URL {i} to raw audio")
        i+=1
    print("Converted all URLs to raw audio")
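The code above builds the output path by splitting on "//" and joining on "\\", which ties it to Windows-style paths and requires restoring output_file_location after every file. A platform-independent alternative can be sketched with os.path.join. This is an illustration only: audio_output_path is a hypothetical helper, and the set of characters it replaces is an assumption about what is unsafe in file names.

```python
import os
import re

def audio_output_path(output_dir, title, suffix=" raw.mp3"):
    """Build the path of the audio file for an article title."""
    # Replace characters that are not allowed in file names on common
    # platforms (the exact set chosen here is an assumption of this sketch)
    safe_title = re.sub(r'[\\/:*?"<>|]+', "_", title).strip()
    return os.path.join(output_dir, safe_title + suffix)
```

The result could be passed directly to speech.save(...), leaving output_file_location itself unchanged between iterations.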
Convert URL contents to summarised audio-
For this use case, the flow of tasks shall be similar to the previous case, except in one place where we summarise the text. The flow shall be-
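1. Defining the function and parameters
def url_to_summarised_audio(input_urls_file,output_file_location):
2. Importing the required libraries
    from gtts import gTTS
    import requests
    from bs4 import BeautifulSoup
    import re
    import os
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.text_rank import TextRankSummarizer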
3. Reading the links
    input_urls_file=process_file_path(input_urls_file)
    output_file_location=process_file_path(output_file_location)
    with open(input_urls_file,'r') as file_in:
        urls = []
        for line in file_in:
            urls.append(line.strip())
4. Web Scraping
    i=1
    # Stop early if the input file contained no URLs
    if len(urls)==0:
        print("No URLS in input.txt")
        return
    # Make an HTTP request to each URL
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
5. Filtering out unnecessary content
        unnecessary_tags=['script','a', 'style', 'table', 'iframe', 'aside','pre','ins','li','ul','google-auto-placed ap_container','L-Affiliate-Tagged']
        for elem in soup(unnecessary_tags):
            elem.extract()
        text = soup.get_text()
        text = ''.join(text).strip()
        topic_name = soup.find('h1',class_='post-full-title')
6. Summarization
We use the sumy library (PlaintextParser, Tokenizer and TextRankSummarizer) to summarize the text. The second argument passed to the summarizer specifies the number of sentences the text is to be reduced to. E.g. if it is 10, the whole text will be reduced to 10 sentences.
        parser = PlaintextParser.from_string(text, Tokenizer("english"))
        summarizer = TextRankSummarizer()
        summary = summarizer(parser.document, 10)
        text_summary = ""
        for sentence in summary:
            # Add a space so that consecutive sentences do not run together
            text_summary += str(sentence) + " "
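To make the idea of "reducing a text to N sentences" concrete without installing sumy, here is a small dependency-free sketch. It is not TextRank and not what the tool uses: it swaps in a much simpler word-frequency heuristic, and the function name summarise, the sentence-splitting regex and the length-based stop-word filter are all assumptions made for illustration.

```python
import re
from collections import Counter

def summarise(text, sentence_count=10):
    """Pick the sentence_count highest-scoring sentences, in original order."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    # Crude stop-word filter: ignore words of three letters or fewer
    freq = Counter(w for w in words if len(w) > 3)

    def score(sentence):
        return sum(freq[w] for w in re.findall(r"[a-z']+", sentence.lower()))

    # Rank sentence indices by score, keep the best ones, then restore order
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:sentence_count])
    return " ".join(sentences[i] for i in keep)
```

As with the sumy call, a text that has fewer sentences than the requested count is returned in full.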
7. Structuring
        intro_text = f'Today we will learn about {topic_name.text} '
        outro_text=f'\nThis was all about {topic_name.text} . For code files and additional information, please visit www.iq.opengenus.org/'
        full_text = intro_text+'\n'+text_summary+'\n'+outro_text
8. Conversion to Text and then to Audio
        name=topic_name.get_text()+" summarised.mp3"
        newpath2=output_file_location.split("//")
        newpath=output_file_location.split("//")
        newpath.append(name)
        output_file_location="\\".join(newpath)
        language = 'en'
        speech = gTTS(text=text_summary, lang=language, slow=False)
        speech.save(output_file_location)
        os.system("mpg321 hello.mp3")
        output_file_location="//".join(newpath2)
        print(f"Converted URL {i} to summarised audio")
        i+=1
    print("Converted all URLs to summarised audio")
The following use cases are just partial implementations of the two specified above.
Convert URL contents to raw text-
For this use case, the flow of tasks shall be-
1. Defining the function and parameters
def url_to_raw_text(input_urls_file,output_file_location):
2. Importing the required libraries
    from gtts import gTTS
    import requests
    from bs4 import BeautifulSoup
    import re
    import os
3. Reading the links
    input_urls_file=process_file_path(input_urls_file)
    output_file_location=process_file_path(output_file_location)
    with open(input_urls_file,'r') as file_in:
        urls = []
        for line in file_in:
            urls.append(line.strip())
4. Web Scraping
    i=1
    # Stop early if the input file contained no URLs
    if len(urls)==0:
        print("No URLS in input.txt")
        return
    # Make an HTTP request to each URL
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
5. Filtering out unnecessary content
        unnecessary_tags=['script','a', 'style', 'table', 'iframe', 'aside','pre','ins','li','ul','google-auto-placed ap_container','L-Affiliate-Tagged']
        for elem in soup(unnecessary_tags):
            elem.extract()
        text = soup.get_text()
        text = ''.join(text).strip()
        topic_name = soup.find('h1',class_='post-full-title')
6. Structuring
        intro_text = f'Today we will learn about {topic_name.text} '
        outro_text=f'\nThis was all about {topic_name.text} . For code files and additional information, please visit www.iq.opengenus.org/'
        full_text = intro_text+'\n'+text+'\n'+outro_text
7. Conversion to Text
        with open(topic_name.get_text()+'.txt', 'a') as f:
            for word in full_text:
                try:
                    f.write(word)
                except Exception:
                    f.write('error'+"\n")
        i+=1
    print("Converted all URLs to raw text")
#running of the function
input_urls_file="ENTER ADDRESS INCLUDING FILENAME SEPARATED BY DOUBLE SLASHES(\\)"
output_file_location="ENTER ADDRESS TO THE WORKING DIRECTORY SEPARATED BY DOUBLE SLASHES(\\)"
url_to_raw_text(input_urls_file,output_file_location)
Convert URL contents to summarised text-
For this use case, the flow of tasks shall be-
1. Defining the function and parameters
def url_to_text_summary(input_urls_file,output_file_location):
2. Importing the required libraries
    import requests
    from bs4 import BeautifulSoup
    import re
    import os
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.text_rank import TextRankSummarizer
3. Reading the links
    input_urls_file=process_file_path(input_urls_file)
    output_file_location=process_file_path(output_file_location)
    with open(input_urls_file,'r') as file_in:
        urls = []
        for line in file_in:
            urls.append(line.strip())
4. Web Scraping
    i=1
    # Stop early if the input file contained no URLs
    if len(urls)==0:
        print("No URLS in input.txt")
        return
    # Make an HTTP request to each URL
    for url in urls:
        response = requests.get(url)
        soup = BeautifulSoup(response.content, 'lxml')
5. Filtering out unnecessary content
        unnecessary_tags=['script','a', 'style', 'table', 'iframe', 'aside','pre','ins','li','ul','google-auto-placed ap_container','L-Affiliate-Tagged']
        for elem in soup(unnecessary_tags):
            elem.extract()
        text = soup.get_text()
        text = ''.join(text).strip()
        topic_name = soup.find('h1',class_='post-full-title')
6. Summarization
        parser = PlaintextParser.from_string(text, Tokenizer("english"))
        summarizer = TextRankSummarizer()
        summary = summarizer(parser.document, 10)
        text_summary = ""
        for sentence in summary:
            # Add a space so that consecutive sentences do not run together
            text_summary += str(sentence) + " "
7. Structuring
        intro_text = f'Today we will learn about {topic_name.text}'
        outro_text=f'\nThis was all about {topic_name.text} . For code files and additional information, please visit www.iq.opengenus.org/'
        full_text = intro_text+'\n'+text_summary+'\n'+outro_text
8. Conversion to Text
        with open(topic_name.get_text()+'.txt', 'a') as f:
            for word in text_summary:
                try:
                    f.write(word)
                except Exception:
                    f.write('error'+"\n")
    print("Converted all URLs to summarised text")
#running of the function
input_urls_file="ENTER ADDRESS INCLUDING FILENAME SEPARATED BY DOUBLE SLASHES(\\)"
output_file_location="ENTER ADDRESS TO THE WORKING DIRECTORY SEPARATED BY DOUBLE SLASHES(\\)"
url_to_text_summary(input_urls_file,output_file_location)
Summarise the provided text-
For this use case, the flow of tasks shall be-
1. Define Function Name & Parameters
def text_to_summary(input_file,output_file_full_name):
2. Process Addresses and Read File
    from sumy.parsers.plaintext import PlaintextParser
    from sumy.nlp.tokenizers import Tokenizer
    from sumy.summarizers.text_rank import TextRankSummarizer
    input_file=process_file_path(input_file)
    output_file_full_name=process_file_path(output_file_full_name)
    with open(input_file, 'r', encoding="utf-8") as file:
        text = file.read()
3. Summarization
    output_file = output_file_full_name.split("//")[-1]
    parser = PlaintextParser.from_string(text, Tokenizer("english"))
    summarizer = TextRankSummarizer()
    summary = summarizer(parser.document, 10)
    text_summary = ""
    for sentence in summary:
        # Add a space so that consecutive sentences do not run together
        text_summary += str(sentence) + " "
4. Export as text
    with open(output_file_full_name, "w", encoding="utf-8") as text_file:
        text_file.write(text_summary)
Convert Audio to Video-
This use case is a bit different as it warrants the conversion of audio to video. The flow of tasks shall be-
1. Define function name and parameters
def convert_audio_to_video(input_file, output_file):
2. Process addresses and get audio from link
    import cv2
    import numpy as np
    from moviepy.editor import AudioFileClip  # moviepy 1.x import path
    input_file=process_file_path(input_file)
    output_file=process_file_path(output_file)
    audio = AudioFileClip(input_file)
    name = input_file.split("//")[-1][:-4]  # file name without its 4-character extension
3. Get the duration and frame rate of the audio
    duration = audio.duration
    fps = audio.fps
4. Instantiate a blank video clip with the same duration as the audio
    width, height = 640, 480
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(output_file, fourcc, fps, (width, height))
5. For each frame of the video-
    for t in np.arange(0, duration, 1/fps):
6. Create a blank frame
        frame = np.zeros((height, width, 3), dtype=np.uint8)
7. Add the text to the frame using OpenCV
        font = cv2.FONT_HERSHEY_SIMPLEX
        text_position = (int(width / 2 - len(name) * 5), int(height / 2))
        cv2.putText(frame, name, text_position, font, 1, (255, 255, 255), 2, cv2.LINE_AA)
8. Write the frame to the video
        writer.write(frame)
9. When all frames are done, release the video to the working directory.
    writer.release()
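The loop above writes one frame every 1/fps seconds, with fps taken from the audio clip. In moviepy, the fps of an audio clip refers to the audio sampling rate (commonly 44100 Hz) rather than a video frame rate; if that holds for the installed version, the number of frames grows very quickly. The arithmetic can be sketched as follows, where frame_count and the default of 24 frames per second are assumptions introduced for illustration.

```python
import math

def frame_count(duration_seconds, video_fps=24):
    """Number of frames needed to cover duration_seconds at video_fps."""
    return math.ceil(duration_seconds * video_fps)
```

For a three-minute narration, frame_count(180) gives 4320 frames at 24 frames per second, whereas frame_count(180, 44100) gives 7938000, so choosing a separate, conventional video frame rate for the writer is worth considering.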
Translate the provided text-
For this use case, the flow of tasks shall be-
1. Define Function Name & Parameters
def translate_text_file(input_file, output_file):
2. Provide options for translation and read the file
We'll use the googletrans library, which talks to Google Translate, for this code. To make it easier for users, we include an option to list the supported language codes. Then, we read the input file.
    from googletrans import Translator
    from googletrans import LANGUAGES
    if input("Show Supported Languages and their code? (yes/no): ").lower()=="yes":
        print("Supported Languages:")
        for code, language in LANGUAGES.items():
            print(f"{code}: {language}")
    src_lang=input("enter source language code: ")
    dest_lang=input("enter destination language code: ")
    input_file=process_file_path(input_file)
    output_file=process_file_path(output_file)
    with open(input_file, 'r', encoding='utf-8') as file:
        text = file.read()
3. Translation
    translator = Translator(service_urls=['translate.google.com'])
    translated_text = translator.translate(text, src=src_lang, dest=dest_lang)
4. Export as text
    with open(output_file, 'w', encoding='utf-8') as file:
        file.write(translated_text.text)
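Long articles may exceed what a translation service accepts in a single request. One way to handle this, sketched below under assumptions, is to split the text into chunks on line boundaries and translate each chunk separately. split_into_chunks is a hypothetical helper, and the default of 5000 characters is a placeholder rather than a documented limit.

```python
def split_into_chunks(text, max_chars=5000):
    """Group lines of text into chunks of at most max_chars characters.

    A single line longer than max_chars is kept as its own oversized chunk.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n"):
        candidate = paragraph if not current else current + "\n" + paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # The current chunk is full: store it and start a new one
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to translator.translate(chunk, src=src_lang, dest=dest_lang) as in the step above, and the .text of each result joined with newlines before writing the output file.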
Command Line Enabling
For easy accessibility, we can integrate this code with the command line so that users can run the functions with simple commands. For this, we will need the argparse library.
Before we move on, we can define a few commands which can be used to invoke the respective functions.
URL to Audio - python opengenus_convertor.py --from=URL --to=audio
URL to Summarised Audio - python opengenus_convertor.py --from=URL --to="summarised audio"
URL to Text - python opengenus_convertor.py --from=URL --to=text
URL to Summarised Text - python opengenus_convertor.py --from=URL --to="summarised text"
Summarise Text - python opengenus_convertor.py --from=text --to="summarised text"
Audio to Video - python opengenus_convertor.py --from=audio --to=video
We add the arguments --from and --to to specify the conversion types, and --input and --output to specify the input and output locations. Output types containing a space are wrapped in quotes so that the shell passes them as a single value. We then save the arguments in variables and call the function whose corresponding command was typed.
import argparse
parser = argparse.ArgumentParser(description='Data Conversion from command line')
# Extra -f option (for example, Jupyter passes -f to the kernel)
parser.add_argument('-f')
parser.add_argument('--from', dest='input_type', required=True,
                    help='Specify the input type: "url","text" or "audio". If converting from text or audio, ensure to write the full name of the output file. Else just mention the directory where the output should be located.')
parser.add_argument('--to', dest='output_type', required=True,
                    help='Specify the output type: "audio","summarised audio", "text","translated text", "summarised text", "video".')
parser.add_argument('--input', dest='input_file', required=True,
                    help='Specify the input data: URL file or text content')
parser.add_argument('--output', dest='output_file', required=True,
                    help='Specify the output file path for audio or summary')
args = parser.parse_args()
input_type = args.input_type.lower()
output_type = args.output_type.lower()
input_file = args.input_file
output_file = args.output_file
if input_type == 'url' and output_type == 'audio':
    url_to_raw_audio(input_file, output_file)
elif input_type == 'url' and output_type == 'summarised audio':
    url_to_summarised_audio(input_file, output_file)
elif input_type == 'url' and output_type == 'text':
    url_to_raw_text(input_file, output_file)
elif input_type == 'url' and output_type == 'summarised text':
    url_to_text_summary(input_file, output_file)
elif input_type == 'text' and output_type == 'audio':
    text_to_audio(input_file, output_file)  # not covered in this article
elif input_type == 'text' and output_type == 'summarised text':
    text_to_summary(input_file, output_file)
elif input_type == 'text' and output_type == 'translated text':
    translate_text_file(input_file, output_file)
elif input_type == 'audio' and output_type == 'video':
    convert_audio_to_video(input_file, output_file)
else:
    print('''please enter proper arguments. supported operations-\n1.url to raw_audio\n2.url to summarised_audio\n3.url to raw text\n4.url to summarised text\n5.text to audio\n6.text to summary\n7.audio to video\n8.Translate text''')
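As the number of conversions grows, the chain of if/elif branches can be replaced by a lookup table keyed on the (input type, output type) pair. The following is a minimal sketch of that design and not the tool's actual code: build_parser, dispatch and the handlers dictionary are hypothetical names, and the argument list is passed in explicitly so that the example can be run without a real command line.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Data conversion from the command line")
    parser.add_argument("--from", dest="input_type", required=True)
    parser.add_argument("--to", dest="output_type", required=True)
    parser.add_argument("--input", dest="input_file", required=True)
    parser.add_argument("--output", dest="output_file", required=True)
    return parser

def dispatch(argv, handlers):
    """Parse argv and call the handler registered for (input type, output type)."""
    args = build_parser().parse_args(argv)
    key = (args.input_type.lower(), args.output_type.lower())
    handler = handlers.get(key)
    if handler is None:
        supported = ", ".join(f"{src} to {dst}" for src, dst in sorted(handlers))
        raise SystemExit(f"Unsupported conversion. Supported: {supported}")
    return handler(args.input_file, args.output_file)
```

In the real script, handlers would map pairs such as ("url", "audio") to url_to_raw_audio, and argv would be sys.argv[1:]. Adding a new conversion then only requires a new dictionary entry.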
Running on the Command Line-
To summarise, we list the valid conversions, their commands, the parameters they take, and a sample input and output for each.
1.URL to Raw Audio-
For availing this use case, run the function url_to_raw_audio.
Command Line Statement: python opengenus_convertor.py --from=URL --to=audio
Parameters: 2- input_urls_file and output_file_location, both address parameters. Output file is created from scratch, any existing file with same name is overwritten.
Sample Input - a file containing a list of URLs
Sample Output - converted audio files at the working directory
2.URL to Summarised Audio
For availing this use case, run the function url_to_summarised_audio.
Command Line Statement: python opengenus_convertor.py --from=URL --to="summarised audio"
Parameters: 2- input_urls_file and output_file_location, both address parameters. Output file is created from scratch, any existing file with same name is overwritten.
Sample Input - a file containing a list of URLs
Sample Output - converted and summarised audio files at the working directory
3.URL to Raw Text
For availing this use case, run the function url_to_raw_text.
Command Line Statement: python opengenus_convertor.py --from=URL --to=text
Parameters: 2- input_urls_file and output_file_location, both address parameters. Output file is created from scratch, any existing file with same name is overwritten.
Sample Input - a file containing a list of URLs
Sample Output - converted text files at the working directory
4.URL To Summarised Text
For availing this use case, run the function url_to_text_summary.
Command Line Statement: python opengenus_convertor.py --from=URL --to="summarised text"
Parameters: 2- input_urls_file and output_file_location, both address parameters. Output file is created from scratch, any existing file with same name is overwritten.
Sample Input - a file containing a list of URLs
Sample Output - converted and summarised text files at the working directory
5.Text to Summary
For availing this use case, run the function text_to_summary.
Parameters: 2- input_file and output_file_full_name. Here we need to specify the name of the output file as well.
Command Line Statement: python opengenus_convertor.py --from=text --to="summarised text"
Sample Input - a text file
Sample Output - summarised content at the working directory
6.Audio to Video
For availing this use case, run the function convert_audio_to_video.
Parameters: 2- input_file, output_file, both address parameters of the input mp3 and output mp4 files, respectively. Output file is created from scratch, any existing file with same name is overwritten.
Command Line Statement: python opengenus_convertor.py --from=audio --to=video
Sample Input - an audio file
Sample Output - video file at the working directory
7.Translate text
For availing this use case, run the function translate_text_file.
Parameters: 2- input_file, output_file, both address parameters of the input and translated files, respectively. Output file is created from scratch, any existing file with same name is overwritten.
Command Line Statement: python opengenus_convertor.py --from=text --to="translated text"
Sample Input - an input file in English
Sample Output - the text translated into French
If converting from text or audio, ensure to write the full name of the output file. Else just mention the directory where the output should be located
Link to the tool prototype - Github.
This concludes the documentation for the Article Convertor, Summarizer and Translator Tool Prototype.