Hello everyone, it's been quite a long time since I updated the blog, so here, finally, is the work I have done over the past few weeks. In the last blog post I described how I created a new transform script using lxml and how I implemented it.
After the code was reviewed, Jamie (my mentor) pointed out a very important bug: I was matching regular expressions against the raw string and not against the tags. To fix this, I now convert the whole input string into a tree, iterate over every node, and replace or remove the unwanted tags as required.
How to work with the tree and replace tags?
Basically, I take the whole document as a string and parse it with HTMLParser, which converts the string into a tree-like structure. The tree has a parent node and child nodes, and we iterate over the whole tree and manipulate the nodes (or, better, the tags).
In the lxml tree structure the nodes are elements (tags), so we can iterate over the tree, check each node, and manipulate it accordingly. We can also get the text directly inside a tag (up to its first child) using the tag.text attribute.
First I created a tree like this:
from io import StringIO
from lxml import etree

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)
This creates the tree. The tree variable is now an ElementTree object, and printing it only shows the object's type and the memory address where the tree is stored.
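To make this concrete, here is a minimal sketch of what the tree object looks like and how tags and their text can be read. The input string and the name `sample_html` are made up for illustration and are not part of the transform script:

```python
from io import StringIO
from lxml import etree

# Hypothetical input, only for illustration.
sample_html = "<p>Hello <b>world</b></p>"

parser = etree.HTMLParser()
tree = etree.parse(StringIO(sample_html), parser)

# Printing the tree only shows the object's type and memory address.
tree_repr = repr(tree)
print(tree_repr)

# The HTML parser wraps the snippet in <html> and <body> elements.
root = tree.getroot()
tags = [element.tag for element in root.iter()]
print(tags)

# .text holds the text directly inside a tag, up to its first child.
paragraph = root.find(".//p")
print(repr(paragraph.text))
print(repr(paragraph.find("b").text))
```

Note that the extra `html` and `body` entries in `tags` were never in the input; the parser added them.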
Now that we have a tree object, we need to iterate over it, which is fairly easy:
# Take a snapshot of the elements first, because the loop changes the tree.
for element in list(tree.iter()):
    # Turn minor headings and divs into plain paragraphs.
    if element.tag in ('h3', 'h4', 'h5', 'h6', 'div'):
        element.tag = 'p'
    # Unwrap these tags but keep their text and children.
    if element.tag in ('html', 'body', 'script'):
        etree.strip_tags(tree, element.tag)
This way we can iterate over the nodes and play with the tags.
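The renaming step can be tried on its own. The following is only a sketch under my own assumptions: `demote_headings` is a hypothetical helper name, the sample input is invented, and the stripping of `html`, `body` and `script` is left out to keep it short:

```python
from io import StringIO
from lxml import etree

def demote_headings(markup):
    """Rename h3-h6 and div elements to p and return the HTML as a string."""
    tree = etree.parse(StringIO(markup), etree.HTMLParser())
    # Snapshot the elements so that renaming cannot disturb the iteration.
    for element in list(tree.iter()):
        if element.tag in ("h3", "h4", "h5", "h6", "div"):
            element.tag = "p"
    return etree.tostring(tree.getroot(), method="html", encoding="unicode")

# Hypothetical input, only for illustration.
result = demote_headings("<div>intro</div><h3>Title</h3><p>body</p>")
print(result)
```

Because the comparison is done on `element.tag` and not on the raw string, text that merely looks like a tag is never touched.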
Why the cleaner function?
Next we convert the whole tree into a string and pass it to the cleaner function. The cleaner cleans the HTML by removing the nasty tags and keeping only the valid ones, and it returns the filtered HTML as a string again. Since we give the cleaner a string, we first convert the tree into a string like this:
result = etree.tostring(tree.getroot(), pretty_print=True, method="html")
Then we pass the result to the cleaner, where the string is cleaned (filtered) like this:
from lxml.html import fragment_fromstring
from lxml.html.clean import Cleaner

NASTY_TAGS = frozenset(['style', 'script', 'object', 'applet', 'meta', 'embed'])
cleaner = Cleaner(kill_tags=NASTY_TAGS, page_structure=False, safe_attrs_only=False)
safe_html = fragment_fromstring(cleaner.clean_html(result))
Here we have also turned the cleaned string into a fragment.
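To show what "killing" a tag means, here is a rough sketch that removes the nasty tags together with their content using only lxml.html, without the Cleaner class. This is a swapped-in illustration and not what the transform does internally: `kill_nasty_tags` is a hypothetical helper, the input is invented, and the sketch does none of the other filtering a real cleaner performs. It avoids the cleaner import on purpose, since recent lxml releases, as far as I know, ship the cleaner as a separate lxml_html_clean package:

```python
from lxml import html

NASTY_TAGS = frozenset(["style", "script", "object", "applet", "meta", "embed"])

def kill_nasty_tags(markup):
    """Drop every nasty element and its content, keeping the text around it."""
    # create_parent wraps the snippet in a <div> so it has a single root.
    root = html.fragment_fromstring(markup, create_parent="div")
    # Snapshot the matches first, because drop_tree() changes the tree.
    for element in list(root.iter(*NASTY_TAGS)):
        # drop_tree() removes the element and its children but keeps its tail text.
        element.drop_tree()
    return html.tostring(root, encoding="unicode")

# Hypothetical input, only for illustration.
dirty = "<p>keep</p><div>before<script>alert(1)</script>after</div>"
cleaned = kill_nasty_tags(dirty)
print(cleaned)
```

The difference from strip_tags is that here the content of the element goes away too, which is what we want for something like a script.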
Why fragment the cleaned HTML string?
We fragment the string to get rid of the extra parent tag that is usually added when a string is converted into a tree. That tag stays in the output and produces wrong results. So we parse the cleaned string as a fragment and then convert it back into a string. It seems a bit silly to create a fragment only to convert it back into a string, but this is the way I found to remove the extra tags.
The string we obtain at the end is the final output of the transform, and it seems all the test cases are passing.
Yayaya!! It's always good to see all the test cases passing. I hope you liked reading this. Next time I will describe the testing part of the transform in more detail.
Cheers,