
Updating the Transform

Hello everyone, it's been quite a long time since I updated this blog, so here is the work I have done over the past few weeks. In the last blog post I described how I created a new transform script using lxml and how I implemented it.

When the code was reviewed, Jamie (my mentor) pointed out a very important bug: I was matching regular expressions against the raw string rather than against the tags. To fix this, I now convert the whole input string into a tree, iterate over every node, and replace or remove the unwanted tags as required.

How to work with the tree and replace tags?

Basically, I take the whole document as a string and parse it with HTMLParser, which converts the string into a tree-like structure. The tree has a parent node and child nodes, and we iterate through the whole tree and manipulate the nodes (or, better, tags).

In the lxml tree structure the nodes are elements (or tags), so we can iterate over the tree, check each node, and manipulate it accordingly. We can also get the text inside a tag using the tag.text attribute (it is an attribute, not a method, and it holds the text up to the first child element).

So what I did here is first create a tree like this:
    from io import StringIO
    from lxml import etree

    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(html), parser)



This creates the tree. The tree variable is now an object (an lxml ElementTree), and printing it only shows the memory address where the tree is stored.
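To see what is actually inside that object, here is a minimal sketch. The sample html string is made up for illustration and is not the input used in the transform.

```python
from io import StringIO
from lxml import etree

# Hypothetical sample input, only for illustration
html = "<div><h3>Title</h3><p>Some text</p></div>"

parser = etree.HTMLParser()
tree = etree.parse(StringIO(html), parser)

# getroot() gives the top-most element of the tree
root = tree.getroot()

# The HTML parser wraps the input in <html> and <body> elements
tags = [element.tag for element in tree.iter()]

# .text is an attribute holding the text right after the opening tag
first_h3 = root.find(".//h3")
print(root.tag, tags, first_h3.text)
```

Note that the parser adds html and body elements around the input even though the string did not contain them.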

Now that we have a tree object, we need to iterate over it, which is fairly easy:
    for element in tree.iter():
        # Turn the smaller headings and divs into plain paragraphs
        if element.tag in ('h3', 'h4', 'h5', 'h6', 'div'):
            element.tag = 'p'
    # Remove these tags themselves but keep their text and children;
    # done after the loop so the tree is not restructured while iterating
    etree.strip_tags(tree, 'html', 'body', 'script')


This way we can iterate over the nodes and change the tags as needed.
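Putting the parsing and the tag replacement together, a small self-contained sketch could look like the following. The function name simplify, the constant TAGS_TO_P and the sample input are hypothetical names chosen for this example, not the ones used in the actual transform.

```python
from io import StringIO
from lxml import etree

# Hypothetical name: tags that get rewritten to <p>
TAGS_TO_P = ("h3", "h4", "h5", "h6", "div")


def simplify(html):
    """Rename selected tags to <p> and drop the <html>/<body> wrappers."""
    tree = etree.parse(StringIO(html), etree.HTMLParser())
    for element in tree.iter():
        # Comments have a non-string tag, so they never match here
        if element.tag in TAGS_TO_P:
            element.tag = "p"
    # strip_tags removes the tags but keeps their text and children
    etree.strip_tags(tree, "html", "body")
    return etree.tostring(tree.getroot(), method="html", encoding="unicode")


# Made-up sample input
output = simplify("<h3>Title</h3><div>Body <span>text</span></div>")
print(output)
```

The root element itself stays in place, because strip_tags only removes elements below the node it is given.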

Why the cleaner function?

After that we convert the whole tree into a string and pass it to the cleaner function, which cleans the HTML by removing nasty tags and keeping only valid tags, and returns the filtered HTML as a string again. The cleaner function takes a string, so we first convert the tree into a string like this:
    result = etree.tostring(tree.getroot(), pretty_print=True, method="html")

After that we pass the result to the cleaner function, where the string is cleaned (filtered) like this:

    from lxml.html import fragment_fromstring
    # In recent lxml releases Cleaner lives in the separate lxml_html_clean package
    from lxml.html.clean import Cleaner

    NASTY_TAGS = frozenset(['style', 'script', 'object', 'applet', 'meta', 'embed'])
    cleaner = Cleaner(kill_tags=NASTY_TAGS, page_structure=False, safe_attrs_only=False)

    safe_html = fragment_fromstring(cleaner.clean_html(result))

Here we also turn the cleaned string into a fragment.

Why fragment the clean HTML string?
We fragment the string so that we can remove the extra parent tag that is usually added when a string is converted into a tree; it gets appended and produces false results. So we create a fragment from the single string and then convert it back into a string. It seems quite silly to create a fragment only to convert it back to a string, but this is the way I found to remove the extra tags.

The string we obtain after this step is the final output of the transform, and it seems all the test cases are passing.



Yayaya!! It's always good to see all the test cases passing. Hopefully you liked reading this. Next time I will describe more about the testing part of the transform.



Cheers,
  




