
Updating the Transform

Hello everyone, it's been quite a long time since I last updated the blog, so here is the work I have done over the past few weeks. In the last post I described how I created a new transform script using lxml and how I implemented it.

When the code was reviewed, Jamie (my mentor) pointed out an important bug: I was matching regular expressions against the raw strings rather than against the tags. To fix this, I now convert the whole input string into a tree, iterate through every node, and replace or remove the unwanted tags as required.

How to work with the tree and replace tags?

Basically, I take the whole document as a string and parse it with HTMLParser, which converts the string into a tree structure. The tree has a root (parent) node with child nodes below it, and we iterate through the whole tree and manipulate the nodes (or, better, tags).

In the lxml tree the nodes are elements (tags), so we can iterate over the tree, check each node, and manipulate it accordingly. We can also get the text inside a tag through its tag.text attribute (this is the text before the tag's first child element).

So I first created the tree like this:
    from io import StringIO
    from lxml import etree
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(html), parser)



This creates the tree. The tree variable is now an _ElementTree object, so printing it only shows the object and the memory address where it is stored, not the markup.
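To make that a bit more concrete, here is a minimal sketch of parsing a small snippet and looking at what the tree contains. The sample markup and the variable names pairs and markup are made up for illustration; they are not from the transform itself.

```python
from io import StringIO
from lxml import etree

# Made-up sample input, only for illustration.
html = "<div><h3>Title</h3><p>Some <b>bold</b> text</p></div>"
tree = etree.parse(StringIO(html), etree.HTMLParser())

# Printing the tree only shows the object, not the markup.
print(tree)

# Walk every element in document order and read its tag and leading text.
pairs = [(el.tag, el.text) for el in tree.iter()]
for tag, text in pairs:
    print(tag, repr(text))

# Serialize the root element to see the markup itself.
markup = etree.tostring(tree.getroot(), method="html", encoding="unicode")
print(markup)
```

Notice that the parser wraps the snippet in html and body elements, which is why those tags show up first in the loop.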

Now that we have a tree object, we need to iterate over it, which is fairly easy:
    for element in tree.iter():
        # Turn minor headings and divs into plain paragraphs.
        if element.tag in ('h3', 'h4', 'h5', 'h6', 'div'):
            element.tag = 'p'
    # Strip after the loop so the tree is not modified while iterating.
    # strip_tags drops the tag itself but keeps its text and children.
    etree.strip_tags(tree, 'html', 'body', 'script')


This way we can iterate over the nodes and change the tags as needed.
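Here is a small self-contained sketch of the same idea that you can run on a sample string. The helper name retag and the sample input are hypothetical, and it only strips the body tag to keep the example short.

```python
from io import StringIO
from lxml import etree


def retag(html):
    """Hypothetical helper: turn h3-h6/div into p and drop the <body> wrapper."""
    tree = etree.parse(StringIO(html), etree.HTMLParser())
    for element in tree.iter():
        if element.tag in ("h3", "h4", "h5", "h6", "div"):
            element.tag = "p"
    # strip_tags removes the tag itself but keeps its text and children.
    etree.strip_tags(tree, "body")
    return etree.tostring(tree.getroot(), method="html", encoding="unicode")


result = retag("<div><h3>Heading</h3><p>Body text</p></div>")
print(result)
```

The root html element is still there in the output, so some extra wrapping remains at this stage.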

Why the cleaner function?

After that we convert the whole tree back into a string and pass it to the cleaner function. The cleaner removes the nasty tags, keeps only the valid ones, and returns a string of filtered HTML. Since the cleaner takes a string, we first serialize the tree like this:
    result = etree.tostring(tree.getroot(), pretty_print=True, method="html")

After that we pass the result to the cleaner, which filters the string like this:

    from lxml.html import fragment_fromstring
    from lxml.html.clean import Cleaner
    NASTY_TAGS = frozenset(['style', 'script', 'object', 'applet', 'meta', 'embed'])
    cleaner = Cleaner(kill_tags=NASTY_TAGS, page_structure=False,
                      safe_attrs_only=False)
    safe_html = fragment_fromstring(cleaner.clean_html(result))

Here we have also parsed the cleaned string as a fragment.
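As a rough sketch of what the cleaner does on its own, here it is applied to a made-up string. The names dirty and cleaned and the sample markup are only for illustration, and this assumes lxml.html.clean can be imported in your environment.

```python
# Assumption: lxml.html.clean is importable. On recent lxml releases the
# cleaner ships separately as the lxml_html_clean package.
from lxml.html.clean import Cleaner

NASTY_TAGS = frozenset(["style", "script", "object", "applet", "meta", "embed"])
cleaner = Cleaner(kill_tags=NASTY_TAGS, page_structure=False, safe_attrs_only=False)

# Made-up input containing tags we want gone along with their content.
dirty = "<div><p>Keep me</p><script>alert(1)</script><style>p {color: red}</style></div>"
cleaned = cleaner.clean_html(dirty)
print(cleaned)
```

Tags listed in kill_tags are removed together with everything inside them, so neither the script nor the style content survives.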

Why fragment the cleaned HTML string?
We parse the cleaned string as a fragment to get rid of the extra parent tag that is usually added when a string is converted into a tree; that tag ends up in the output and gives false results. So we create a fragment from the string and then convert it back into a string. It seems quite silly to create a fragment only to convert it back to a string, but this is the way I found to remove the extra tags.

The string we obtain at the end is the final output of the transform, and it seems all the test cases are passing.



Yayaya!! It's always good to see all the test cases passing. Hopefully you liked reading this. Next time I will describe more about the testing part of the transform.



Cheers,
  




