Name: Praveen Bysani
Email: praveen.iiith@gmail.com
Location: Hyderabad, India
Webpage: http://web.iiit.ac.in/~lvsnpraveen
Title: Advanced search for WordPress
Problem:
A WordPress blog generally has several categories and spans different topics, so users often have to search for the posts that interest them. The present search in WordPress is very simple. As the number of posts grows, users find it very difficult to get a relevant result with the current search.
The task is to create an advanced field-based search engine for WordPress in which a user can form granular queries according to his interests in categories, tags, authors, etc. This project is important for WordPress because an advanced search interface improves the user experience and saves a lot of time.
An advanced query might look like: “Leopard in category “operating systems” or “Mac” under tags “apple” written by steve”. The search engine should narrow the results down to posts in the categories Mac or Operating systems that have the tag apple, are authored by steve, and contain the keyword leopard.
There are already plugins such as SearchAll and SearchEverything that search in attachments, categories, tags, comments, etc., but their functionality is very limited. For instance, SearchEverything, while searching the comment space, retrieves results only if the query phrase exactly matches a phrase in a comment.
Proposal:
My tentative solution to this problem is as follows:
1) Indexing:
The first step is to create an index; the purpose of storing an index is to optimize search performance. Different techniques are used for index storage, such as suffix trees, n-gram indexes, and inverted indexes. The inverted index is the most popular indexing technique; it stores the list of word occurrences as a hash map with words as keys and document ids as values.
In this context, a dump of the MySQL database is taken and an inverted index is created from it. Fields such as author, category, tags, and publish date are indexed along with the content words so that they can be used later for search.
Ex:
KEY : VALUES
Author_steve : postid1,postid2,postid1
Tag_apple : postid1,postid2,postid43
Content_leopard : postid1
where Author, Tag, and Content are the different fields indexed.
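A minimal sketch of how such a field-prefixed inverted index could be built is shown here in Python, purely for illustration. The post records, their field names, and the `build_index` helper are assumptions made for this example; they are not the WordPress database schema or an existing API. Keys follow the `Field_term` naming used in the example above.

```python
from collections import defaultdict

# Hypothetical post records. The field names are assumptions for this
# illustration, not the actual WordPress database schema.
POSTS = [
    {"id": 1, "author": "steve", "tags": ["apple"],
     "content": "Leopard is the new release"},
    {"id": 2, "author": "steve", "tags": ["apple"],
     "content": "Notes on the keynote"},
    {"id": 43, "author": "bill", "tags": ["apple"],
     "content": "Comparing desktops"},
]


def build_index(posts):
    """Map field-prefixed terms such as 'Author_steve' to sets of post ids."""
    index = defaultdict(set)
    for post in posts:
        index["Author_" + post["author"].lower()].add(post["id"])
        for tag in post["tags"]:
            index["Tag_" + tag.lower()].add(post["id"])
        # Naive whitespace tokenization; no stemming or stop word removal.
        for word in post["content"].lower().split():
            index["Content_" + word].add(post["id"])
    return index


index = build_index(POSTS)
print(sorted(index["Author_steve"]))
print(sorted(index["Tag_apple"]))
print(sorted(index["Content_leopard"]))
```

Storing post ids in sets keeps each posting list free of duplicates and makes it cheap to intersect lists when several fields are combined in one query.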
2) Query Parsing:
The second step is to build a query object. A parser is used to tokenize the query and extract phrases and fields. A stemmer and stop word lists are used while tokenizing, for robustness. A query object is built after parsing.
We can either translate a natural language query into our parser syntax,
Ex: “Leopard in category Operating systems or Mac under tags apple written by steve” is translated into (author:steve AND tags:apple AND category:(operating systems OR mac) AND content:leopard)
or the user can be restricted to a predefined syntax for forming queries, as in most search engines.
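As a rough sketch of the second option, the predefined syntax from the example could be parsed into a simple query object as follows. The `parse_query` function and the dictionary representation are assumptions made for illustration; this sketch treats all clauses as ANDed, supports OR only inside parentheses, and leaves out stemming and stop word handling.

```python
import re

# One clause is either `field:value` or `field:(alt one OR alt two)`.
CLAUSE = re.compile(r"(\w+):(?:\(([^)]*)\)|([^\s()]+))")


def parse_query(text):
    """Return {field: [alternatives]}; clauses are implicitly ANDed."""
    query = {}
    for field, group, single in CLAUSE.findall(text):
        raw = group or single
        # Alternatives inside parentheses are separated by the keyword OR.
        values = [v.strip().lower()
                  for v in re.split(r"\s+OR\s+", raw) if v.strip()]
        query.setdefault(field.lower(), []).extend(values)
    return query


query = parse_query(
    "(author:steve AND tags:apple AND "
    "category:(operating systems OR mac) AND content:leopard )"
)
print(query)
```

Each field maps to a list of acceptable values, so a post matches when, for every field in the query, it matches at least one of the listed values.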
3) Scoring and Results:
Finally, the query object is searched against the index created in the first step. Fields are given boost values according to their importance; for example, a user may want to give more importance to author than to category or tags.
The resulting posts are ranked using similarity measures such as the cosine similarity between post and query, or the tf*idf value of the post.
Cosine similarity: the query and the post are treated as vectors in n-dimensional space, and the cosine of the angle between them is taken as their similarity (n is the size of the term vocabulary).
tf*idf value: tf (term frequency) measures how often a term appears in the document; idf (inverse document frequency) decreases as the term appears in more documents across the index, so rarer terms carry more weight.
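The two measures can be combined by weighting each vector component with tf*idf and then comparing vectors by cosine. The sketch below is one possible formulation, not a fixed design: the token lists, the function names, and the smoothed `log(N / df) + 1` form of idf are assumptions for illustration, and field boosts are left out to keep it short.

```python
import math
from collections import Counter

# Hypothetical tokenized posts: {post id: list of content tokens}.
DOCS = {
    1: ["leopard", "release", "leopard"],
    2: ["keynote", "notes"],
    43: ["leopard", "desktop", "comparison"],
}


def tf_idf_vectors(docs):
    """Return ({post id: {term: tf*idf weight}}, {term: idf})."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))
    # Assumed idf variant: log(N / df) + 1, so no term gets weight zero.
    idf = {t: math.log(n_docs / df) + 1.0 for t, df in doc_freq.items()}
    vectors = {}
    for post_id, tokens in docs.items():
        tf = Counter(tokens)
        vectors[post_id] = {t: count * idf[t] for t, count in tf.items()}
    return vectors, idf


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_tokens, docs):
    """Return ids of posts with a non-zero score, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    query_tf = Counter(t for t in query_tokens if t in idf)
    query_vec = {t: count * idf[t] for t, count in query_tf.items()}
    scored = [(cosine(query_vec, vec), post_id)
              for post_id, vec in vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [post_id for score, post_id in scored if score > 0.0]


print(rank(["leopard"], DOCS))
```

Post 1 ranks above post 43 here because it mentions the query term twice in a shorter document. A field boost could be applied by multiplying the weights of terms from that field by a constant before the cosine is computed.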
Although this approach looks quite complex, it is designed with long-term interests in mind. Once the index and the necessary components are built, incorporating further changes (if needed) will be very easy.
As search is a core component of WordPress, this project needs to be implemented in core.
There are no major risks involved with this project.
Advantages of this approach:
- Proven to work; arguably the best approach, as many large-scale systems have switched from databases to indexes
- With ranked results, the user finds what he is looking for without browsing through all the results
- Ability to make proximity, range, fuzzy, and wildcard searches
Extra Functionality (if time permits):
In addition to searching in categories, tags, authors, and plugins, as mentioned on the Wiki ideas page, we can also:
- Allow the user to search in the comments of a post
- Allow search within a range of publishing dates (Ex: posts published between 11th January 2009 and 23rd February 2009)
- Provide a media (audio/video) content filter in search
- Provide an adult content filter in search
Schedule of Deliverables:
My planned schedule, which may change after discussions with the mentor:
By Half way mark:
- Explore and find ways to create an index from database
- Build query object
- Hook these modules into WordPress
By End:
- Implement search
- Weight fields and rank posts
- Build a better user interface for advanced search
- Include extra functionality, depending on the time left
- Test the system and its integration
I will also be working on the TAC 2009 Update Summarization task during this time.
Open Source Development Experience
Although I am a user of various open source projects such as Nutch and Lucene, I never got a chance to contribute to them because of my academic commitments. Being in my final year, I have enough time this summer to contribute to open source. I believe GSoC 2009 will be a good starting point in this direction.
Work Experience
Worked as a Research Assistant in the IE&R Lab for a period of three months.
Worked on several projects during my first three years; a brief description of my projects can be found at http://web.iiit.ac.in/~lvsnpraveen
Academic Experience
I am in the fourth year of my Dual Degree programme (B.Tech + MS by research, computer science) at the International Institute of Information Technology, Hyderabad (IIIT-H). My area of specialization is “Information retrieval”. I am excited to work on search-related projects.
Courses:
Core Engineering:
- Programming in C
- Data Structures
- Artificial Intelligence
- Database management systems
- Software Engineering
- Algorithms
- Theory of Computation
Stream courses:
- Information extraction and retrieval
- Natural Language Processing
- Web data and knowledge management
- Pattern Recognition
Why WordPress
Apart from my zeal for contributing to open source, my work mostly consists of information extraction and retrieval, and this project is closely related to it. The idea of working with potential web developers in WordPress is enthralling. I expect that working on this project will be mutually beneficial.


