
GSoC 2009 – Proposal for Advanced Search in WordPress

April 3, 2009

Name: Praveen Bysani
Email: praveen.iiith@gmail.com
Location: Hyderabad, India
Webpage: http://web.iiit.ac.in/~lvsnpraveen

Title: Advanced search for WordPress

Problem:

A WordPress blog generally has several categories and spans many topics, so users often have to search for posts that interest them. The present search in WordPress is very simple. As the number of posts grows, users find it very difficult to get a relevant result with the current search.

The task is to create an advanced, field-based search engine for WordPress in which the user can form granular queries according to his interests in categories, tags, authors, etc. This project is important for WordPress because an advanced search interface improves the user experience and saves a lot of time.

An advanced query might look like “Leopard in category “operating systems” or “Mac” under tags “apple” written by steve”. The search engine should narrow the results down to posts in the category Mac or Operating systems that have the tag apple, are authored by steve, and contain the keyword leopard.

There are already plugins such as SearchAll and SearchEverything that search in attachments, categories, tags, comments, etc., but their functionality is very limited. For instance, when searching comments, “SearchEverything” retrieves results only if the query phrase matches a phrase in a comment exactly.

Proposal:

My tentative solution to this problem is as follows.

1) Indexing:

The first step is to create an index; the purpose of storing an index is to optimize search performance. Different structures are used for index storage, such as suffix trees, n-gram indexes, and inverted indexes. The inverted index is the most popular technique: it stores the list of word occurrences as a hash map, with words as keys and document ids as values.

In this context, a dump of the MySQL database is taken and an inverted index is created from it. Fields such as author, category, tags, and publish date are indexed along with the content words so that they can be used later for search.

Ex:
KEY                 VALUES
Author_steve      : postid1, postid2, postid1
Tag_apple         : postid1, postid2, postid43
Content_leopard   : postid1

where Author, Tag, and Content are the different fields indexed.
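The key/value layout above can be sketched in a few lines. This is only an illustration in Python (chosen for brevity; WordPress itself is PHP), and the post dictionaries and the `build_inverted_index` helper are hypothetical stand-ins for rows read from the database dump, not part of any existing API.

```python
from collections import defaultdict


def build_inverted_index(posts):
    """Map 'Field_term' keys to the sorted list of post ids containing that term."""
    index = defaultdict(set)  # a set per key, so a post id is never listed twice
    for post in posts:
        post_id = post["id"]
        index["Author_" + post["author"].lower()].add(post_id)
        for tag in post["tags"]:
            index["Tag_" + tag.lower()].add(post_id)
        for category in post["categories"]:
            index["Category_" + category.lower()].add(post_id)
        for raw_word in post["content"].lower().split():
            word = raw_word.strip(".,!?;:\"'()")  # crude punctuation removal
            if word:
                index["Content_" + word].add(post_id)
    return {key: sorted(ids) for key, ids in index.items()}


# Hypothetical sample posts standing in for database rows.
posts = [
    {"id": 1, "author": "steve", "tags": ["apple"], "categories": ["Mac"],
     "content": "Leopard is the new release."},
    {"id": 2, "author": "steve", "tags": ["apple"], "categories": ["Operating Systems"],
     "content": "A look at the kernel."},
    {"id": 43, "author": "bill", "tags": ["apple"], "categories": ["Mac"],
     "content": "Thoughts on hardware."},
]

index = build_inverted_index(posts)
print(index["Author_steve"])
print(index["Tag_apple"])
print(index["Content_leopard"])
```

Prefixing each key with its field name keeps all fields in one map, at the cost of slightly longer keys; one map per field would work equally well.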

2) Query Parsing:

The second step is to build a query object. A parser tokenizes the query and extracts phrases and fields. A stemmer and stop-word lists are used while tokenizing, for robustness. The query object is built after parsing.

We can either translate a natural-language query into our parser syntax

Ex: “Leopard in category Operating systems or Mac under tags apple written by steve” is translated into (author:steve AND tags:apple AND category:(operating systems OR mac) AND content:leopard)

or restrict the user to a predefined syntax for forming queries, as most search engines do.
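For the predefined-syntax option, a rough sketch of a parser could look like the following. It is written in Python for illustration; `parse_query` and the dictionary it returns are assumptions about what a query object might be. It handles only `field:value` clauses joined by AND, with OR allowed inside parentheses, and it skips stemming and stop-word removal.

```python
import re

# One clause: field:value or field:(value OR value ...)
CLAUSE = re.compile(r"(\w+):(?:\(([^)]*)\)|(\S+))")


def parse_query(query):
    """Parse 'field:value AND field:(a OR b)' into {field: [alternatives]}."""
    query = query.strip()
    # Drop one pair of parentheses wrapping the whole query, if present.
    if query.startswith("(") and query.endswith(")"):
        query = query[1:-1].strip()
    parsed = {}
    for clause in re.split(r"\s+AND\s+", query):
        match = CLAUSE.fullmatch(clause.strip())
        if match is None:
            raise ValueError(f"cannot parse clause: {clause!r}")
        field, group, single = match.groups()
        values = re.split(r"\s+OR\s+", group) if group is not None else [single]
        parsed[field.lower()] = [v.strip().lower() for v in values if v.strip()]
    return parsed


query_object = parse_query(
    "(author:steve AND tags:apple AND "
    "category:(operating systems OR mac) AND content:leopard )"
)
print(query_object)
```

A real parser would also need quoting, nested groups and error recovery; the point here is only that the translated query maps naturally onto a small field-to-alternatives structure.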

3) Scoring and Results:

Finally, the query object is searched against the index created in the first step. Fields are given boost values according to their importance; for example, the user may want to give more weight to author than to category or tags.
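One way field boosts could influence ordering is sketched below in Python. The `search` function, the `FIELD_PREFIX` table and the boost numbers are all hypothetical. As a further assumption, each query clause is treated as optional, so a post scores the sum of the boosts of the clauses it satisfies; with a strict AND every hit would satisfy every clause and the boosts could not change the order.

```python
# Hypothetical mapping from query field names to index key prefixes.
FIELD_PREFIX = {
    "author": "Author_",
    "tags": "Tag_",
    "category": "Category_",
    "content": "Content_",
}


def search(index, query, boosts):
    """Score each post by the summed boosts of the query clauses it matches."""
    scores = {}
    for field, alternatives in query.items():
        matched = set()
        for value in alternatives:  # alternatives within a clause are OR-ed
            matched.update(index.get(FIELD_PREFIX[field] + value, []))
        for post_id in matched:
            scores[post_id] = scores.get(post_id, 0.0) + boosts.get(field, 1.0)
    # Highest score first; ties broken by post id for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Small hand-written index in the 'Field_term' layout, for illustration only.
sample_index = {
    "Author_steve": [1, 2],
    "Tag_apple": [1, 2, 43],
    "Category_mac": [1, 43],
    "Content_leopard": [1],
}
sample_query = {
    "author": ["steve"],
    "tags": ["apple"],
    "category": ["operating systems", "mac"],
    "content": ["leopard"],
}
sample_boosts = {"author": 3.0, "tags": 1.5, "category": 1.5, "content": 1.0}

results = search(sample_index, sample_query, sample_boosts)
print(results)
```

Here post 1 matches every clause and ranks first, while post 2 outranks post 43 because the author match carries the largest boost.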

The resulting posts are ranked using similarity measures such as the cosine similarity between post and query, or the tf*idf value of the post.

cosine similarity: the query and the post are treated as vectors in n-dimensional space, and the cosine of the angle between them is taken as their similarity (n is the size of the vocabulary of terms).

tf*idf value: tf (term frequency) measures how often a term appears in a document; idf (inverse document frequency) measures how rare the term is across the index, so it shrinks as the term appears in more documents.
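The two measures combine naturally: weight each term by tf*idf, then compare the query and post vectors by cosine. The Python sketch below illustrates this on toy data. The idf formula used, log(N/df) + 1, is one common smoothed variant chosen here as an assumption, and the function names and sample posts are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return one {term: tf*idf} vector per tokenized document, plus the idf table."""
    n_docs = len(documents)
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Assumed idf variant: log(N / df) + 1, so common terms keep a small weight.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(doc).items()}
        for doc in documents
    ]
    return vectors, idf


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical posts, already tokenized.
post_ids = [1, 2, 43]
post_tokens = [
    "leopard is the new mac release".split(),
    "the mac kernel explained".split(),
    "thoughts on the new hardware".split(),
]
vectors, idf = tfidf_vectors(post_tokens)

# Query terms unknown to the index are ignored.
query_counts = Counter("leopard mac".split())
query_vector = {t: c * idf[t] for t, c in query_counts.items() if t in idf}

ranking = sorted(
    zip(post_ids, (cosine(query_vector, v) for v in vectors)),
    key=lambda pair: -pair[1],
)
print(ranking)
```

The post containing both query terms comes first, the post containing only "mac" second, and the post sharing no terms with the query scores zero.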

Although this approach looks quite complex, it is designed with long-term interests in mind. Once the index and the necessary components are built, incorporating further changes (if needed) will be very easy.

As search is a core component of WordPress, this project needs to be implemented in core.

There are no major risks involved with this project.

Advantages of this approach:

  1. Proven to work; arguably the best approach, as many large-scale systems are switching from databases to indexes
  2. With ranked results, the user finds what he is looking for without browsing through all the results
  3. Ability to make proximity, range, fuzzy, and wildcard searches

Extra Functionality (If Time Permits):

In addition to search in categories, tags, authors, and plugins as mentioned on the Wiki ideas page, we can also

  • Allow the user to search in the comments of a post
  • Allow search within a range of publishing dates (Ex: posts published between 11th January 2009 and 23rd February 2009)
  • Provide a media (audio/video) content filter in search
  • Provide an adult content filter in search
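The date-range item in the list above amounts to a simple filter over publish dates. A minimal Python sketch follows, assuming each post carries a `published` date; the `in_date_range` helper and the sample posts are hypothetical.

```python
from datetime import date


def in_date_range(posts, start, end):
    """Keep posts whose publish date falls within [start, end], inclusive."""
    return [post for post in posts if start <= post["published"] <= end]


# Hypothetical posts with publish dates.
dated_posts = [
    {"id": 1, "published": date(2009, 1, 5)},
    {"id": 2, "published": date(2009, 1, 11)},
    {"id": 3, "published": date(2009, 2, 20)},
    {"id": 4, "published": date(2009, 3, 1)},
]

hits = in_date_range(dated_posts, date(2009, 1, 11), date(2009, 2, 23))
print([post["id"] for post in hits])
```

In an indexed setup the same range test would more likely run against a sorted list of publish dates than against every post.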

Schedule of Deliverables:
My planned schedule, which may change after discussions with my mentor:

By the halfway mark:

  1. Explore and find ways to create an index from the database
  2. Build the query object
  3. Hook these modules into WordPress

By the end:

  1. Implement search
  2. Weight fields and rank posts
  3. Build a better user interface for advanced search
  4. Include extra functionality, depending on the time left
  5. Test the system and its integration

I will also be working on the TAC 2009 Update Summarization task during this time.

Open Source Development Experience

Although I am a user of various open source projects such as Nutch and Lucene, I never got a chance to contribute to them because of my academic commitments. Being in my final year, I have enough time this summer to contribute to open source. I believe GSoC 2009 will be a good starting point in this direction.

Work Experience

Worked as a Research Assistant in the IE&R Lab for a period of three months.
Worked on several projects during my first three years; a brief description of my projects can be found at http://web.iiit.ac.in/~lvsnpraveen

Academic Experience

I am in the fourth year of my Dual Degree programme (B.Tech + MS by Research, Computer Science) at the International Institute of Information Technology, Hyderabad (IIIT-H). My area of specialization is “Information Retrieval”. I am excited to work on search-related projects.

Courses:

Core Engineering:

  1. Programming in C
  2. Data Structures
  3. Artificial Intelligence
  4. Database Management Systems
  5. Software Engineering
  6. Algorithms
  7. Theory of Computation

Stream courses:

  1. Information Extraction and Retrieval
  2. Natural Language Processing
  3. Web Data and Knowledge Management
  4. Pattern Recognition

Why WordPress

Apart from my zeal for contributing to open source, my work mostly consists of information extraction and retrieval, and this project is closely related to it. The idea of working with talented web developers at WordPress is enthralling. I expect that working on this project will be mutually beneficial.


100% bitch ,got any problem with that !!

February 7, 2008

Don get into any wrong assumptions from the title… This bitch (post) is about no bitches or vulgarity. Actually its about cool one liners and quotes. I totally love them. I even try to phrase some, lets get to that later.

I wanna toss up here some of the naughtiest one liners i eva read/heard. Look at each1 again before u jump to the next one…

  • DONT DRINK WATER–Fishes Hav Sx in it
  • Ur Hot..Im coool..So letz get warm
  • If you are Rich..then I am Single
  • Let them WHISPER, we’ll STAYFREE!!!!
  • “If u can read this, then the bitch behind me has fallen down “(On back of biker)
  • Nice Legs!!! What time do they open
  • I am VEG,But I see,think and drink NONVEG
  • Opinions are like assholes, everyone’s got one!
  • Love is photogenic.It needs darkness to develop
  • Luck is like a bitch ,it always stays with rich.
  • Sorry if i look interested,Im not
  • Life is too short to date ugly gals
  • Im naked under my clothes.
  • Theres “U” in ugly but Definitely not “I”
  • Coffee. Chocolate. Women. Some things are better rich
  • sLeEp oNLy wHeN yOu aRe GeTtInG PaId fOr iT
  • When We Were Thirsty For Life, We Drank Vodka. When We Were Thirsty For Water, We Added Ice.
  • I am a lesbian trapped in a MAn’s body…

U think they are Asom, den u gotta read MY naughty creations/thoughts.. They sound a little boastful, but nevertheless. Ok OK, the following one liners are just for da sake of being.. dey arent abt me

  • I Lost Ma virginity………In ma dreams
  • Stop Calling me Sexy (gained some reputation for this ;) )
  • U hav A nice Horizontal Smile…No idea Abt ya Vertical…..{censored}
  • I am rated “A”
  • I wantd 2 kill sexiest person aliv..but den realised suicide is nt answer

There are some more, i jus cudnt get dem at da moment..

Thanx & ciao guys next JULY,

PraV.


Interesting Trivia Abt HollyWood Movies…

July 31, 2007

As mentioned earlier… I watch a lot of English movies. After watchin a movie i chk out abt dat movie @ IMDb fr details (after all im a 7 ptr, wat els will i do??). Here are some trivia whch i found interestin:

300 :

Some weapons used in 300 are actually weapons from previous war epics like “Alexander” and “Troy.” They were used in this film to cut costs.

Girl Next Door :

Elisha Cuthbert spoke wid some adult actresses from Wicked Pictures and Vivid Entertainment to help with her role in dis movie. Originally, both Brianna Banks and Jenna Jameson were slated to appear in this film.

Bourne Identity :

There are no opening credits besides the title card (u can observe dis is very rare). Brad Pitt turned down the role of Bourne for Spy Game (2001).

Primal Fear :

This was Edward Norton's (im a die-hard fan of his) debut film; 2,100 actors auditioned for this role. Matt Damon (Bourne) was one of them.

Snatch :

Every mistake that Sol, Vincent and Tyrone make was inspired by various late-night TV shows about real-life crimes gone horribly wrong. Nearly every death in the movie takes place off-screen (except Four Fingers).

Oceans Eleven and Twelve :

When Rusty Ryan (Brad Pitt) is teaching the “teen idols” to play poker, all of the actors are actual “teen idols” who were and/or are currently starring in popular TV shows. The cast did gamble during off hours. While there’s disagreement about who won the most (George Clooney says Matt Damon, Damon says Brad Pitt), Clooney managed to lose 25 hands of blackjack in a row.

The Truman Show:

Its a masterpiece. Jim Carrey is da only human who can do that kinda action. Coming to trivia: every street name in Seahaven (a town in da movie) refers to a movie actor, e.g. “Lancaster Square” or “Barrymore Road.” In an early scene, on Truman and Meryl’s kitchen table is a bottle of vitamin D, needed for those without exposure to the (real) sun. (If u watch da movie, u can understand how they take care of such small n sensitiv things.)

Troy:

Brad Pitt claimed the filming was torturous for him because he had to quit smoking. He trained for six months to get into shape for the role, aiming to make his body look like that of Greek statues.

American History X:

The ‘F’ word is used 205 times in dis movie… which makes it da 2nd movie (after Pulp Fiction, 281 times) to have the ‘F’ word more than 200 times.

Titanic :

This movie needs no introduction, its truly a masterpiece. It tops the box office list with an overall colctn of more than 600 million dollars. But den it still cant find a place in the Top 250 movies (IMDb).

After watchin all these.. i think im all set to cast n crew a film “7.me”.. wat u say ??

Hey, also watch “About dis Blog”... i donno y it isnt showing desc in dis page [im a newbie].


About This Blog !!!

July 31, 2007