Project ideas for 2013


If you are interested in any of these projects, please contact Helen Ashman to discuss.

Projects in the SearchScraper programme

How does Web search work?
We have been analysing search engine results for over a year now and have found some curious differences between search engines, such as how much their results overlap and how they update their indices. We have also found that different types of search terms have different lifespans in the top ten search results.

In this project you will be working to find out whether image searches show similar differences: whether images last longer in the top N, whether they change more suddenly, and how much the search engines overlap in their image results.
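
As a rough illustration of the two measurements mentioned above, the sketch below computes the overlap between two engines' top-N lists and counts how many snapshots each result stays in the top N. It is only a minimal sketch: the function names, the use of Jaccard overlap and the sample data are assumptions made for illustration, not the method used in the existing software.

```python
from collections import Counter


def top_n_overlap(results_a, results_b, n=10):
    """Jaccard overlap between the top-n entries of two ranked result lists."""
    top_a, top_b = set(results_a[:n]), set(results_b[:n])
    if not top_a and not top_b:
        return 0.0
    return len(top_a & top_b) / len(top_a | top_b)


def snapshots_in_top_n(snapshots, n=10):
    """Count, for each result, how many snapshots it appears in the top n."""
    counts = Counter()
    for ranked in snapshots:
        counts.update(set(ranked[:n]))
    return counts


# Hypothetical image results from two engines for the same query.
engine_a = ["img1", "img2", "img3", "img4"]
engine_b = ["img3", "img5", "img1", "img6"]
print(top_n_overlap(engine_a, engine_b, n=4))

# Hypothetical daily snapshots from one engine.
daily = [["img1", "img2"], ["img1", "img3"], ["img1", "img2"]]
print(snapshots_in_top_n(daily, n=2))
```

Running the same two functions over text and image results would give directly comparable figures for overlap and lifespan.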

Upscaling of search engine analysis software
In this project you will be working to rebuild some existing software so that it can perform large-scale search analyses. The software currently operates with fewer than 100 different search terms, and this project involves rewriting it to scale up to thousands of queries. This can be achieved by using cloud services to take on the computational load, so the project will involve using and integrating Amazon's cloud services.

Projects in the Behavioural intrusion detection programme

Behavioural intrusion detection
Intrusion detection and prevention systems aim to discover when an intruder has broken into a computer system and to act when an intrusion is detected. In this project, you will be working to build software to discover intruders masquerading as real users by comparing the behaviour of the intruders against the normal behaviour of the real user. This could be based on their typing speeds, favourite websites and applications, the way they write emails, tweets or even the way they use the command line. We have already trialled some user characteristics that appear to work well and now need to build an intrusion detection system that combines different characteristics in the optimal way to detect intruders as quickly as possible.
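
One simple way to combine several characteristics is a weighted average of per-characteristic anomaly scores. The sketch below assumes each characteristic has already been turned into a score between 0 (matches the real user) and 1 (very unlike the real user); the characteristic names, weights and threshold are hypothetical, and finding the best combination is the actual research question.

```python
def combined_anomaly_score(scores, weights):
    """Weighted average of per-characteristic anomaly scores in [0, 1]."""
    total_weight = sum(weights[name] for name in scores)
    if total_weight == 0:
        return 0.0
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# Hypothetical weights: how much each characteristic is trusted.
weights = {"typing_speed": 2.0, "command_line": 1.0, "websites": 1.0}
# Hypothetical scores for the current session.
scores = {"typing_speed": 0.9, "command_line": 0.5, "websites": 0.2}

score = combined_anomaly_score(scores, weights)
ALERT_THRESHOLD = 0.6  # assumed value, would need tuning on real data
print(score, score >= ALERT_THRESHOLD)
```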

Profiling intruders using honeypots
Our behavioural intrusion detection method creates profiles of legitimate users and continually reauthenticates them against their profiles. However, we can also create profiles of known intruders by setting up honeypots and learning intruder profiles from their actions on the honeypot. These profiles can then be used to determine whether the same intruder is reappearing, and to collect evidence against an intruder.

This project involves setting up a honeypot and then setting up profile capturing software to create intruder profiles.

An extension of this project is to try to characterise intruders if their collective behaviour is different to that of normal users of the network. For example, do they change directory and list contents more often than normal users? Do they use su or sudo commands more often than normal users? The aim is to see if we can characterise 'intruder behaviour'.
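
The questions above could be approached by comparing how often particular commands appear in honeypot sessions and in normal sessions. The following is a minimal sketch under that assumption; the function name, the list of watched commands and the sample session are hypothetical.

```python
from collections import Counter


def command_rates(commands, watched=("cd", "ls", "su", "sudo")):
    """Fraction of a session's commands accounted for by each watched command."""
    names = [line.split()[0] for line in commands if line.strip()]
    counts = Counter(names)
    total = len(names)
    return {cmd: (counts[cmd] / total if total else 0.0) for cmd in watched}


# Hypothetical command history captured on a honeypot.
honeypot_session = ["cd /etc", "ls -la", "sudo cat shadow", "ls", "cd /home"]
print(command_rates(honeypot_session))
```

Computing the same rates for ordinary users of the network would show whether the two groups differ.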

This project will be jointly supervised by Dr Raymond Choo.

Intrusion detection in a chatroom
Networks are not the only place that intrusion detection could be useful. For example, in chatrooms, sometimes people connect by using the credentials of a relative or friend and then masquerade as the legitimate user. This can be a serious problem if the chatroom is a place where children like to talk, as it provides 'grooming' opportunities for perverts.

In a chatroom version of the behavioural intrusion detection system, users of a chatroom can be reauthenticated according to their normal behaviour so if someone is using another person's credentials, they will be detected.

We have some chatroom data available to us. That data also includes non-human users of the system so another outcome is to be able to detect any non-human users from their profiles, a sort of Turing Test.

Projects in the Majority Web programme

Reputation problems in search engines' autocomplete function
Here's a story that exemplifies the reputational problems arising from search engine algorithms:
This news article is about the autocomplete function of a search engine, which is where you start typing into the little search window and a number of possibilities pop up, based on what you are typing. It is based on what other people have typed in previous searches, and tends to list the most popular searches with those opening characters. In a related news article (linked underneath the main story), it seems that someone has deliberately created an entry in the autocomplete function by sending in a scurrilous query many times so that it appears in the autocomplete.

So it seems that Google's results are not only open to 'gaming' by the search engine optimisation companies, but also within the autocomplete function - just get enough people/processes typing in a given search and it will start to pop up. The project I want to propose is: how quickly can we 'plant' an autocomplete entry? How many times do we need to submit a query for an autocomplete entry to appear?

This would involve setting up some new queries within software that we already have working in the lab so that we can disguise the origin of the query and randomise its frequency, so repeated queries will look like many different people. Of course, we wouldn't want to create a defamatory query. Instead we could create a completely made-up situation, perhaps something positive but false such as associating solar activity with increased happiness - yes I know it sounds wacky but there are quite a lot of results on Google for 'solar activity with increased happiness', and they are not (yet) coming up in the autocomplete function (except in my own local autocomplete as a result of typing it in once).

Mutual relevance ranking of Web resources
There is an assumption that when a person 'coselects' two results from the same set of results for one search query, those two results are 'about' the same thing. So if I search on 'apple' and make two selections, the assumption is that if I am looking for something about apple trees, then I am not going to select anything about Apple computers. We did a pilot experiment to test this 'mutual relevance' of coselected Web resources and it does indeed seem to be valid.

In this project, you will help run a larger experiment to confirm the mutual relevance assumption statistically. The experiment will also generate information about the relationships between different sorts of queries and help us look for patterns in those relationships. Some queries are definitely about the same thing (e.g. Castle Pernstejn and Hrad Pernstejn), while others are on a similar topic but are not the same thing (e.g. iPad and MacBook).

Data mining synonyms and translations from people's interactions with search engines
We collect what is called 'click' data from people's interactions with Web search engines and are using it to create clusters of Web resources based on the search term they came from. Once we have these clusters, we sometimes need to 'glue' them together when they are separate but ought not to be. This happens because the data is quite sparse.

In this project you will be using what is called 'cluster overlap by population' to join any clusters that have the same label and many pages in common but are not yet joined. Once we have these, we can then use the cluster overlap method to discover synonyms (where the search terms are in the same language) and translations (where they are in different languages).
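
To make the idea concrete, the sketch below merges two clusters when they carry the same label and share a large enough fraction of their pages. The exact definition of 'cluster overlap by population' is not given here, so the measure used (shared pages as a fraction of the smaller cluster), the threshold and all names are assumptions for illustration only.

```python
def population_overlap(pages_a, pages_b):
    """Shared pages as a fraction of the smaller cluster's population."""
    a, b = set(pages_a), set(pages_b)
    smaller = min(len(a), len(b))
    return len(a & b) / smaller if smaller else 0.0


def should_merge(label_a, pages_a, label_b, pages_b, threshold=0.5):
    """Merge only clusters with the same label and enough pages in common."""
    return label_a == label_b and population_overlap(pages_a, pages_b) >= threshold


# Hypothetical clusters built from click data for the same search term.
pages_a = {"u1", "u2", "u3"}
pages_b = {"u2", "u3", "u4", "u5"}
print(population_overlap(pages_a, pages_b))
print(should_merge("apple", pages_a, "apple", pages_b))
```

Dropping the same-label condition and keeping only the overlap test is one way the same measure could be applied to find candidate synonyms and translations.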

How reliable is clickthrough data?
We assume that when a user selects a result from a search engine's results page, the selected URL is indeed relevant to the search. In WSL, we have shown in a large experiment that this is true, but only for image search. We now need to do a similar evaluation for text searches, and then compare the results with those from image searches to see which kind of clickthrough data is more reliable.

Projects in the Personalisation programme

Personalised Web search - does it work?
Every time we make a search, the search engine decides what we get to see. In 2009, Web search started to become personalised, with the search engine trying to second-guess what we really want. But how accurate is the search engine's guess? How satisfied are people with what they are served up? Are there times when people would be better off with 'normal' search results that are not personalised?

In this project you will be working to collect some data and answer some of these questions. We will analyse the difference between personalised and non-personalised results, and discover whether people are more satisfied with personalised results.
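
One possible starting point for the analysis is to compare a personalised result list with a non-personalised one for the same query and report what was added, dropped or reordered. This is a sketch only; the function name, the output format and the sample lists are hypothetical.

```python
def result_difference(personalised, baseline, n=10):
    """Compare the top-n of a personalised and a non-personalised result list."""
    p, b = personalised[:n], baseline[:n]
    shared = set(p) & set(b)
    return {
        # Results that only appear when personalisation is on.
        "only_personalised": [url for url in p if url not in shared],
        # Results that personalisation pushed out of the top n.
        "only_baseline": [url for url in b if url not in shared],
        # Results present in both lists but at different ranks.
        "reordered": sorted(url for url in shared if p.index(url) != b.index(url)),
    }


# Hypothetical results for the same query with and without personalisation.
personalised = ["a", "c", "b", "e"]
baseline = ["a", "b", "c", "d"]
print(result_difference(personalised, baseline, n=4))
```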

Privacy-enhanced personalisation of Web search
It has been said of at least one Internet application that:
If you're not paying for it, you're not the customer. You're the product being sold.
This observation might also be made of search engines. They amass large quantities of personal data from every user of their services. It is unlikely that the business model of search engines is philanthropic, so clearly their income derives from other sources, such as advertising. Targeted advertising is increasingly popular with advertisers as it offers access to better sales prospects than blanket advertising, but to achieve this, personal information about users must be available. Companies such as Experian in the UK sell such data for marketing purposes, and with targeted advertising now appearing alongside normal search results, it is evident that search engines are using the personal data they collect to provide targeted search opportunities to their advertisers. In fact, it is even claimed that some search engine corporations are no longer in the search business but rather are now in the marketing business. Given this viewpoint, one might wonder if the personalisation of content is offered not to improve the user experience (as it may do quite the opposite) but to offer a publicly palatable reason for the collection of enormous quantities of personal data.

Under Australia's National Privacy Principles and the European Union's Data Protection Directive, the user's informed consent must be sought prior to data collection. However, the user cannot rely on corporations to abide by local privacy laws, especially when data is managed offshore by international corporations. It also seems that even where privacy laws explicitly forbid certain types of data collection, it still takes place (as evidenced by the illegal collection of household network data by Google StreetView from many countries during 2010 and more recently the apparently inadvertent bypassing of browsers' 'do not track' instructions).

All this means that the user must be vigilant about outgoing communications from their personal devices. In some cases, personal data is deliberately released by users, such as on social networks. However, this proposal focuses on cases where personal data is collected surreptitiously. At present the control over what data is collected lies largely with the search engines themselves.

This project proposes a proxy architecture that will reverse that situation, so that the release of personal data is governed at the user's end, not at the search engine's end. The user will still be able to make use of personalised search, but that personalisation will be performed by the proxy, which will run locally to the user, not by the search engine. Use of Amazon's cloud services and/or the Tor anonymising service will ensure that location information is suppressed.