COMP 4027 Forensic and Analytical Computing


Information about events and incidents is not only discoverable by viewing the activity history of someone as they interact with a file system or network. Sometimes evidence of activities is recorded in the information itself, such as the edit history of a file, which might include information on who created the file, who last edited it, what those edits were, and so on.

A similar principle can be applied to what might be called "genetic forensics" where historic genetic events can be discovered through investigating the human genome.

Looking in files for file histories

In this section we look at the forensics of information, as opposed to devices and infrastructure.

There are many opportunities to discover information in files that was not intended for public viewing. This information generally resides in the file's metadata, which is non-viewing material usually inserted by the editing application for purposes of version control, revision tracking and recording system information, although some of it, such as comments, is inserted explicitly by the author.

These problems almost always occur because of so-called WYSIWYG editing tools, where purportedly "What You See Is What You Get". However, nothing could be further from the truth: when you look inside the files created and maintained by these editing applications, it becomes obvious that "what you get" is rather more than "what you see". The unfortunate author judges the file's suitability for release based on the "what you see" but is not aware of all the additional information that leaks out through the metadata.

Some stories to set the stage

There are many stories of embarrassment caused by metadata being retrieved from released files (frequently .doc or .docx). While some are merely amusing, others demonstrate questionable behaviour.

For example, the UK government published a document, the famous "dossier", that supported their position regarding UK involvement in the USA-Iraq war. This document was discovered to be in part plagiarised from a graduate student's work [8]. Unfortunately for the authors, it was released as a Microsoft Word document and contained a quantity of metadata that revealed enough information to identify the Government staff who had been involved in the plagiarism [1].

Sometimes the information leaked through metadata can cause embarrassment rather than showing evidence of misconduct. Microsoft has had its share of embarrassment, with examples including the allegation that the Microsoft 1999 Annual Report was produced on a Mac rather than a Windows system [2], and that the pdf of a Microsoft marketing document was "a pdf produced on a Macintosh using QuarkExpress as can be seen by looking at the last few lines of the pdf code" [7].

Other stories involve the leakage of information to customers and clients that causes ill-feeling, such as sending in quotes for services or goods where the metadata reveals that the quote is a revision of quotes sent to others, with much lower prices. Similarly, private information about who is recruiting through a consulting company has been leaked through metadata [4].

These stories all show the importance of metadata for knowledge discovery, sometimes for the purpose of investigating criminal or otherwise prohibited activity. There are now complete industries dedicated to metadata removal (e.g. [12]) and professionals in many fields are being advised to "scrub" their files before releasing them not just publicly, but to customers (e.g. [14]).

We will now look at how metadata can be retrieved.

Finding metadata

It is very simple in Microsoft Word to look at some of this metadata merely by going to the "Properties" item on the "File" menu. Some basic data about the document and its history can be seen in the "Properties" of the file. Information about the document creation date, last save and last editing author is visible, such as in this properties extract from the Iraq war dossier:

Other data such as custom properties are also visible, such as this job advertisement:

More detailed data is also stored in the file, and if you have access to data viewing tools (almost none seem to be available for the Macintosh!), it is possible to view further information. For example, the last 10 revisions to the Iraq dossier can be revealed, showing all four of the authors involved in the creation of the document (and presumably also involved in the lifting of words from previously-published works without attribution) [4]:

What is being kept in metadata

A most insightful discussion on how metadata can expose edit histories can be found in Microsoft's own documentation [3]. We will quickly summarise here how such information is retained in documents.

(Refer directly to WD97: How to Minimize Metadata in Microsoft Word Documents)
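As a rough illustration of what an editing application retains, the newer .docx format (not the Word 97 .doc format that [3] covers) is a ZIP package in which basic properties such as the creator, the last editing author and the revision count are stored as XML. The sketch below assumes the package keeps them at the usual docProps/core.xml location; the function name and the field labels are illustrative choices, not part of any Microsoft tool.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Illustrative label -> element to look up inside core.xml.
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "revision": "cp:revision",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(docx):
    """Return the core properties found in a .docx (path or file-like object)."""
    with zipfile.ZipFile(docx) as package:
        # Assumption: the core properties live at the conventional location.
        root = ET.fromstring(package.read("docProps/core.xml"))
    found = {}
    for label, element_name in FIELDS.items():
        element = root.find(element_name, NS)
        if element is not None and element.text:
            found[label] = element.text
    return found
```

Calling read_core_properties("report.docx") on a hypothetical file returns a dictionary of whichever of these properties are present. Comments, tracked changes and custom properties are stored in other parts of the package and are not read by this sketch.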

Microsoft provides add-ins that remove such metadata prior to publishing the document [9] [10]. There are also numerous third-party "data scrubbing" applications available commercially for MS Office and PDF. However, there are those who suggest that the best way to remove any potentially embarrassing data is to ensure that files are only ever released as PDF or HTML [4], since Microsoft's "save as" facility apparently does not duplicate edit histories in these formats. Note, however, that it is still possible to recover some information such as author and creation date from PDF files [13]. See for example the basic author, date and system details from a PDF created by the course convenor:

Even HTML files can sometimes contain commented code or metadata that the authors did not mean to be available. More professional organisations tend to have less of this leakage, although one can still see that UniSA tracks usage of their pages:

... and you can see instructions to programmers at:
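Comments left in HTML source can be listed mechanically. The following is a minimal sketch using the HTMLParser class from Python's standard library; the class and function names are illustrative, and fetching the page itself is left out.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect the text of every HTML comment seen while parsing."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called by HTMLParser for each <!-- ... --> block.
        text = data.strip()
        if text:
            self.comments.append(text)


def find_comments(html_text):
    """Return the non-empty comments in an HTML string, in document order."""
    collector = CommentCollector()
    collector.feed(html_text)
    collector.close()
    return collector.comments
```

Passing the saved source of a page to find_comments returns the comment texts in the order they appear. Text that sits inside script blocks is not reported as a comment by this parser.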

Some further suggestions for viewing and removing metadata within MS Office files and PDFs can be found in the reference list below, e.g. [11] [12]. However, it seems that very few of the solutions are offered for the Mac, including Office 2008 for Mac.
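For PDFs, basic details are often stored as plain-text entries such as /Author, /Producer and /CreationDate in the file's document information dictionary. The sketch below simply scans the raw bytes of a file for such entries. It assumes they are stored uncompressed as literal (parenthesised) strings; hex strings, encrypted files, compressed object streams and XMP metadata are not handled, so a proper PDF library is the better choice for real work. The function name is illustrative.

```python
import re

# Standard keys of the PDF document information dictionary.
INFO_KEYS = ("Title", "Author", "Creator", "Producer", "CreationDate", "ModDate")

# Matches "/Key (value)" where value is a PDF literal string. Backslash
# escapes are allowed; nested unescaped parentheses are not.
_PATTERN = re.compile(
    rb"/(" + "|".join(INFO_KEYS).encode("ascii") + rb")\s*\(((?:\\.|[^\\()])*)\)",
    re.DOTALL,
)


def scan_pdf_info(raw):
    """Return the first literal-string value found for each information key."""
    found = {}
    for key, value in _PATTERN.findall(raw):
        # Crude unescape: keep the character after each backslash, so \( becomes (.
        # Escapes such as \n are not turned into real newlines.
        text = re.sub(rb"\\(.)", rb"\1", value, flags=re.DOTALL)
        found.setdefault(key.decode("ascii"), text.decode("latin-1"))
    return found
```

A typical call would be scan_pdf_info(open("file.pdf", "rb").read()) on a hypothetical file. Dates come back in the PDF date format, which begins with "D:" followed by the year, month, day and time digits.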

Event histories in the human genome

Interestingly, in [5], the author writes that "Technically, metadata is sort of the DNA of documents created with modern word-processing software." This is exactly what we are now going to look at: how DNA, or more accurately, the genome, can reveal information about past events.

By looking at common DNA sequences and, more importantly, differences, it is possible to reconstruct a plausible history of how humans are biologically related to all other living organisms. In addition, by using what is called "the molecular clock" (along with other dating systems like carbon dating and radioactive dating of rocks) it is possible to estimate the approximate time (i.e. number of generations) since any mutation (i.e. branching of species) occurred. The molecular clock essentially characterises the average rate of mutation or variability in DNA during cell division, so the approximate number of generations needed for a given amount of mutation to be probable can be worked out. Interestingly, this rate of mutation appears to be largely independent of species.
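The dating argument can be made concrete with a little arithmetic. The sketch below assumes the simplest form of the molecular clock: two lineages that split T generations ago each accumulate changes at the same constant rate r per site per generation, so the expected fraction of differing sites is about 2rT and T is about that fraction divided by 2r. Repeated changes at the same site are ignored. The function name and any numbers used with it are illustrative assumptions, not measured values.

```python
def estimate_generations(differences, sites, rate_per_site_per_generation):
    """Estimate the generations since two lineages split, under a simple clock.

    differences: number of aligned sites at which the two sequences differ
    sites: total number of aligned sites compared
    rate_per_site_per_generation: assumed constant mutation rate
    """
    if sites <= 0 or rate_per_site_per_generation <= 0:
        raise ValueError("sites and rate must both be positive")
    fraction_different = differences / sites
    # Both lineages have been changing independently since the split,
    # hence the factor of 2 in the denominator.
    return fraction_different / (2 * rate_per_site_per_generation)
```

For instance, with 20 differing sites out of 1000 and a hypothetical rate of 1e-8 changes per site per generation, estimate_generations(20, 1000, 1e-8) gives roughly one million generations.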

In addition to discovering inter-species relationships and approximately dating species divisions, it is also possible to date certain events in genetic (and human) history. Two such events are the appearance of two different sorts of globin in haemoglobin (the protein in red blood cells which carries oxygen around the body) and the appearance of colour vision.

Both of these mutations are what are called "duplication events", where the copying of DNA from a cell to its division (sister cell) accidentally makes two copies of part-sequences of DNA. It is believed that duplication events are responsible for the growth in how much DNA is contained in cells (although a lot is also "absorbed" from bacterial DNA, some of which is known as mitochondrial DNA). Dawkins explains these duplication events as follows [15]:

New genes aren't added to the genome out of thin air. They originate as duplications of older genes. Then, over evolutionary time, they go their separate ways by mutation, selection and drift. We don't usually see this happening but, like detectives arriving on the scene after a crime, we can piece together what must have happened from the evidence that remains.

Colour vision

Colour vision is an excellent exemplar of genetic duplication and mutation. The story told here is a gross simplification of one told accurately, engagingly and in detail in [15], but the essentials of how we can detect how we became "trichromatic" are summarised here.

We all have photosensitive ("light-sensitive") cells inside our eyes called cones. Each cone is photosensitive to a small range within the visible light spectrum, for example the "green" cones being most responsive to the green part of the light spectrum, the "blue" cones being most responsive to the violet-blue end of the spectrum and the "red" cones most responsive to the orange-red part of the spectrum.

At one stage, the ancestor species of humans (and many other related species) was only dichromatic, sensitive only to two distinct colours (green and blue). Many mammals today are still only dichromatic. However, humans became trichromatic (sensitive to three colours: red, green and blue) as a result firstly of a duplication event, where the gene responsible for cones seeing green was duplicated, so that there were two "green" genes as well as the one "blue" gene. Then, through a gradual process of mutation and natural selection, one of the duplicated "green" genes mutated so that the range of its photosensitivity shifted along the visible light spectrum. The cones it generated were now no longer (as) sensitive to green, but became more sensitive to the colour in the middle of its new spectral range, red.

The Haemoglobin molecule

Another, different example shows us how we can trace the order of events. The haemoglobin molecule consists of four quite similar parts, called globin chains, that occur in two different types: two of type alpha and two of type beta. Given that the probability of two identical genes arising through distinct mutation processes is almost zero*, we can deduce that the original single-globin gene first underwent a duplication event, so that there were duplicate genes for globins, and that these then gradually evolved to become different enough from each other to give rise to the two distinct globin gene types, alpha and beta. There was then another duplication event, not just of the alpha or beta globin genes, but of the entire gene sequence containing both globin type genes. The haemoglobin molecule now consists of two alpha globins and two beta globins, tightly bonded. What is more, we can say firstly that there has been no mutation of these since the second duplication event, otherwise the two alphas or the two betas would be different from each other, and secondly that the latter duplication event must be extremely ancient indeed: the haemoglobin molecule has the same form in almost every vertebrate species, so the event must have occurred prior to the speciation events that gave rise to the large majority of species.

All of this information is detected through a combination of the molecular clock and inspection of the DNA sequence of humans and related animals. The actual genetic information is all present in the genome itself, requiring only smart detective work comparing the human genome to similar species to detect approximate dates of these events.
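The comparison step behind this detective work can be sketched in a few lines: count the positions at which aligned sequences differ, on the reasoning that copies with fewer differences probably separated more recently. This is a toy illustration only. It assumes the sequences are already aligned to the same length, the function names are illustrative, and real analyses use alignment tools and statistical models rather than raw counts.

```python
from itertools import combinations


def count_differences(first, second):
    """Count the positions at which two aligned sequences differ."""
    if len(first) != len(second):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(first, second) if a != b)


def pairwise_differences(sequences):
    """Map each pair of sequence names to the number of differing positions."""
    return {
        (name_a, name_b): count_differences(seq_a, seq_b)
        for (name_a, seq_a), (name_b, seq_b) in combinations(sequences.items(), 2)
    }
```

Given a dictionary of named sequences, the pair with the smallest count in the result is the candidate for the most recent duplication, and dividing such counts by the sequence length gives the fraction of differences that a molecular clock estimate would start from.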


* Think about it - if a certain event has a 1 in x chance of occurring, then, assuming the two occurrences are independent, the chance of it happening twice is 1 in x squared.

Some resources

Useful reference materials:
  1. Microsoft Word bytes Tony Blair in the butt
  2. Was the Microsoft 1999 Annual report produced on a Macintosh?
  3. WD97: How to Minimize Metadata in Microsoft Word Documents
  4. Section 1.6 of MS-Word is Not a document exchange format
  5. Beware Your Trail of Digital Fingerprints
  6. How to minimize metadata in Word 2003
  7. Microsoft has produced a sales document explaining why MS Office is better than OpenOffice
  8. Downing St. document plagiarised
  9. Office 2003/XP Add-in: Remove Hidden Data
  10. Remove hidden data and personal information from Office documents
  11. MS DOC Metadata Viewer.
    Note that this appears in a forum dedicated to Computer Forensics.
  12. Payne Group
  13. Understanding PDF Metadata
  14. Beware the Dangers of Metadata (pages 36-37).
  15. Richard Dawkins, The Ancestor's Tale.
  16. NSA SNAC (December 13, 2005) (PDF). Redacting with Confidence: How to Safely Publish Sanitized Reports Converted From Word to PDF. Report# I333-015R-2005. Information Assurance Directorate, National Security Agency. Retrieved on 2006-05-29.
  17. Google Responds To EU: Cutting Raw Log Retention Time; Reconsidering Cookie Expiration

Last update hla 2009-05-17 (reference 17 added)