A similar principle can be applied to what might be called "genetic forensics" where historic genetic events can be discovered through investigating the human genome.
There are many opportunities to discover information in files that was not intended for public viewing. This information generally sits in the file's metadata: material that is not displayed and is usually inserted by the editing application for version control, revision tracking and recording system information. Some of it, such as comments, is inserted explicitly by the author.
These problems almost always occur because of so-called WYSIWYG editing tools, where purportedly "What You See Is What You Get". However, nothing could be further from the truth: when you look inside the files created and maintained by these editing applications, it becomes obvious that "what you get" is rather more than "what you see". The unfortunate author judges the file's suitability for release based on "what you see", unaware of all the additional information that leaks out through the metadata.
For example, the UK government published a document, the famous "dossier", that supported its position regarding UK involvement in the USA-Iraq war. This document was discovered to be in part plagiarised from a graduate student's work. Unfortunately for the authors, it was released as a Microsoft Word document and contained enough metadata to identify the Government staff who had been involved in the plagiarism.
Sometimes the information leaked through metadata can cause embarrassment rather than showing evidence of misconduct. Microsoft has had its share of embarrassment, with examples including the allegation that the Microsoft 1999 Annual Report was produced on a Mac rather than a Windows system, and that the PDF of a Microsoft marketing document was "a pdf produced on a Macintosh using QuarkExpress as can be seen by looking at the last few lines of the pdf code".
Other stories involve the leakage of information to customers and clients that causes ill-feeling, such as quotes for services or goods where the metadata reveals that the quote is a revision of quotes sent to others, with much lower prices. Similarly, private information about who is recruiting through a consulting company has been leaked through metadata.
These stories all show the importance of metadata for knowledge discovery, sometimes for the purpose of investigating criminal or otherwise prohibited activity. There are now complete industries dedicated to metadata removal, and professionals in many fields are being advised to "scrub" their files before releasing them, not just publicly but also to customers.
We will now look at how metadata can be retrieved.
Other data, such as custom properties, is also visible, as in this job advertisement:
More detailed data is also stored in the file, and if you have access to data viewing tools (almost none seem to be available for the Macintosh!), it is possible to view it. For example, the last 10 revisions to the Iraq dossier can be revealed, showing all four of the authors involved in the creation of the document (and presumably also involved in the lifting of words from previously published works without attribution):
(Refer directly to WD97: How to Minimize Metadata in Microsoft Word Documents)
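The examples above concern older binary Word files, which need dedicated viewing tools. As a rough illustration of the same idea for the newer `.docx` format (a ZIP archive in which the core properties are stored as XML in `docProps/core.xml`), the following minimal sketch lists a few of the stored fields. The function name `read_core_properties`, the `FIELDS` table and the file name in the usage note are hypothetical, chosen only for this illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by docProps/core.xml in Office Open XML packages.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
}

# Hypothetical selection of fields to report (label -> qualified tag).
FIELDS = {
    "creator": "dc:creator",
    "last_modified_by": "cp:lastModifiedBy",
    "revision": "cp:revision",
    "created": "dcterms:created",
    "modified": "dcterms:modified",
}


def read_core_properties(path):
    """Return the core properties found in a .docx (or .xlsx/.pptx) file."""
    with zipfile.ZipFile(path) as package:
        root = ET.fromstring(package.read("docProps/core.xml"))
    found = {}
    for label, tag in FIELDS.items():
        element = root.find(tag, NS)
        # Skip fields that are absent or empty in this particular file.
        if element is not None and element.text:
            found[label] = element.text
    return found
```

Calling `read_core_properties("report.docx")` on a hypothetical file returns a dictionary of whichever fields are present. Custom properties and revision details live elsewhere in the package and are not covered by this sketch.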
Microsoft has software add-ons to remove such metadata prior to publishing the document. There are also numerous third-party "data scrubbing" applications available commercially for MS Office and PDF. However, there are those who suggest that the best way to remove any potentially embarrassing data is to ensure that files are only ever released as PDF or HTML, since Microsoft's "save as" facility apparently does not carry edit histories over into these formats. Note, however, that it is still possible to recover some information such as author and creation data from PDF files. See for example the basic author, date and system details from a PDF created by the course convenor:
Even HTML files can sometimes contain commented code or metadata that the authors did not mean to be available. More professional organisations tend to leak less in this way, although one can still see that UniSA tracks usage of its pages:
... and you can see instructions to programmers at:
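Comments and `<meta>` tags of this kind can be listed mechanically rather than by reading the page source by eye. The following is a minimal sketch using Python's standard `html.parser` module. The class name `CommentCollector` and the function `inspect_html` are hypothetical names introduced for this illustration.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Collect HTML comments and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self.meta = {}

    def handle_comment(self, data):
        # Called for the text between <!-- and -->.
        text = data.strip()
        if text:
            self.comments.append(text)

    def handle_starttag(self, tag, attrs):
        # Tag names are lower-cased by the parser before this is called.
        if tag == "meta":
            attributes = dict(attrs)
            name = attributes.get("name")
            if name:
                self.meta[name] = attributes.get("content") or ""


def inspect_html(source):
    """Return (comments, meta) found in an HTML source string."""
    collector = CommentCollector()
    collector.feed(source)
    collector.close()
    return collector.comments, collector.meta
```

Given the text of a saved page, `inspect_html` returns a list of comment strings and a dictionary of named `<meta>` entries. It does not look inside scripts or stylesheets, which can also carry comments.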
Some further suggestions for viewing and removing metadata within MS Office files and PDFs can be found in the reference list below. However, it seems that very few of the solutions are offered for the Mac, including Office 2008 for Mac.
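Where no viewing tool is to hand, the author and system details that a PDF can retain are sometimes readable directly in the file's bytes. The following is a minimal sketch that scans for a few entries of the PDF document information dictionary. It assumes the entries are stored uncompressed, unencrypted and as plain literal strings, so it will find nothing in files that keep this data in compressed streams or only in XMP. The function name `scan_pdf_info` and the file name in the usage note are hypothetical.

```python
import re

# Keys commonly found in a PDF document information dictionary.
INFO_KEYS = (b"Author", b"Creator", b"Producer", b"Title", b"CreationDate", b"ModDate")


def scan_pdf_info(raw):
    """Return {key: value} for simple literal-string Info entries in raw PDF bytes."""
    found = {}
    for key in INFO_KEYS:
        # Matches e.g. "/Producer (Some Application)"; strings containing
        # nested parentheses or backslash escapes are deliberately skipped.
        match = re.search(rb"/" + key + rb"\s*\(([^()\\]*)\)", raw)
        if match:
            found[key.decode("ascii")] = match.group(1).decode("latin-1")
    return found
```

For a hypothetical file, this would be called as `scan_pdf_info(open("handout.pdf", "rb").read())`. Values encoded as UTF-16 will not decode cleanly with this sketch, and a proper PDF library is the better choice for anything beyond a quick look.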
By looking at common DNA sequences and, more importantly, differences, it is possible to reconstruct a plausible history of how humans are biologically related to all other living organisms. In addition, by using what is called "the molecular clock" (along with other dating systems like carbon dating and radioactive dating of rocks) it is possible to estimate the approximate time (i.e. number of generations) at which any mutation (i.e. branching of species) occurred. The molecular clock essentially characterises the average rate of mutation or variability in DNA during cell division, so an approximate number of generations can be worked out for any form of mutation to be probable. Interestingly, this rate of mutation appears to be largely independent of species.
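The arithmetic behind a molecular clock estimate can be sketched in a few lines. The example below is a deliberately simplified illustration, not a method taken from the text. It assumes two aligned sequences of equal length, a constant mutation rate per site per generation, and that no site has changed more than once. Because both lineages accumulate changes after they split, the fraction of differing sites is divided by twice the rate. The function names and the numbers in the usage note are hypothetical, not real data.

```python
def fraction_differing(seq_a, seq_b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def generations_since_split(seq_a, seq_b, rate_per_site_per_generation):
    """Estimate generations since two lineages diverged (simple clock model)."""
    p = fraction_differing(seq_a, seq_b)
    # Both lineages mutate independently after the split, hence the factor 2.
    return p / (2 * rate_per_site_per_generation)
```

For instance, `generations_since_split("ACGTACGTAC", "ACGTACGTTC", 0.001)` compares two made-up ten-letter sequences that differ at one position and gives an estimate of roughly 50 generations. Real analyses correct for repeated changes at the same site and for variation in rate.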
In addition to discovering inter-species relationships and approximately dating species divisions, it is also possible to date certain events in genetic (and human) history. Two such events are the appearance of two different sorts of haemoglobin (the protein in red blood cells that carries oxygen around the body) and the appearance of colour vision.
Both of these mutations are what are called "duplication events", where the copying of DNA from a cell to its division (sister cell) accidentally makes two copies of part-sequences of DNA. It is believed that duplication events are responsible for the growth in how much DNA is contained in cells (although a lot is also "absorbed" from bacterial DNA, some of which is known as mitochondrial DNA). Dawkins explains these duplication events as follows:
New genes aren't added to the genome out of thin air. They originate as duplications of older genes. Then, over evolutionary time, they go their separate ways by mutation, selection and drift. We don't usually see this happening but, like detectives arriving on the scene after a crime, we can piece together what must have happened from the evidence that remains.
We all have photosensitive ("light-sensitive") cells inside our eyes called cones. Each cone is photosensitive to a small range within the visible light spectrum, for example the "green" cones being most responsive to the green part of the light spectrum, the "blue" cones being most responsive to the violet-blue end of the spectrum and the "red" cones most responsive to the orange-red part of the spectrum.
At one stage, the ancestor species of humans (and many other related species) was only dichromatic, sensitive only to two distinct colours (green and blue). Many mammals today are still only dichromatic. However, humans became trichromatic (sensitive to three colours: red, green and blue). This began with a duplication event, in which the gene responsible for cones seeing green was duplicated, so that there were two "green" genes as well as the one "blue" gene. Then, through a gradual process of mutation and natural selection, one of the duplicated "green" genes mutated so that the range of its photosensitivity shifted along the visible light spectrum. The cones it generated were no longer (as) sensitive to green, but became more sensitive to the colour in the middle of the new spectral range, red.
All of this information is detected through a combination of the molecular clock and inspection of the DNA sequence of humans and related animals. The actual genetic information is all present in the genome itself, requiring only smart detective work comparing the human genome to similar species to detect approximate dates of these events.
* Think about it - if a certain event has a 1 in x chance of occurring, then the chance of it happening to two, independently, is 1 in x squared.