I went to SES London on Thursday and attended the seminar on duplicate content.
It was very interesting, so I thought I would share a few points on the topic. The simplest way to avoid duplicate content is, of course, not to put any on the site in the first place.
1. The same content should not be accessible through more than one domain. If you have a test or staging domain, make sure you exclude it from spiders with a robots.txt file.
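One way to do this is to serve a robots.txt on the test domain only that blocks everything. A minimal sketch (the file lives at the root of the test domain, not the live one):

```
# robots.txt served on the test/staging domain only
User-agent: *
Disallow: /
```

Be careful not to deploy this file to the live domain, or you will block spiders from the real site as well.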
2. www vs non www
Major search engines can usually deal with this, but if both versions are available, make sure one redirects to the other. The same applies to secure and non-secure pages (http vs https).
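On Apache, a redirect from the www version to the bare domain can be set up in an .htaccess file with mod_rewrite. This is a sketch, assuming Apache and using example.com as a placeholder (swap the direction if you prefer www as the canonical version):

```apache
RewriteEngine On
# Send www.example.com to example.com with a permanent (301) redirect
RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
```

The 301 status tells spiders the move is permanent, so they consolidate on the target URL.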
3. Breadcrumb navigation that is reflected in the URL is another area that can cause duplicate content issues.
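For example, the same product page might be reachable under /books/fiction/item123 and /bestsellers/item123. One hedged sketch of a fix is to collapse every breadcrumb path to a single canonical address on the server side (the URL scheme and function name here are hypothetical, not from the article):

```python
from urllib.parse import urlparse

def canonical_product_url(url):
    """Map any breadcrumb-style URL for an item to one canonical URL.

    Assumes the item identifier is the last path segment; the
    /products/ prefix is an invented convention for illustration.
    """
    parsed = urlparse(url)
    item = parsed.path.rstrip("/").split("/")[-1]  # last path segment
    return f"{parsed.scheme}://{parsed.netloc}/products/{item}"
```

With this in place, both breadcrumb variants resolve to the same URL, so a spider only ever indexes one copy.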
4. Session IDs
Session IDs are tacked onto the end of the URL for users whose browsers don't support cookies. This is a big problem because every time the spider comes back to the site it gets a different URL, and so ends up indexing a lot of duplicate pages. Google apparently has an answer on how to exclude session IDs for spiders. When I find this information, I will post it here.
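In the meantime, one common workaround is to strip the session parameter from URLs before they are handed to spiders (or when generating links). A minimal sketch, assuming the session ID is a query parameter; the parameter names below are hypothetical and should be matched to whatever your platform actually appends:

```python
from urllib.parse import urlparse, urlunparse, parse_qsl, urlencode

# Hypothetical list of session parameter names; adjust for your platform.
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def strip_session_id(url):
    """Return the URL with any session-ID query parameters removed."""
    parsed = urlparse(url)
    kept = [(k, v) for k, v in parse_qsl(parsed.query)
            if k.lower() not in SESSION_PARAMS]
    return urlunparse(parsed._replace(query=urlencode(kept)))
```

Serving the cleaned URL to spiders means every crawl sees the same address for the same page, so the duplicate URLs never get indexed.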