
Google Scholarly

http://scholar.google.com/intl/en/scholar/publishers.html


Inclusion Guidelines for Webmasters


This document describes the technical details of indexing of websites with scholarly articles in Google
Scholar. Its intended audience is webmasters who would like their papers included in Google Scholar
search results. See this page for a less technical overview of the indexing process.
If you're an individual author, it works best to simply upload your paper to your website, e.g.,
www.example.edu/~professor/jpdr2009.pdf; and add a link to it on your publications
page, such as www.example.edu/~professor/publications.html. Make sure that (a) the
full text of your paper is in a PDF file that ends with ".pdf", (b) the title of the paper appears in a large
font on top of the first page, (c) the authors of the paper are listed right below the title on a line by
itself, and (d) there's a bibliography section titled, e.g., "References" or "Bibliography" at the end.
That's it! Our search robots should normally find your paper and include it in Google Scholar within
several weeks. If it doesn't work, you could either (1) read more detailed technical guidelines below or
(2) check if your local institutional repository is already configured for indexing in Google Scholar,
and upload your papers there.

Content Guidelines
1. The content hosted on your website must consist primarily of
scholarly articles - journal papers, conference papers,
technical reports, or their drafts, dissertations, pre-prints,
post-prints, or abstracts. Content such as news or magazine
articles, book reviews, and editorials is not appropriate for
Google Scholar. Documents larger than 5MB, such as books
and long dissertations, should be uploaded to Google Book
Search; Google Scholar automatically includes scholarly works
from Google Book Search.
2. Users click through to your website to read your articles. To
be included, your website must make either the full text of
the articles or their complete author-written abstracts freely
available and easy to see when users click on your URLs in
Google search results. Your site must not require users (or
search robots) to sign in, install special software, accept
disclaimers, dismiss popup or interstitial advertisements, click
on links or buttons, or scroll down the page before they can
read the entire abstract of the paper. Sites that show login
pages, error pages, or bare bibliographic data without
abstracts will not be considered for inclusion and may be
removed from Google Scholar.

Crawl Guidelines
Google Scholar uses automated software, known as "robots" or "crawlers", to fetch your files for
inclusion in the search results. It operates similarly to regular Google search. Your website needs to be
structured in a way that makes it possible to "crawl" it in this manner. In particular:
1. Your articles need to be either in the HTML or in the PDF file
format. PDF files must have searchable text, i.e., you must be
able to search for and find words in the document using
Adobe Acrobat Reader. Each file must not exceed 5MB in size.
To index larger files, or to index scanned images of pages that
require OCR, please upload them to Google Book Search.
2. If you're hosting a small collection of publications, such as
papers written by a single author or a small group, then we
recommend that you list them on a single HTML page, such as
www.example.edu/~professor/publications.html, and include links
to their full text in the PDF format. This arrangement of files
will make it easy for the search robots to find all of your
papers.
3. If you're hosting a larger collection of papers, such as a
university-wide repository, then it would not be practical to list
all of the articles on a single webpage. In this case, all the
papers must be discoverable from the homepage of your
collection by following at most ten simple HTML links. The
use of Flash, JavaScript, or form-based navigation makes it
hard for our automated system to find your articles. If your
site uses these types of navigation, please add a "browse by
date" interface that uses simple HTML links, as described
below.
4. If your website has thousands of papers or more, the best way
to make sure they're all discovered by the search robots is to
provide a way to list them by the date of publication or the
date of record entry. Other forms of browse interfaces, such
as browse by author or by keyword, often generate more
browse URLs than the website can actually provide to the
search robots. To ensure that the robots can find all of your
articles, we recommend that the total number of intermediate
"browse" URLs does not exceed 10% of the number of article
URLs. We also recommend that the browse interface require
no more than ten clicks from the starting point of the
collection to reach any given paper.
5. For larger websites, we recommend that you create an
additional browse interface that lists only the articles added in
the last two weeks. This smaller set of pages can be recrawled
more frequently than your entire browse interface, which will
facilitate timely coverage of your recent papers by the search
robots. New additions can be listed in either the HTML or the
RSS file format.
6. Since Google refers users to your website to read the papers,
your webpages must be available to both users and crawlers
at all times. The search robots will visit your webpages
periodically in order to pick up the updates, as well as to
ensure that your URLs are still available. If search robots are
unable to fetch your webpages, e.g., due to server errors,
misconfiguration, or an overly slow response from your
website, some or all of your articles could drop out of Google
and Google Scholar. Use HTTP 5xx codes to indicate
temporary errors that should be retried soon, such as
temporary shortage of backend capacity. Use HTTP 4xx codes
to indicate permanent errors that should not be retried for
some time, such as inability to find the file in question. If you
need to move your articles to new URLs, set up HTTP 301
redirects from the old location of each article to the new one.
7. If your website uses a robots.txt file, e.g.,
www.example.com/robots.txt, it must not block Google's search
robots from accessing your articles or your browse URLs.
Conversely, it should block robots from accessing large
dynamically generated spaces that aren't useful in the
discovery of your articles, such as shopping carts, comment
forms, or results of your own keyword search. E.g., to let
Google's robots access all URLs on your site, add the following
section to your robots.txt:
User-agent: Googlebot
Allow: /
Or, to block all robots from adding books to your shopping
cart, add the following:
User-agent: *
Disallow: /add_cart.php

Please refer to http://www.robotstxt.org/ for more information
about robots.txt files.
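
As a rough local self-check, the robots.txt rules above can be tested before they are deployed. The sketch below uses Python's standard urllib.robotparser module. The combined file content, the helper name is_allowed, the agent name OtherBot, and the example URLs are assumptions made for illustration. Python's parser may not match Google's own robots.txt handling in every detail, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that combines the two rules from the guidelines.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /add_cart.php
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    checks = [
        ("Googlebot", "http://www.example.com/papers/jpdr2009.pdf"),
        ("OtherBot", "http://www.example.com/papers/jpdr2009.pdf"),
        ("OtherBot", "http://www.example.com/add_cart.php"),
    ]
    for agent, url in checks:
        print(agent, url, is_allowed(ROBOTS_TXT, agent, url))
```

In this sketch, Googlebot matches its own group, so the Disallow rule in the "*" group does not apply to it. If Googlebot should also stay out of the shopping cart, the Disallow line would need to be repeated in the Googlebot group.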

Indexing Guidelines
Google Scholar uses automated software, known as "parsers", to identify bibliographic data of your
papers, as well as references between the papers. Incorrect identification of bibliographic data or
references will lead to poor indexing of your site. Some documents may not be included at all, some
may be included with incorrect author names or titles, and some may rank lower in the search results,
because their (incorrect) bibliographic data would not match (correct) references to them from other
papers. To avoid such problems, you need to provide bibliographic data and references in a way that
automated "parser" software can process.
1. If you're using repository or journal management software,
such as Eprints, DSpace, Digital Commons or OJS, please
configure it to export bibliographic data in HTML "<meta>" tags.
Google Scholar supports Highwire Press tags (e.g.,
citation_title), Eprints tags (e.g., eprints.title), BE Press tags
(e.g., bepress_citation_title), and PRISM tags (e.g.,
prism.title). Use Dublin Core tags (e.g., DC.title) as a last
resort - they work poorly for journal papers because Dublin
Core doesn't have unambiguous fields for journal title,
volume, issue, and page numbers. To check that these tags
are present, visit several abstracts and view their HTML
source.
A. The title tag, e.g., citation_title or DC.title, must
contain the title of the paper. Don't use it for the title of
the journal or a book in which the paper was published,
or for the name of your repository. This tag is required
for inclusion in Google Scholar.
B. The author tag, e.g., citation_author or DC.creator, must
contain the authors (and only the actual authors) of the
paper. Don't use it for the author of the website or for
contributors other than authors, e.g., thesis advisors.
Author names can be listed either as "Smith, John" or as
"John Smith". Put each author name in a separate tag
and omit all affiliations, degrees, certifications, etc. from
this field. At least one author tag is required for
inclusion in Google Scholar.
C. The date tag, e.g., citation_date or DC.issued, must
contain the date of publication, i.e., the date that would
normally be cited in references to this paper from other
papers. Don't use it for the date of entry into the
repository - that should go into citation_online_date
instead. Provide full dates in the "2010/5/12" format if
available; or a year alone otherwise. This tag is required
for inclusion in Google Scholar.
D. For journal and conference papers, provide the
remaining bibliographic citation data in the following
tags: citation_journal_title or citation_conference_title,
citation_issn, citation_isbn, citation_volume,
citation_issue, citation_firstpage, and citation_lastpage.
Dublin Core equivalents are DC.relation.ispartof for
journal and conference titles and the non-standard tags
DC.citation.volume, DC.citation.issue, DC.citation.spage
(start page), and DC.citation.epage (end page) for the
remaining fields. Regardless of the scheme chosen,
these fields must contain sufficient information to
identify a reference to this paper from another
document, which is normally all of: (a) journal or
conference name, (b) volume and issue numbers, if
applicable, and (c) the number of the first page of the
paper in the volume (or issue) in question.
E. For theses, dissertations, and technical reports, provide
the remaining bibliographic citation data in the following
tags: citation_dissertation_institution,
citation_technical_report_institution or DC.publisher for
the name of the institution and
citation_technical_report_number for the number of the
technical report. As with journal and conference papers,
you need to provide sufficient information to recognize a
formal citation to this document from another article.
F. If the value for a tag includes quotes (e.g., "Walk in my
shoes" - a critical analysis), you must escape the quotes
(e.g., &quot;Walk in my shoes&quot; - a critical
analysis).
G. The "<meta>" tags normally apply only to the exact page
on which they're provided. If this page shows only the
abstract of the paper and you have the full text in a
separate file, e.g., in the PDF format, please specify the
locations of all full text versions using citation_pdf_url
or DC.identifier tags. The content of the tag is the
absolute URL of the PDF file; for security reasons, it
must refer to a file in the same subdirectory as the
HTML abstract. Failure to link the alternate versions
together could result in the incorrect indexing of the
PDF files, because these files would be processed as
separate documents without the information contained
in the meta tags.
Example:
<meta name="citation_title" content="The testis isoform of the phosphorylase kinase catalytic subunit (PhK-T) plays a critical role in regulation of glycogen mobilization in developing lung">
<meta name="citation_author" content="Liu, Li">
<meta name="citation_author" content="Rannels, Stephen R.">
<meta name="citation_author" content="Falconieri, Mary">
<meta name="citation_author" content="Phillips, Karen S.">
<meta name="citation_author" content="Wolpert, Ellen B.">
<meta name="citation_author" content="Weaver, Timothy E.">
<meta name="citation_date" content="1996/05/17">
<meta name="citation_journal_title" content="Journal of Biological Chemistry">
<meta name="citation_volume" content="271">
<meta name="citation_issue" content="20">
<meta name="citation_firstpage" content="11761">
<meta name="citation_pdf_url" content="http://www.example.com/content/271/20/11761.full.pdf">
Keep in mind that, regardless of the meta-tag scheme chosen, you need to provide at least
three fields: (1) the title of the article, (2) the full name of at least the first author, and (3) the
year of publication. Pages that don't provide any one of these three fields will be processed as
if they had no meta tags at all. Likewise, webpages that don't contain the metatags themselves
and aren't linked from other pages that do contain the meta tags using citation_pdf_url
or DC.identifier, including both HTML and PDF files, will also be processed as
independent documents without the meta tags.
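
The tag rules above can be sketched in a few lines of Python: generate the tags with properly escaped quotes, then verify that the three required fields are present. This is a minimal illustration using only the standard library and the Highwire Press tag names. The helper names build_meta_tags and missing_required are hypothetical, and the check says nothing about how Google's parsers actually behave.

```python
from html import escape
from html.parser import HTMLParser

# The three fields the guidelines call required, in Highwire Press naming.
REQUIRED = ("citation_title", "citation_author", "citation_date")


def build_meta_tags(title, authors, date, pdf_url=None):
    """Return one <meta> tag per line; one citation_author tag per author."""
    fields = [("citation_title", title)]
    fields += [("citation_author", author) for author in authors]
    fields.append(("citation_date", date))
    if pdf_url:
        fields.append(("citation_pdf_url", pdf_url))
    # escape(..., quote=True) turns double quotes into &quot;
    return "\n".join(
        f'<meta name="{name}" content="{escape(value, quote=True)}">'
        for name, value in fields
    )


class MetaCollector(HTMLParser):
    """Collect the names of all <meta> tags that have non-empty content."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attributes = dict(attrs)
            if attributes.get("name") and attributes.get("content"):
                self.names.add(attributes["name"])


def missing_required(html_text):
    """Return the required tag names that are absent from html_text."""
    collector = MetaCollector()
    collector.feed(html_text)
    return [name for name in REQUIRED if name not in collector.names]


if __name__ == "__main__":
    tags = build_meta_tags(
        '"Walk in my shoes" - a critical analysis',
        ["Smith, John"],
        "2010/5/12",
    )
    print(tags)
    print("missing:", missing_required(tags))
```

A page that uses one of the other supported schemes (Eprints, BE Press, PRISM, or Dublin Core) would need the corresponding tag names in REQUIRED instead.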
2. If it's not practical for you to implement the HTML "<meta>"
tags, e.g., if your papers are only available in the PDF format,
then the document needs to be visually laid out according to
the following conventions.
A. The title of the paper must be the largest chunk of text
on top of the page. Either use font size of at least 24 pt.
in PDF, or place the title inside an "<h1>" or an "<h2>" tag
in HTML, or use a CSS class named "citation_title".
Make sure that all other text on the page, in particular
the name of the repository or the journal, is set in a
smaller font than the title of the paper - otherwise, this
other, larger, text may be incorrectly interpreted as the
title of the paper.
B. The authors of the paper must be listed right before or
right after the title, in a slightly smaller font that is still
larger than normal text. Either use a 16-23 pt. font in
PDF, or place the authors inside an "<h3>" tag in HTML,
or wrap them in a CSS class named "citation_author".
Make sure the names of the repository and the journal,
as well as the text of the section headings, are set in a
smaller font than the authors of the paper - otherwise,
this other, larger, text may be incorrectly interpreted as
the authors. Use "Sentence case" as opposed to "Title
Case" for section headings, etc., to avoid confusion
with author names. Separate multiple author names
with commas or semicolons and omit their affiliations,
degrees, and certifications from the author line. Use an
explicit format such as "by John Smith" or "Author: John
Smith", if appropriate.
C. Include a bibliographic citation to a published version of
the paper on a line by itself, and place it inside the
header or the footer of the first page in the PDF file, or
next to the title and the authors in HTML. Use an explicit
citation format, e.g.: "J. Biol. Chem., vol. 234, no. 8, pp.
1971-1975, August 1959". If the paper is unpublished,
include the full date of its present version on a line by
itself, e.g., "August 12, 2009".
D. Avoid use of Type 3 fonts in PDF files, because they're
often generated with missing or incorrect font size and
character encoding information, which makes it difficult
for our parser software to extract the bibliographic data.
You can check the types of the fonts under the File ->
Properties... menu in Adobe Acrobat Reader. If you're
using LaTeX, consider switching to Type 1 fonts, e.g.,
\usepackage{times}, \usepackage{helvet}, or
\usepackage{palatino}.
Please understand that it's not possible for our automated parsers to correctly identify
bibliographic data in such free-form formats with 100% accuracy; and that failure to correctly
identify certain fields can lead to exclusion of your papers from Google Scholar. If you're not
satisfied with the accuracy of your Google Scholar results, you need to create HTML pages
with abstracts and add "<meta>" tags to them, as described above.
3. Place each article and each abstract in a separate HTML or
PDF file. At this time, we're unable to effectively index
multiple abstracts on the same page or multiple papers in the
same PDF file. Likewise, we're unable to index different
sections of the same paper in different files. Each paper must
be in its own unique file in order for it to be included in Google
Scholar.
4. Mark the section of the paper that contains references to
other works with a standard heading, such as "References" or
"Bibliography", on a line just by itself. Individual references
inside this section should be either numbered "1. - 2. - 3." or
"[1] - [2] - [3]" in PDF, or put inside an "<ol>" list in HTML. The
text of each reference must be a formal bibliographic citation
in a commonly used format, without free-form commentary.
Please understand that the references are identified automatically by the parser software;
they're not entered or corrected by human operators. While we try to support most common
reference formats, it is not possible to guarantee that all references are identified correctly;
and incorrect identification of references could lead to exclusion of your papers from Google
Scholar or to low ranking of your papers in the search results.
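
Authors who want a quick sanity check of their own reference sections can approximate the layout rule from item 4 with a short script. The sketch below looks for a "References" or "Bibliography" heading on a line by itself in the extracted text of a paper, then counts the entries numbered as "1." or "[1]". The function name, the regular expressions, and the placeholder sample are assumptions for illustration. They are not a description of the parser software.

```python
import re

# A heading such as "References" or "Bibliography" on a line by itself.
HEADING = re.compile(r"^\s*(References|Bibliography)\s*$", re.IGNORECASE)
# An entry that starts with "1." or "[1]" followed by some text.
NUMBERED = re.compile(r"^\s*(\[\d+\]|\d+\.)\s+\S")


def count_numbered_references(text):
    """Return the number of numbered entries after the first references
    heading, or None if no such heading is found."""
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if HEADING.match(line):
            following = lines[index + 1:]
            return sum(1 for entry in following if NUMBERED.match(entry))
    return None


# Placeholder text; the entries are made up and cite no real papers.
SAMPLE = """\
Conclusion
Some closing text.

References
[1] A. Author, Placeholder title, Placeholder Journal, vol. 1, no. 1,
    pp. 1-10, 2009.
[2] B. Author, Another placeholder title, Placeholder Journal, vol. 2,
    no. 3, pp. 11-20, 2010.
"""

if __name__ == "__main__":
    print(count_numbered_references(SAMPLE))
```

A result of None or 0 for a paper that does cite other works suggests that the heading or the numbering deviates from the conventions described above.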

Frequently Asked Questions
Common Questions
1. I'm a publisher of scholarly works and would like to have my
content included in Google and Google Scholar?
2. I publish scholarly textbooks and monographs. Can my
content be included in Google Scholar?
3. Can I see usage statistics for my content?
4. What do I do if I believe you're linking to a webpage that
infringes my copyright?
Technical Questions
1. My articles are in PDF format. Can you still index my site?
2. How can I tell if a PDF file has searchable text?
3. Some of my articles are split into multiple files, one file per
section. Can you work with these?
4. How do I remove a listing from your search results?
5. I see a 'cached' (or 'View as HTML') link for my
access-controlled articles. I need to have this fixed right away!
6. Is there anything I can do to help rank my articles better?
7. All my articles are available to your crawlers, but not all of
them seem to show up in Google Scholar. Can I do something
to help improve coverage?

Common Questions
1. I'm a publisher of scholarly works and would like to
have my content included in Google and Google
Scholar?

Your content is most welcome. As noted above, an abstract
(at least) of each work must be available to non-subscribers
who come from Google and Google Scholar. Please configure
your website according to our technical inclusion guidelines,
and then contact us to consider it for inclusion in Google
Scholar.

2. I publish scholarly textbooks and monographs. Can my
content be included in Google Scholar?

Maybe. For now, Google Scholar indexes mostly scholarly
articles. For textbooks and monographs, we recommend
Google Book Search. Google Scholar automatically includes
scholarly works from Google Book Search.

3. Can I see usage statistics for my content?

Since users click through to your website, your web server
logs should have all the usage statistics.

4. What do I do if I believe you're linking to a webpage
that infringes my copyright?

It is our policy to respond to notices of alleged infringement
that comply with the Digital Millennium Copyright Act. For
directions and more information, please click here.

Technical Questions
1. My articles are in PDF format. Can you still index my
site?

Yes. We can index PDF articles as long as they're searchable
and as long as their size doesn't exceed 5MB. For larger
documents and for scanned images that require OCR, we
recommend Google Book Search.

2. How can I tell if a PDF file has searchable text?

Open the file in Adobe Acrobat Reader. Click 'Find' (look for
the binocular icon), and confirm that you can search for and
find several words on the page.

3. Some of my articles are split into multiple files, one file
per section. Can you work with these?

Alas, we can't. We can index only one file per article at the
moment.

4. How do I remove a listing from your search results?

Refer to Google webmaster help here.

5. I see a 'cached' (or 'View as HTML') link to my
access-controlled articles. I need to have this fixed right away!

Of course! Please email us with specific examples of where
the links appear; we'll investigate and fix as soon as possible.
This is not intentional but may happen due to technical issues.
For example, our methodical crawlers may accidentally
discover a forgotten alternative interface to your content.
You'll need to tell us of all such interfaces, because crawlers
can go places where you least expect them. Please email us
and we'll look into it.

If you believe another site is infringing your copyright, please
see our directions on the DMCA process.

6. Is there anything I can do to help rank my articles
better?

Indeed you can. Our indexing algorithms automatically extract
bibliographic data, citations and other information from
articles and use it for ranking purposes. Providing
authoritative metadata about your articles can help facilitate
this and can increase the likelihood of identifying all the
citations to your articles. We strongly recommend this
approach. Please refer to the following technical inclusion
guidelines for the details of how to implement it.

7. All my articles are available to your crawlers, but not
all of them seem to show up in Google Scholar. Can I do
something to help improve coverage?

Yes. Gaps in coverage are certainly not intentional; but they
could be caused by a number of different technical issues in
the automatic processing of your website by our search
robots. The troubleshooting section in our technical inclusion
guidelines describes ways to identify and fix common
coverage issues. We encourage you to take a look.
