Sunday, December 18, 2011

Web crawler

A web crawler is a computer program that browses the World Wide Web in a methodical, automated manner. Other terms for Web crawlers are ants, automatic indexers, bots, Web spiders, Web robots, or, especially in the FOAF community, Web scutters.

This process is called Web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a Web site, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for sending spam).

A Web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
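
A minimal sketch of this seed-and-frontier loop, using only the Python standard library; the seed URLs, page limit, and the absence of politeness and robots.txt handling are simplifying assumptions:

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(seeds, max_pages=50):
    frontier = deque(seeds)          # the crawl frontier, initialized with the seeds
    visited = set()
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except Exception:
            continue                  # skip unreachable or non-HTML pages
        visited.add(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)        # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)        # grow the frontier
    return visited

# crawl(["http://example.com/"])     # illustrative seed list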

The large volume of the Web implies that the crawler can only download a fraction of the pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler finishes its crawl, many pages may already have been updated or even deleted.

The number of possible crawlable URLs being generated by server-side software has also made it difficult for web crawlers to avoid retrieving duplicate content. Endless combinations of HTTP GET (URL-based) parameters exist, of which only a small selection will actually return unique content. For example, a simple online photo gallery may offer several presentation options to users, as specified through HTTP GET parameters in the URL. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then the same set of content can be accessed with 48 different URLs, all of which may be linked on the site. This combinatorial explosion creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.
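
The arithmetic behind the example is 4 × 3 × 2 × 2 = 48 distinct URLs for one underlying set of images. A short sketch, with hypothetical parameter names, enumerates the variants and hints at the usual remedy, URL canonicalization:

from itertools import product
from urllib.parse import urlencode

sort_orders = ["date", "name", "size", "rating"]    # 4 ways to sort
thumb_sizes = ["small", "medium", "large"]          # 3 thumbnail sizes
formats = ["jpg", "png"]                            # 2 file formats
show_user_content = ["0", "1"]                      # on/off toggle

urls = [
    "http://example.com/gallery?" + urlencode(
        {"sort": s, "thumb": t, "fmt": f, "user": u})
    for s, t, f, u in product(sort_orders, thumb_sizes, formats, show_user_content)
]
print(len(urls))  # 48 -- every URL serves the same underlying set of images

# A crawler can collapse such duplicates by canonicalizing the URL, for example
# by dropping purely presentational parameters before adding it to the frontier.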

As Edwards et al. noted, "Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained." A crawler must carefully choose at each step which pages to visit next.

The behavior of a Web crawler is the outcome of a combination of policies:
  • a selection policy that states which pages to download,
  • a re-visit policy that states when to check for changes to the pages,
  • a politeness policy that states how to avoid overloading Web sites (a minimal sketch follows this list), and
  • a parallelization policy that states how to coordinate distributed Web crawlers.
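
As one example of the politeness policy mentioned above, the sketch below enforces a minimum delay between requests to the same host; the two-second delay and URLs are illustrative assumptions:

import time
from urllib.parse import urlparse

class PolitenessScheduler:
    def __init__(self, delay_seconds=2.0):
        self.delay = delay_seconds
        self.last_request = {}          # host -> timestamp of last fetch

    def wait_for(self, url):
        """Sleep just long enough so the same host is not hit too often."""
        host = urlparse(url).netloc
        now = time.monotonic()
        elapsed = now - self.last_request.get(host, 0.0)
        if elapsed < self.delay:
            time.sleep(self.delay - elapsed)
        self.last_request[host] = time.monotonic()

scheduler = PolitenessScheduler()
for url in ["http://example.com/a", "http://example.com/b", "http://example.org/"]:
    scheduler.wait_for(url)            # blocks about 2 s before re-hitting example.com
    # the actual fetch would go here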

Search engine indexing

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics, and computer science. An alternate name for the process in the context of search engines designed to find web pages on the Internet is Web indexing.

Popular engines focus on the full-text indexing of online, natural language documents. Media types such as video, audio, and graphics are also searchable.

Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines index in real time.
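
As a concrete illustration of the core data structure behind full-text indexing, here is a minimal inverted index built over a toy corpus; the documents are invented for the example:

import re
from collections import defaultdict

documents = {
    1: "Web crawlers download pages for the search engine",
    2: "The search engine indexes the downloaded pages",
    3: "Users query the index to find pages",
}

def tokenize(text):
    """Lowercase and split on non-letters; real indexers also strip stop words, stem, etc."""
    return re.findall(r"[a-z]+", text.lower())

inverted_index = defaultdict(set)      # term -> set of document ids (the posting list)
for doc_id, text in documents.items():
    for term in tokenize(text):
        inverted_index[term].add(doc_id)

print(sorted(inverted_index["pages"]))   # [1, 2, 3]
print(sorted(inverted_index["query"]))   # [3]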

Web search query

A web search query is a query that a user enters into a web search engine to satisfy his or her information needs. Web search queries are distinctive in that they are unstructured and often ambiguous; they differ greatly from standard query languages, which are governed by strict syntax rules.

Types

There are three broad categories that cover most web search queries:

Informational queries – Queries that cover a broad topic (e.g., colorado or trucks) for which there may be thousands of relevant results.

Navigational queries – Queries that seek a single website or web page of a single entity (e.g., youtube or delta air lines).

Transactional queries – Queries that reflect the intent of the user to perform a particular action, like purchasing a car or downloading a screen saver.

Search engines often support a fourth type of query that is used far less frequently:

Connectivity queries – Queries that report on the connectivity of the indexed web graph (e.g., Which links point to this URL?, and How many pages are indexed from this domain name?).

Characteristics
Studies of search engine query logs have reported the following:
  • The average length of a search query was 2.4 terms.
  • About half of the users entered a single query, while a little less than a third entered three or more unique queries.
  • Close to half of the users examined only the first one or two pages of results (10 results per page).
  • Less than 5% of users used advanced search features (e.g., Boolean operators like AND, OR, and NOT).
  • Among the most frequently used query terms were the empty search, "and", and "of".

Web search engine

A web search engine is designed to search for information on the World Wide Web and FTP servers. The search results are generally presented in a list, often referred to as SERPs ("search engine results pages"). The results may consist of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained only by human editors, search engines also maintain real-time information by running an algorithm on a web crawler.

How web search engines work

A search engine operates in the following order:
  1. Web crawling
  2. Indexing
  3. Searching
Web search engines work by storing information about many web pages, which they retrieve from the HTML itself. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated Web browser which follows every link on the site. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. A query can be as simple as a single word. The purpose of an index is to allow information to be found as quickly as possible.
Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. The cached page always holds the actual text that was indexed, so it can be very useful when the content of the live page has been updated and the search terms no longer appear in it. This problem might be considered a mild form of linkrot, and Google's handling of it increases usability by satisfying the principle of least astonishment: the user normally expects the search terms to appear on the returned pages. Cached pages are therefore very useful for search relevance, and they may even contain data that is no longer available elsewhere.
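
As a rough sketch of two of the steps just described, the following uses the standard library's robotparser to honor robots.txt exclusions and a small parser to pull the title and meta keywords out of a fetched page; the user agent string and sample HTML are illustrative assumptions:

from html.parser import HTMLParser
from urllib import robotparser
from urllib.parse import urlparse

def allowed_to_crawl(url, user_agent="ExampleBot"):
    """Return True if the site's robots.txt permits fetching this URL."""
    parts = urlparse(url)
    rp = robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()                                   # downloads and parses robots.txt
    return rp.can_fetch(user_agent, url)

class TitleMetaExtractor(HTMLParser):
    """Pulls the <title> text and the keywords meta tag for indexing."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.keywords = ""
    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            self.keywords = attrs.get("content") or ""
    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False
    def handle_data(self, data):
        if self.in_title:
            self.title += data

extractor = TitleMetaExtractor()
extractor.feed("<html><head><title>Example page</title>"
               "<meta name='keywords' content='crawler, index'></head></html>")
print(extractor.title, "|", extractor.keywords)   # Example page | crawler, index
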
When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. The index is built from the information stored with the data and the method by which the information is indexed. Unfortunately, there are currently no known public search engines that allow documents to be searched by date. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Boolean operators are for literal searches that allow the user to refine and extend the terms of the search; the engine looks for the words or phrases exactly as entered. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords. There is also concept-based searching, where the search relies on statistical analysis of pages containing the words or phrases you search for. Finally, natural language queries allow the user to type a question in the same form one would ask it to a human; Ask.com is an example of such a site.
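
Boolean operators map naturally onto set operations over an inverted index's posting lists. A minimal sketch, using a small hand-built index and invented queries:

all_docs = {1, 2, 3}
index = {
    "search": {1, 2},
    "engine": {1, 2},
    "query":  {3},
    "pages":  {1, 2, 3},
}

def AND(a, b): return a & b          # intersection of posting lists
def OR(a, b):  return a | b          # union of posting lists
def NOT(a):    return all_docs - a   # complement against the whole collection

# "search AND engine"    -> documents 1 and 2
print(AND(index["search"], index["engine"]))
# "pages AND NOT query"  -> documents 1 and 2
print(AND(index["pages"], NOT(index["query"])))
# "query OR engine"      -> documents 1, 2 and 3
print(OR(index["query"], index["engine"]))
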
The usefulness of a search engine depends on the relevance of the result set it gives back. While there may be millions of web pages that include a particular word or phrase, some pages may be more relevant, popular, or authoritative than others. Most search engines employ methods to rank the results to provide the "best" results first. How a search engine decides which pages are the best matches, and what order the results should be shown in, varies widely from one engine to another. The methods also change over time as Internet usage changes and new techniques evolve. There are two main types of search engine that have evolved: one is a system of predefined and hierarchically ordered keywords that humans have programmed extensively. The other is a system that generates an "inverted index" by analyzing texts it locates. This second form relies much more heavily on the computer itself to do the bulk of the work.
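
To make "best results first" concrete, here is a small sketch that ranks toy documents with a simple tf-idf score; tf-idf is a standard textbook weighting, and production engines combine many more signals (link analysis, freshness, personalization), so this is only an illustration of the idea:

import math
import re
from collections import Counter

documents = {
    1: "web crawlers download pages for the search engine",
    2: "the search engine indexes the downloaded pages and more pages",
    3: "users query the index to find pages",
}
tokenized = {d: re.findall(r"[a-z]+", t) for d, t in documents.items()}
N = len(documents)

def tf_idf_score(query, doc_terms):
    counts = Counter(doc_terms)
    score = 0.0
    for term in query.split():
        df = sum(1 for terms in tokenized.values() if term in terms)  # document frequency
        if df == 0:
            continue
        tf = counts[term] / len(doc_terms)          # term frequency in this document
        idf = math.log(N / df)                      # rarer terms weigh more
        score += tf * idf
    return score

query = "search pages"
ranking = sorted(tokenized, key=lambda d: tf_idf_score(query, tokenized[d]), reverse=True)
print(ranking)   # document ids ordered from best to worst match
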
Most Web search engines are commercial ventures supported by advertising revenue and, as a result, some allow advertisers to pay to have their listings ranked higher in search results. Search engines that do not accept money for placement in their results make money by running search-related ads alongside the regular results; the engine earns money every time someone clicks on one of these ads.

Search engine optimization (SEO)

Search engine optimization (SEO) is the process of improving the visibility of a website or a web page in search engines via the "natural" or un-paid ("organic" or "algorithmic") search results. In general, the earlier (or higher ranked on the search results page), and more frequently a site appears in the search results list, the more visitors it will receive from the search engine's users. SEO may target different kinds of search, including image search, local search, video search, academic search, news search and industry-specific vertical search engines.

As an Internet marketing strategy, SEO considers how search engines work, what people search for, the actual search terms typed into search engines, and which search engines are preferred by the targeted audience. Optimizing a website may involve editing its content, HTML, and associated coding both to increase its relevance to specific keywords and to remove barriers to the indexing activities of search engines. Promoting a site to increase the number of backlinks, or inbound links, is another SEO tactic.

The acronym "SEOs" can refer to "search engine optimizers," a term adopted by an industry of consultants who carry out optimization projects on behalf of clients, and by employees who perform SEO services in-house. Search engine optimizers may offer SEO as a stand-alone service or as a part of a broader marketing campaign. Because effective SEO may require changes to the HTML source code of a site and site content, SEO tactics may be incorporated into website development and design. The term "search engine friendly" may be used to describe website designs, menus, content management systems, images, videos, shopping carts, and other elements that have been optimized for the purpose of search engine exposure.

Methods

SEO methods fall into three broad groups: getting pages indexed, preventing the crawling of content that should not appear in search results, and increasing the prominence of pages within results.

Image search optimization

Image search optimization is the process of organizing the content of a webpage to increase its relevance to a specific keyword on image search engines. Like search engine optimization, the aim is to achieve a higher organic search listing and thus increase the volume of traffic from search engines.
Image search optimization techniques can be viewed as a subset of search engine optimization techniques that focus on gaining high ranks in image search engine results.
Unlike the normal SEO process, there is not much to do for ISO. Making high-quality images accessible to search engines and providing some description of the images is almost all that can be done.

As a marketing strategy

SEO is not an appropriate strategy for every website, and other Internet marketing strategies can be more effective, depending on the site operator's goals. A successful Internet marketing campaign may also depend upon building high quality web pages to engage and persuade, setting up analytics programs to enable site owners to measure results, and improving a site's conversion rate.

SEO may generate an adequate return on investment. However, search engines are not paid for organic search traffic, their algorithms change, and there are no guarantees of continued referrals. Due to this lack of guarantees and certainty, a business that relies heavily on search engine traffic can suffer major losses if the search engines stop sending visitors. It is considered wise business practice for website operators to liberate themselves from dependence on search engine traffic. Seomoz.org has suggested that "search marketers, in a twist of irony, receive a very small share of their traffic from search engines." Instead, their main sources of traffic are links from other websites.

International markets

Optimization techniques are highly tuned to the dominant search engines in the target market. The search engines' market shares vary from market to market, as does competition. In 2003, Danny Sullivan stated that Google represented about 75% of all searches.[49] In markets outside the United States, Google's share is often larger, and Google remained the dominant search engine worldwide as of 2007. As of 2006, Google had an 85-90% market share in Germany. While there were hundreds of SEO firms in the US at that time, there were only about five in Germany. As of June 2008, Google's market share in the UK was close to 90%, according to Hitwise. Similar market shares are seen in a number of countries.

As of 2009, there are only a few large markets where Google is not the leading search engine. In most cases, when Google is not leading in a given market, it is lagging behind a local player. The most notable markets where this is the case are China, Japan, South Korea, Russia and the Czech Republic where respectively Baidu, Yahoo! Japan, Naver, Yandex and Seznam are market leaders.
Successful search optimization for international markets may require professional translation of web pages, registration of a domain name with a top level domain in the target market, and web hosting that provides a local IP address. Otherwise, the fundamental elements of search optimization are essentially the same, regardless of language.

nofollow

nofollow is a value that can be assigned to the rel attribute of an HTML a element to instruct some search engines that a hyperlink should not influence the link target's ranking in the search engine's index. It is intended to reduce the effectiveness of certain types of search engine spam, thereby improving the quality of search engine results and preventing spamdexing from occurring.

Example:
<a href="http://www.example.com/" rel="nofollow">Link text</a>
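
As an illustration of how a crawler might honor the attribute, here is a minimal sketch that collects only links whose rel attribute does not contain nofollow; the helper class and HTML snippet are invented for the example:

from html.parser import HTMLParser

class FollowableLinkExtractor(HTMLParser):
    """Collects hrefs from <a> tags whose rel attribute does not contain 'nofollow'."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if "nofollow" not in rel and attrs.get("href"):
            self.links.append(attrs["href"])

parser = FollowableLinkExtractor()
parser.feed('<a href="http://www.example.com/" rel="nofollow">Link text</a>'
            '<a href="http://www.example.org/">Followed link</a>')
print(parser.links)   # ['http://www.example.org/']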

Control internal PageRank flow

Search engine optimization professionals started using the nofollow attribute to control the flow of PageRank within a website, a practice known as "PageRank sculpting". Google has since changed its handling of the attribute, and any link with a nofollow attribute now reduces the PageRank that the page can pass on. This is an entirely different use than originally intended: nofollow was designed to control the flow of PageRank from one website to another. Nevertheless, some SEOs have suggested that a nofollow used for an internal link should work just like a nofollow used for external links.
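
For readers unfamiliar with the quantity being "sculpted", a toy power-iteration PageRank over an invented four-page site shows how rank flows along links; this is the textbook formulation, not Google's production algorithm:

# Toy PageRank by power iteration over an invented four-page link graph.
links = {
    "home":     ["about", "products", "contact"],
    "about":    ["home"],
    "products": ["home", "contact"],
    "contact":  ["home"],
}
damping = 0.85
pages = list(links)
rank = {p: 1.0 / len(pages) for p in pages}

for _ in range(50):                              # iterate until the ranks settle
    new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
    for page, outlinks in links.items():
        share = damping * rank[page] / len(outlinks)
        for target in outlinks:
            new_rank[target] += share            # each outlink passes an equal share
    rank = new_rank

for page, score in sorted(rank.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
# Removing "contact" from home's outlinks (what PageRank sculpting tried to do)
# changes how the remaining pages' shares are distributed.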

Several SEOs have suggested that pages such as "About Us", "Terms of Service", "Contact Us", and "Privacy Policy" pages are not important enough to earn PageRank, and so should have nofollow on internal links pointing to them. Google employee Matt Cutts has provided indirect responses on the subject, but has never publicly endorsed this point of view.

The practice is controversial and has been challenged by some SEO professionals, including Shari Thurow and Adam Audette. Site search proponents have pointed out that visitors do search for these types of pages, so using nofollow on internal links pointing to them may make it difficult or impossible for visitors to find these pages in site searches powered by major search engines.
Although proponents of using nofollow on internal links have cited a misattributed statement by Matt Cutts in support of the technique, Cutts himself never actually endorsed the idea. Several Google employees (including Cutts) have urged Webmasters not to focus on manipulating internal PageRank. Google employee Adam Lasnik has advised webmasters that there are better ways (e.g., click hierarchy) than nofollow to "sculpt a bit of PageRank", but that the option is available and "we're not going to frown upon it".
No reliable data has been published on the effectiveness, or potential harm, of using nofollow on internal links. Unsubstantiated claims have been challenged throughout the debate, and some early proponents of the idea have subsequently cautioned people not to view the use of nofollow on internal links as a silver bullet or quick-success solution.
The general consensus seems to favor using nofollow on internal links that point to user-controlled pages which may be subject to spam link practices, including user profile pages, user comments, forum signatures and posts, calendar entries, and so on.
YouTube, a Google company, uses nofollow on a number of internal 'help' and 'share' links.

Affiliate marketing


Affiliate marketing is a marketing practice in which a business rewards one or more affiliates for each visitor or customer brought about by the affiliate's own marketing efforts. Examples include rewards sites, where users are rewarded with cash or gifts for completing an offer and for referring others to the site. The industry has four core players: the merchant (also known as the 'retailer' or 'brand'), the network (which contains offers for the affiliate to choose from and also takes care of the payments), the publisher (also known as 'the affiliate'), and the customer. The market has grown in complexity, warranting a secondary tier of players, including affiliate management agencies, super-affiliates, and specialized third-party vendors.
Affiliate marketing overlaps with other Internet marketing methods to some degree, because affiliates often use regular advertising methods. Those methods include organic search engine optimization (SEO), paid search engine marketing (PPC - Pay Per Click), e-mail marketing, and in some sense display advertising. On the other hand, affiliates sometimes use less orthodox techniques, such as publishing reviews of products or services offered by a partner.
Affiliate marketing, using one website to drive traffic to another, is a form of online marketing that is frequently overlooked by advertisers. While search engines, e-mail, and website syndication capture much of the attention of online retailers, affiliate marketing carries a much lower profile. Still, affiliates continue to play a significant role in e-retailers' marketing strategies.