Search engine development trends. Prospects for the development of search engines
Search engine ranking algorithms are constantly evolving and improving. The main goals of this development are to provide high-quality search for users and to make it as difficult as possible for website optimizers to manipulate search results.
These goals are interrelated, since the quality of the search directly depends on the ability or impossibility of influencing it by interested parties.
When the search engines Yandex and Google were just starting out, their ranking algorithms were primitive, which made them quite easy to manipulate. Page relevance was strongly influenced by the following factors: meta tags, keyword density on the page, and highlight tags. However, this allowed "black hat" optimizers, who promoted sites aimed not at people but at search engines in order to capitalize on the flow of visitors, to degrade the overall quality of search.
As a result, search engines no longer take into account the Keywords meta tag or, apparently, the Description tag, which is now used only to form the snippet (a short description of the page) on Google. The importance of other internal optimization factors that allowed malicious manipulation of search results has also decreased.
Then optimizers found that the number of external links to a site, as well as their anchors, affects the position of the site in the search results. Thousands of site directories and programs for automatic submission to them immediately appeared (the most famous program of this kind is AllSubmitter).
Search engines fairly quickly began to disregard most site directories, dramatically reducing the effectiveness of the mass directory submissions that optimizers had come to rely on.
After that, effective attempts to manipulate the SERPs came to consist mainly of buying links from regular sites that were not built on directory scripts.
Very soon, search engines learned to recognize crude link selling and imposed sanctions, in the form of a filter or a ban, on sites created solely to sell links. Moreover, in some cases, sanctions may also apply to the sites for which links are purchased.
All stages of development of search engines represent the following logical chain:
1. Some basic ranking algorithm is created.
2. Optimizers identify weaknesses in it and begin to massively manipulate search results.
3. Search engines seriously adjust the ranking algorithm, changing the degree of influence of certain factors.
4. Optimizers analyze these changes, adapt to new conditions, and again begin to manipulate the search massively.
Lately, however, search engine ranking algorithms have not only been changing the significance of various factors but have also been changing qualitatively as a whole.
Comprehensive consideration of hundreds of different factors is becoming the norm, and the single ranking formula is being abandoned in favor of a matrix system. An example of this is the Yandex algorithm "Snezhinsk" (a description of this algorithm is given on the page http://seo-in.ru/poiskovaya-optimizaciya/62-snezhinsk.html).
Under the new system, each individual query gets its own ranking formula, which may be completely different from the ranking formula for other queries. Whereas earlier it was quite easy to identify common dependencies in the principles of search engine ranking, in the future there will simply be no common dependencies.
Paid tools for website promotion will most likely remain, but their use will probably become uneconomical. This is exactly the situation that is observed now in the English-speaking sector of the Internet.
In the near future, a combination of the following main factors will have the greatest effect on website promotion:
- a large array of quality content (unique and useful);
- site trust;
- site age;
- reasonable internal optimization.
Any special technical trick based on identifying weaknesses in ranking algorithms is likely to lose relevance. At least, that is where everything is heading.
KOVROV STATE TECHNOLOGICAL ACADEMY
Information and analytical report on informatics
on the topic: “Modern search engines, development trends of one of the market leaders, Yandex”.
Completed by: 1st year student
of academic group 3
Makarov Ivan
Introduction.
Main part.
Conclusion.
Introduction.
Yandex is a Russian IT company that owns a search engine of the same name on the Web and an Internet portal. The Yandex search engine is the eighth largest search site in the world in terms of the number of processed search queries (1.290 billion, statistics for August 2009) and the second largest non-English search server after the Chinese Baidu.
The company's website was opened on September 23, 1997. 2000 is the year of the formation of Yandex. Yandex was founded by CompTek (the company that developed the Yandex search engine and supported it). The company reached self-sufficiency in 2002, turnover for 2006 - 72.6 million dollars, net profit - 29.9 million, for 2005 - 35.6 million dollars, net profit - 13.6 million.
The company's main priority is the development of the search engine, but over the years Yandex has become a multi-service portal. As of 2009, Yandex has more than 30 services. The most popular are Yandex.News, Yandex.Fotki, Yandex.Toys and others.
The main office of the company is located in Moscow. The company has offices in St. Petersburg, Yekaterinburg, Odessa, Simferopol and Kyiv. In mid-June 2008, the company announced the opening of Yandex Labs - an office in the US, California.
Main part.
History of the company.
The Yandex.Ru search engine was officially announced on September 23, 1997 at the Softool exhibition. The main distinguishing features of Yandex.Ru at that time were verification of the uniqueness of documents (the exclusion of copies in different encodings) and the key properties of the Yandex search engine: taking into account the morphology of the Russian language (including search by exact word form), search that takes distance into account (including within a paragraph or as an exact phrase), and a carefully designed algorithm for assessing relevance (how well the response matches the query). This algorithm takes into account not only the number of query words found in the text, but also the “contrast” of a word (its relative frequency for a given document), the distance between words, and the position of the word in the document.
A little later, in the section "Tales" (observations on the content of the Russian Internet), the first tale of the Runet appeared - "Web - humanism or chernukha?". And in the "Numbers" section - the first estimate of the volume of the Runet, 5 thousand servers and 4 GB of texts.
Two months later, in November 1997, natural language queries were implemented. Since then, Yandex.Ru can be addressed simply “in Russian”, with long queries such as “where to buy a computer”, “genetically modified products” or “international telephone communication”, and return accurate answers. The average length of a query in Yandex.Ru is now 2.7 words. In 1997, when search engine users were accustomed to a telegraphic style, it was 1.2 words.
In 1998, Yandex.Ru introduced the ability to “find a similar document”, a list of found servers, search within a given date range, and sorting of search results by time of last change. During this year, the "volume" of the Russian Internet doubled, which made it necessary to optimize search. Both then and now (with a volume of 200 GB), a search on Yandex.Ru takes a fraction of a second.
During 1999, the Runet grew by an order of magnitude, both in the volume of texts and in the number of users. It was a year of rapid development for Yandex.Ru as well. The new search robot made it possible to optimize and speed up the crawling of Runet sites. Today, the Yandex.Ru search base is twice as large as that of its closest competitors.
The new robot made it possible to offer users new features: search in different text areas (headings, links, annotations, addresses, picture captions), limiting the search to a group of sites, searching for links and images, and singling out documents in Russian. Search within catalog categories appeared, and for the first time in the Runet the concept of a "citation index" was introduced - the number of resources that refer to a given one.
Throughout the year, work continued on the quantitative and qualitative analysis of the Runet. The NINI-index was opened (index "Inconsistency of Interests of the Population of the Internet"), showing the dynamics of changes in the interests of Internet users. A search forum and a new service were opened - subscription to a query: you can leave your query on Yandex.Ru and regularly receive information by e-mail about new and/or modified documents that match it. By the beginning of the school year, "Family Yandex" was opened, which filters obscene language and pornography out of search results.
The origin of the word "Yandex".
Today "Yandex" is a word from the everyday life of an Internet user. Phrases such as “What, has Yandex been canceled already?”, “Loneliness is when Yandex is the first to congratulate you on your birthday” and “All questions to Yandex” are common on the Web. Many already think that it has always been this way. In a way, this is true - Yandex really did appear at the same time as the mass Internet, when access to the network ceased to be the preserve of a select group of technical specialists. But the word "Yandex" itself is artificial and has its own authors and its own history.
In 1993, Arkady Volozh, the future CEO of the future Yandex company, and Ilya Segalovich, the company's future technology director, developed, as it turned out later, the main technology - the search for unstructured information, taking into account the Russian language.
The development had to be named somehow. Ilya remembers how he wrote down different derivatives of words describing the meaning of technology in a column. It quickly became clear that search (“search”) in Russian sounds too dissonant and you can’t make a successful combination based on it. The word index was more appropriate. So yandex appeared in the list of names - yet another indexer ("another indexer" or Language index). Both Ilya and Arkady liked the option - it is easy to pronounce, easy to write. In addition, Arkady suggested the letter "I" in the name - specifically Russian - Russian and leave it for clarity. So the word "Yandex" was invented. And the program file, respectively, was called yandex.exe.
In 1996, when for the first time search was offered to the general public as a technology, and not as part of a content product (before that, there were the International Classifier of Inventions and the Bible Computer Reference), the line of programs was called Yandex and this name was explained as Language iNDEX. The first programs in the line were Yandex.Site (search on one of your own sites - this product is now called Yandex.Server) and Yandex.Dict (morphological prefix for AltaVista, the only search engine that at that time knew how to somehow work with Cyrillic) .
But, of course, the word "Yandex" has become widespread since September 1997, after the launch of the search engine www.yandex.ru. Since then, users of the system have been offering us their interpretations. For example, Tyoma Lebedev, preparing to draw the first version home page Yandex website, said: “Ah, I understand, if the first “I” in the word index is translated into Russian, it will be “I”, that is, this will be “Yandex”. The authors honestly admitted that they did not think about it, but - a good interpretation, is accepted. Then someone on the Web suggested another option, seeing the two sides of the Internet, INdex and YANDEX. This word has already appeared derivatives, for example, Yandex employees are often called "Yandexoids" and less often - "Yandexians".
Search "Yandex".
Yandex search allows you to search the Runet, Uanet, and Kaznet (since October 14, 2009) for documents in Russian, Ukrainian, Belarusian, Romanian, English, German and French, taking into account the morphology of Russian and English and proximity of words in a sentence. Since the beginning of 2006, Yandex search has been installed on the Mail.ru portal.
In addition to web pages in HTML format, Yandex indexes documents in PDF (Adobe Acrobat), Rich Text Format (RTF), the binary formats of Microsoft Word, Microsoft Excel and Microsoft PowerPoint, SWF (Macromedia Flash), and RSS (blogs and forums).
A distinctive feature of Yandex is the ability to fine-tune the search query. This is implemented using a flexible query language. For example, for the exclusion operation you can specify the scope: the query A ~~ B will find documents (pages) in which A is present but B is not, while the query A ~ B will find documents in which the word B does not occur in the same sentence as the word A. Similarly, the & operator looks for combinations of keywords within a sentence, while && looks for them within the whole document.
The ! operator disables morphology for a specific word, and the !! operator lets you specify the normal form, which helps to get around some problems associated with homonymy. For example, the query !!Ivanov will find inflected forms whose normal form is Ivanov, but not similar-looking words with a different normal form.
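The difference between sentence scope and document scope described above can be illustrated with a small sketch. It is not Yandex's implementation: the function names are hypothetical, sentences are split naively on ".", "!" and "?", morphology is ignored, and the reading of A ~ B as "some sentence contains A without B" is an assumption.

```python
import re

def sentences(text):
    """Split text into one lowercase word set per sentence (naive split)."""
    parts = re.split(r"[.!?]+", text.lower())
    return [set(re.findall(r"\w+", p)) for p in parts if p.strip()]

def all_words(text):
    """All words of the document, regardless of sentence boundaries."""
    return set().union(*sentences(text))

def and_in_sentence(text, a, b):
    """A & B: both words occur in the same sentence."""
    return any(a in s and b in s for s in sentences(text))

def and_in_document(text, a, b):
    """A && B: both words occur somewhere in the document."""
    words = all_words(text)
    return a in words and b in words

def not_in_sentence(text, a, b):
    """A ~ B (assumed reading): some sentence contains A but not B."""
    return any(a in s and b not in s for s in sentences(text))

def not_in_document(text, a, b):
    """A ~~ B: the document contains A and does not contain B at all."""
    words = all_words(text)
    return a in words and b not in words

doc = "Cats chase mice. Dogs sleep."
print(and_in_sentence(doc, "cats", "dogs"), and_in_document(doc, "cats", "dogs"))
print(not_in_sentence(doc, "cats", "dogs"), not_in_document(doc, "cats", "dogs"))
```

In the sample document the two words never share a sentence, so the sentence-scoped and document-scoped operators give opposite answers.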
By default, Yandex displays 10 links on each results page; in the search results settings, you can increase the page size to 20, 30, or 50 found documents. Sometimes the order of the sites on these pages may differ, since the databases for these results are not updated at the same time.
If there are a lot of links found for the query, the results page suggests limiting the search range - by region (that is, by IP range) or by date. If nothing is found for any word or words, it is proposed to replace it / them with similar ones (since the proposed options depend on the frequency of finding similar words, funny situations sometimes arise). Also, it is proposed to correct the words typed in the wrong keyboard layout.
From time to time, the Yandex algorithms responsible for the relevance of search results change, which leads to changes in the results of search queries. The last officially announced changes were in March 2004, April 2005 and January 2007; according to unofficial information, there have been many more (for example, the most recent in August-September 2007).
In particular, these changes are directed against search spam, which leads to irrelevant results for some queries (less often, for entire families of queries). Search spam that is not filtered out automatically is countered by semi-automatic and manual moderation of the results (with the help of the so-called "white hat optimizers"), as well as by outright refusal to index "malicious" sites.
Owners, management and performance indicators.
More than 30% of the company, according to its own data, belongs to the investment funds ru-Net Holdings and Baring Vostok Capital Partners, 15% - to the Tiger Technologies fund, about 30% - to the founders of the company and 20% - to managers and other minority shareholders.
In mid-September 2009, it became known that the parent company of Yandex, the Dutch company Yandex N.V., issued a priority share, which was transferred to Sberbank for a symbolic 1 euro. The only right that a share gives is to veto the sale of more than 25% of the company's shares.
Management: Arkady Volozh - General Director, Ilya Segalovich - Technical Director, Elena Kolmanovskaya - Editor-in-Chief, Alexei Tretyakov - Commercial Director, Svetlana Kondrashova - Advertising Director.
All Yandex services.
Information retrieval:
Search and ya.ru
Directory - a directory of websites sorted by citation index. It is replenished manually by catalog editors, there is a possibility of paid registration.
News - The top news of the day, sourced from the mainstream media featured on the Internet. It is possible to search by news, as well as subscribe to news for a given search query.
Yandex.XML - using this service, you can make automatic search queries to Yandex in xml format.
Search on blogs and forums - search for resources that have an RSS-representation, as well as a rating of current queries, popular categories and news.
Market - search for offers for the sale of goods and services, selection of models.
"Meditative" search is the only search service in the world that has a "Search" button, but no search bar.
Dictionaries - encyclopedias, reference books, translation dictionaries.
Pictures - image search.
Video - video search.
Maps - maps of Europe and Russia, maps of major cities of the Russian Federation (up to the house), search on the map, as well as the ability to "wander" through the streets of some cities.
Addresses - search for contact information by the names of firms and organizations.
Poster - information about available events: cinema, theater, concerts, sports, clubs, etc.
Weather - weather forecast.
TV program - programs of central, regional and satellite TV channels.
Timetables - timetables for trains and planes.
Personalized:
Yandex.Video - video hosting and video search.
Mail - email.
Ya.ru is a blogging service.
Yandex.Fotki - photo hosting.
Spam defense - spam filtering.
People - free hosting for personal web pages, as well as a file storage service.
Yandex money - payment system, which allows you to pay for goods and services on the Internet.
Bookmarks is a bookmark storage system integrated with Yandex.Bar.
Subscriptions - subscription to news.
Feed - online RSS reader
Yandex.Direct is a system for placing contextual advertising with pay per click.
The Cup is a regular Internet search competition.
Cities - Internet indices of Russian cities.
Tariff - search by tariffs of Internet providers.
Postcards
Spring - automatic generation of philosophical essays.
Internet - measures the speed of the Internet connection.
Mirror - A mirror of major Linux OS distributions, as well as FreeBSD and other projects.
Yandex.Local Network - makes it possible to use all Yandex services at the local rather than the federal rate.
Metrica - allows you to measure traffic, analyze user behavior and evaluate the effectiveness of advertising campaigns.
Software products:
Spam filter Spamodefense for corporate use (paid).
Yandex Desktop Search, a program for searching files on a computer.
Ya.Online, an instant messaging program based on Jabber. It also lets you receive notifications about new messages in Yandex.Mail and about new events on the Odnoklassniki.ru and VKontakte sites.
The Punto Switcher program is an automatic layout switcher.
Widgets for Mac OS X operating systems and Windows Vista, as well as for Opera browser: Search, Traffic, Clock, News.
Yandex ICQ - a special version of the ICQ client with symbols and integration of some services from Yandex.
Interesting facts.
1) The average length of a query in Yandex.Ru now is 2.7 words. In 1997, it was 1.2 words, when search engine users were accustomed to telegraphic style.
2) Yandex appeared before www.yandex.ru. The word Yandex was invented in 1993, and it was publicly uttered in 1996 and then meant not a company or a search engine, but a search technology on its own server and a morphological prefix to the Altavista.com search engine.
3) www.yandex.ru was launched to demonstrate the capabilities of Yandex technology, no one thought about making money on advertising.
4) The slogan “There is everything” was invented in 2000. In the same year, Yandex launched the first advertisement for the website on Russian television.
5) According to Yandex itself, about 80 percent of its audience is from Russia, about 3 percent from Europe, and just over 1 percent from the United States.
6) Some of the Yandex technical support staff operates under the collective pseudonym "Platon Shchukin".
Conclusion.
So now we have full information about Yandex. We know who manages it, how it works from the inside, what is the history of the company's development and much more. Now we can easily understand why Yandex is the leader in the Russian and global markets. I think the main reason for the success of Yandex is that the search engine copes well with the complexities of the Russian language. That is why search engines that were developed for English cannot index and rank Russian-language documents as well. The second advantage I see is the creative, friendly, cheerful slogans with which Yandex attracts users to use its services. Thematic pictures that Yandex places near its search line are much more accessible for a Russian user.
To search the index, the user must formulate a query and send it to the search engine. The query can be very simple, but it must consist of at least one word. To build a more complex query, you need to use Boolean operators, which let you refine and expand the search conditions.
The most commonly used Boolean operators are:
- AND - all expressions connected by the "AND" operator must be present on the searched pages or documents. Some search engines use the "+" operator instead of the AND word.
- OR - at least one of the expressions connected by the "OR" operator must be present on the searched pages or documents.
- NOT - the expression or expressions following the "NOT" operator must not appear on the searched pages or documents. Some search engines use the "-" operator instead of the word NOT.
- FOLLOWED BY - one of the expressions must immediately follow the other.
- NEAR - one of the expressions must be at a distance from the other, no more than the specified number of words.
- Quotes - Quoted words are treated as a phrase to be found in a document or file.
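The operators listed above can be sketched over a small positional index. This is a minimal illustration, not the implementation of any particular search engine: the sample documents and all function names are made up for the example, and words are matched literally after lowercasing.

```python
from collections import defaultdict

# Hypothetical sample collection: document id -> text.
DOCS = {
    1: "fresh fish for sale",
    2: "flower bed design",
    3: "bed and breakfast near the river",
}

def build_index(docs):
    """Build a positional index: word -> {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

INDEX = build_index(DOCS)

def term(word):
    """Set of documents that contain the word."""
    return set(INDEX.get(word, {}))

def op_and(a, b):
    return a & b          # both expressions present

def op_or(a, b):
    return a | b          # at least one expression present

def op_not(a, excluded):
    return a - excluded   # first expression present, second absent

def near(w1, w2, k):
    """Documents where w1 and w2 occur at most k words apart."""
    hits = set()
    for doc_id in term(w1) & term(w2):
        p1, p2 = INDEX[w1][doc_id], INDEX[w2][doc_id]
        if any(abs(a - b) <= k for a in p1 for b in p2):
            hits.add(doc_id)
    return hits

def followed_by(w1, w2):
    """Documents where w2 comes immediately after w1 (also a two-word phrase)."""
    hits = set()
    for doc_id in term(w1) & term(w2):
        p2 = set(INDEX[w2][doc_id])
        if any(a + 1 in p2 for a in INDEX[w1][doc_id]):
            hits.add(doc_id)
    return hits

print(op_not(term("bed"), term("flower")))  # "bed NOT flower"
print(followed_by("flower", "bed"))         # the phrase "flower bed"
```

Storing word positions, rather than only document ids, is what makes FOLLOWED BY, NEAR and quoted phrases possible; plain AND, OR and NOT need only the sets of document ids.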
Prospects for the development of search engines
A search specified with Boolean operators is literal - the engine looks for words or phrases exactly as they are entered. This can cause problems when the entered words are ambiguous. For example, the English word "bed" can mean a piece of furniture, a flower bed, a place where fish spawn, and much more. If the user is interested in only one of these meanings, pages that use the word in its other meanings are of no use to them. It is possible to build a literal search query that cuts off the unwanted meanings, but it would be better if the search engine itself could help.
One alternative is conceptual search. Part of this search involves statistical analysis of pages containing the words or phrases entered by the user in order to find other pages that might interest that user. Clearly, conceptual search needs to store more information about each page, and each query requires more computation. Many development teams are currently working on improving the quality and performance of search engines of this type. Other researchers have focused on a different area, called natural-language queries.
The idea behind natural-language queries is that the user formulates the query just as they would ask a person sitting next to them - without having to keep track of Boolean operators or complex query structures. The most popular current site with natural-language queries is AskJeeves.com, which analyzes the query to identify keywords and then uses them to search the index of sites built by this search engine. The site handles only simple queries, but developers are competing intensely to build a natural-language search engine capable of handling very complex ones.
A variety of technologies and methods created over the years of development of the theory and practice of information retrieval find application in modern information systems. Along with the classic library information retrieval systems (IPS), which continue to improve, intensive development is taking place in the field of global Internet IPS, which have become the main driving force of modern information retrieval technologies. The gigantic volume of available information resources requires scalable search algorithms. Hypertext allows the use of fundamentally new search models based on semantic analysis of document collections. The high rate at which pages are updated, their unrestricted placement and the lack of any guarantee of permanent access make it necessary to constantly re-index the relevant information resources.
Finally, the heterogeneous makeup of the user base, many of whom have no experience of working with a search engine, forces us to look for effective ways of formulating queries that work with minimal initial information.
6.1. Dictionary information retrieval systems
Dictionary IPS are by far the fastest and most efficient search engines and are the most widely used on the Internet. In a dictionary IPS, the necessary information is searched for by keywords. Search results are produced by a search algorithm working with the dictionary and with a query that the user composes in the information retrieval language (ISL).
The structure of a dictionary IPS (Figure 13) consists of the following components: document viewer, user interface, search engine, search image database, and indexing agent.
The information array includes information resources potentially available to the user. This includes text and graphic documents, multimedia information, etc. For the global IPS, this is the entire Internet, where all documents are characterized by a unique URL (URL - Uniform Resource Locator).
The search engine interface defines the way the user interacts with the IPS. This includes the rules for generating queries, the mechanism for viewing search results, etc. The interface of Internet search engines is usually implemented in a web browser environment. Appropriate software is used to work with sound and video information.
The main function of the search engine is to implement the adopted search model. First, the user's query, composed in the ISL, is translated according to established rules into a formal query. Then, as the search algorithm runs, the query is compared with the search images of documents from the database. Based on the results of the comparison, the final list of found documents is formed. It usually contains the name, size, creation date and a brief annotation of each document, a link to it, and the value of the similarity measure between the document and the query. The list is then ranked (ordered by some criterion, usually by the value of formal relevance).
Fig.13. The structure of the dictionary IPS.
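The comparison and ranking steps can be illustrated with a small sketch. The text does not say which similarity measure is used, so cosine similarity between word-count vectors is assumed here purely for illustration; the function names and the sample "search images" are hypothetical.

```python
import math
from collections import Counter

def search_image(text):
    """A very simple search image: word -> number of occurrences."""
    return Counter(text.lower().split())

def similarity(query_image, doc_image):
    """Cosine similarity between two word-count vectors (assumed measure)."""
    dot = sum(query_image[t] * doc_image[t] for t in query_image)
    norm_q = math.sqrt(sum(v * v for v in query_image.values()))
    norm_d = math.sqrt(sum(v * v for v in doc_image.values()))
    if norm_q == 0 or norm_d == 0:
        return 0.0
    return dot / (norm_q * norm_d)

def search(query, images):
    """Compare the query with every stored image and rank the matches."""
    q = search_image(query)
    scored = [(similarity(q, img), url) for url, img in images.items()]
    matches = [(score, url) for score, url in scored if score > 0]
    return sorted(matches, reverse=True)  # highest similarity first

# Hypothetical database of search images keyed by document address.
IMAGES = {
    "a": search_image("yandex search engine"),
    "b": search_image("search"),
    "c": search_image("weather forecast"),
}

for score, url in search("search engine", IMAGES):
    print(url, round(score, 3))
```

Documents with zero similarity are left out of the list, and the remaining ones are ordered by the similarity value, which plays the role of formal relevance here.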
The database of search images of documents is designed to store descriptions of indexed documents. The structure of a typical IPS vocabulary database is described in detail in Part 1 of the Guidelines.
The indexing agent indexes the available documents in order to compile their search images. In local systems, this operation is usually carried out once: after the array of documents has been formed, all the information is indexed and the search images are entered into the database. In the dynamic, decentralized information array of the Internet, a different approach is used. A special robot program, called a spider or crawler, continuously traverses the network. It moves between documents by following the hyperlinks they contain. The rate at which the information in the search engine's database is updated depends directly on the rate at which the network is crawled. For example, a powerful indexing robot can traverse the entire Internet in a few weeks. With each new crawl cycle, the database is updated and old invalid addresses are deleted.
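A crawl cycle of this kind can be sketched as a breadth-first traversal of hyperlinks. To keep the example self-contained, the network is replaced by a small in-memory dictionary; the addresses, the WEB table and the function names are all hypothetical, and a real robot would fetch pages over HTTP and respect site restrictions.

```python
from collections import deque

# Hypothetical miniature "web": address -> list of links found on that page.
# An address missing from the table stands for a page that no longer exists.
WEB = {
    "a.example/": ["a.example/news", "b.example/"],
    "a.example/news": ["a.example/"],
    "b.example/": ["c.example/gone"],
}

def crawl(seed, fetch_links):
    """Breadth-first traversal following hyperlinks from the seed address."""
    seen = {seed}
    queue = deque([seed])
    visited, dead = [], set()
    while queue:
        url = queue.popleft()
        links = fetch_links(url)
        if links is None:          # page could not be retrieved
            dead.add(url)
            continue
        visited.append(url)
        for link in links:
            if link not in seen:   # never queue the same address twice
                seen.add(link)
                queue.append(link)
    return visited, dead

def refresh(database, seed):
    """One crawl cycle: (re)index reachable pages, delete invalid addresses."""
    visited, dead = crawl(seed, WEB.get)
    for url in visited:
        database[url] = "indexed"
    for url in dead:
        database.pop(url, None)
    return database

db = {"c.example/gone": "indexed"}   # left over from an earlier cycle
print(sorted(refresh(db, "a.example/")))
```

The set of already seen addresses is what keeps the robot from looping forever on pages that link to each other.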
Some documents are closed to search engines. This is information that is reached not through a link but on request from a form. Intelligent methods for scanning this hidden part of the Internet are currently being developed, but they have not yet become widespread.
To index hypertext documents, agent programs use the following sources: hypertext links (href), titles (title), headings (H1, H2, etc.), annotations, keyword lists (keywords), and image captions. URLs are used to index non-text information (for example, files transferred via the FTP protocol).
The possibilities of semi-automatic or manual indexing are also used.
In the first case, administrators leave messages about their documents, which the indexing agent processes after some time; in the second, administrators enter the necessary information into the IPS database on their own.
An increasing number of IPSs are performing full-text indexing. In this case, the entire text of the document is used to compile the search image. Formatting, links, etc. in this case become an additional factor affecting the significance of a particular term. The title term will receive more weight than the figure caption term.
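The idea that formatting changes the weight of a term can be shown with a short sketch. The field names and the numeric weights below are assumptions chosen for illustration only; the text does not specify how real systems weight each part of a document.

```python
from collections import defaultdict

# Assumed weights per document field; real systems tune these differently.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0, "caption": 0.5}

def index_document(fields):
    """Build a weighted search image from the full text of a document.

    `fields` maps a field name (title, heading, body, caption) to its text.
    Every occurrence of a term adds the weight of the field it occurs in.
    """
    weights = defaultdict(float)
    for field, text in fields.items():
        field_weight = FIELD_WEIGHTS.get(field, 1.0)
        for term in text.lower().split():
            weights[term] += field_weight
    return dict(weights)

image = index_document({
    "title": "Yandex search",
    "body": "search engine history",
    "caption": "history chart",
})
print(image)
```

Here a term that appears only in the title ends up with a higher weight than a term that appears only in a figure caption, which matches the behavior described above.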
Modern large IPS must process hundreds of queries per second. Any delay can therefore lead to an outflow of users and, as a result, to the unpopularity of the system and commercial failure. From an architectural point of view, such IPS are implemented as distributed computing systems consisting of hundreds of computers located around the world. The search algorithms and program code are highly optimized.
In IPS with a large document base, two techniques are used to speed up their work: separation and pruning.
Separation consists of dividing the database into a part that is known to be more relevant and a part that is less relevant. First, the IPS searches for documents in the first part of the database. If no documents, or not enough documents, are found, the search continues in the second part.
With pruning, processing of the query is terminated automatically as soon as a sufficient number of relevant documents have been found.
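Both techniques can be combined in one small sketch. It assumes that the database has already been split into a "primary" and a "secondary" part and that a document matches when it contains every query word; the function names, the sample data and the `enough` parameter are hypothetical.

```python
def scan(part, query_terms, limit):
    """Scan one part of the database, stopping once `limit` matches are found."""
    found = []
    for doc_id, text in part:
        words = text.lower().split()
        if all(term in words for term in query_terms):
            found.append(doc_id)
            if len(found) >= limit:   # pruning: enough documents, stop early
                break
    return found

def tiered_search(primary, secondary, query_terms, enough=2):
    """Separation: search the more relevant part first, then the rest if needed."""
    results = scan(primary, query_terms, enough)
    if len(results) < enough:
        results += scan(secondary, query_terms, enough - len(results))
    return results

# Hypothetical two-part database of (document id, text) pairs.
PRIMARY = [(1, "yandex search"), (2, "weather"), (3, "search engine")]
SECONDARY = [(4, "search history"), (5, "search tips")]

print(tiered_search(PRIMARY, SECONDARY, ["search"], enough=2))
print(tiered_search(PRIMARY, SECONDARY, ["search"], enough=3))
```

With `enough=2` the secondary part is never touched, because the primary part already yields two matches; with `enough=3` the search spills over into the secondary part and stops after the first match there.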
Threshold search models are also widely used; they define threshold values for the characteristics of the documents returned to the user. For example, the relevance of returned documents is usually bounded below by some threshold value: all documents whose relevance is at least this threshold are offered to the user.
In the case of ranking search results by date, the threshold values determine the time interval for the date the documents were modified. For example, the IPS can automatically cut off documents that have not changed in the last three years.
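A threshold model of this kind amounts to a filter applied before the results are shown. The sketch below is illustrative only: the document records, the field names and the particular threshold values are assumptions, not values taken from any real system.

```python
from datetime import date, timedelta

def apply_thresholds(docs, min_relevance, max_age_days, today):
    """Keep documents that pass both the relevance and the date threshold."""
    cutoff = today - timedelta(days=max_age_days)
    kept = [
        d for d in docs
        if d["relevance"] >= min_relevance and d["modified"] >= cutoff
    ]
    # Present the survivors in order of decreasing relevance.
    return sorted(kept, key=lambda d: d["relevance"], reverse=True)

# Hypothetical search results with a relevance value and a modification date.
FOUND = [
    {"url": "a", "relevance": 0.9, "modified": date(2009, 6, 1)},
    {"url": "b", "relevance": 0.2, "modified": date(2009, 6, 1)},  # too weak
    {"url": "c", "relevance": 0.8, "modified": date(2005, 1, 1)},  # too old
    {"url": "d", "relevance": 0.6, "modified": date(2008, 3, 1)},
]

shown = apply_thresholds(FOUND, min_relevance=0.5, max_age_days=3 * 365,
                         today=date(2010, 1, 1))
print([d["url"] for d in shown])
```

Passing `today` explicitly, instead of reading the system clock inside the function, keeps the example reproducible.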
The main advantage of dictionary-type IPS is its almost complete automation. The system independently analyzes search resources, compiles and stores their descriptions, and searches among these descriptions. The wide coverage of Internet resources is also one of the advantages of such systems. The size of the database makes dictionary IPS particularly useful for exhaustive searches, complex queries, or for localizing obscure information.
At the same time, the huge number of documents in the system's database often leads to too many documents being found. This makes it difficult for most users to analyze the information found and makes a quick search impossible. Automatic indexing methods cannot take into account the specifics of particular documents, and the number of non-pertinent documents among those found by such a system is often large.
Another disadvantage of the dictionary IPS is the need to formulate queries to the system in a special language. Although there is a trend towards the convergence of ISL with natural languages, today the user must have certain skills in formulating queries.