This article is from the WeChat public account Teng Yun (ID: tenyun700). Author: Yi Shu Yun (Institute for the future Law of Renmin University of China). Original title: “Is Web Crawling Legal? China and the United States Have Different Views”. Title picture: Oriental IC.

As the title suggests, this issue is becoming increasingly important.

At the legal level, we must protect personal information without hindering the flow and use of data. How can this be done?

Is web crawling legal?

A platform uses a “crawler tool” to browse the web content of another platform and grab the information it finds. Behavior with these characteristics is called “web crawling”.

The verb “crawl” has gradually become an everyday term. When people speak of “crawling content” or “crawling data”, nearly everyone understands roughly what is meant.

Data has long been thought of as the oil of this age, and in many cases it is even scarcer than oil. “Stealing” oil is clearly illegal, but is “crawling” data illegal?

In recent years, there have been more and more legal disputes caused by “web crawlers”.

The first question to answer is: who owns the crawled data? Only once ownership is settled can the question of authorization be discussed on that basis.

The identification of data ownership is difficult.

Enterprise platform data often contains a large amount of personal data, so platform data can be considered as personally owned, platform owned, shared between individuals and platforms, or public data in the Internet space.

Neither China nor the United States, the two major Internet nations, has direct and clear legislation on data ownership, and the academic community has not reached a broad consensus.

At the practical level, however, parties often bypass this dispute altogether. They start from the concrete facts and seek a solution within the scope that existing laws clearly define and protect. This is an effective way to obtain judicial relief.

In the United States, courts regulate data crawlers on four main legal grounds:

  • Trespass to chattels (unlawful interference with private property);

  • Breach of contract;

  • Copyright infringement;

  • Violations of the Computer Fraud and Abuse Act (CFAA).

    Among them, the CFAA has been widely used in practice in recent years.

    This statute creates civil and criminal liability for anyone who “intentionally accesses a computer without authorization or exceeds authorized access” and thereby obtains information from any protected computer.

    The U.S. Supreme Court further explained that, under the CFAA, two types of illegal access to a protected computer information system constitute a crime:

    • Access without authorization;

    • Access that exceeds authorization, even though some access was authorized.

      In China, when a court finds a web crawler’s behavior illegal, it relies on the “legitimate rights and interests of operators” protected by Article 2 of the Anti-Unfair Competition Law, reasoning that the unauthorized crawler illegally obtained the data operator’s user data in large quantities.

      However, courts have only characterized platform data or data products as a “comparative advantage of the company in competition” and protected its exclusive interests on that basis; that is, they have not recognized enterprises’ property rights in data.

      In addition, a series of Chinese laws, including the “Anti-Unfair Competition Law”, “Labor Law”, “Contract Law”, “Company Law”, “Sino-Foreign Joint Venture Law”, and “Criminal Law”, contain provisions on trade secrets, so enterprises can also protect some data through these channels.

      Against this background, this article reviews typical cases from China and the United States to clarify two key factors that affect the legality of data crawling: data attributes and authorization mode. It aims to offer enterprises suggestions for avoiding legal risk in their practices.

      01. The first factor affecting legality: data attributes

      Question 1: Is it “public data”?

      Whether the data is public and accessible is an important factor affecting the legitimacy of crawling behavior.

      How is the public availability of data defined?

      In general, data that the data controller has protected by technical means is non-public data. For example, in Facebook v. Power Ventures, user data protected by account passwords was clearly identified as non-public data.

      For non-public data, judicial practice in China and the United States takes a similar position: crawling such data requires the authorization of the data controller.

      Interestingly, for data that users actively choose to make public but that the data controller has not authorized others to crawl, the attitude of US courts has shifted from strict to loose, while Chinese courts still tend to require authorization from both users and the platform.

      In 2000, Bidder’s Edge crawled data from the eBay website. eBay filed a lawsuit in the U.S. District Court for the Northern District of California on grounds including violation of the Robots agreement, trespass to chattels, computer fraud and abuse, and unfair competition.

      The court upheld the trespass claim based on the following points:

      • eBay’s servers are private property;

      • The access eBay grants to the public is conditional, and eBay generally does not allow crawler robots to access its site;

      • eBay had explicitly informed Bidder’s Edge that it was not allowed to crawl its site.

      The court reasoned that the defendant’s unauthorized interference with the plaintiff’s possessory interest in its computer system directly caused harm to the plaintiff.

        In this case, the court sidestepped the issue of data ownership. It found that Bidder’s Edge had committed trespass to chattels on the ground that the servers were private property, which indirectly acknowledges that crawling public data requires the platform’s authorization.

        In hiQ Labs, Inc. v. LinkedIn Corporation (2017, hereinafter the “LinkedIn case”), however, the situation changed significantly.

        Let’s first look at the basic facts of this case.

        hiQ Labs (hereinafter referred to as “hiQ”) is a data analysis company that provides employee assessment services for employers. It uses automated robots to grab users’ public profile information, including names, titles, work experience, and skills, from LinkedIn, a professional social networking site with more than 500 million users, and then processes these data through algorithms to sell analysis results to customers. This behavior continued for five years.

        May 2017

        LinkedIn sent hiQ a warning letter asking it to stop its unauthorized access and data crawling, and put technical measures in place to block hiQ from crawling. In the letter, LinkedIn stated that if hiQ did not stop crawling, it would violate a series of federal and state laws, in particular the CFAA.

        June 2017

        hiQ filed a lawsuit in the U.S. District Court for the Northern District of California, claiming that LinkedIn’s actions violated the free speech provisions of the California Constitution, violated the doctrine of promissory estoppel, and constituted unfair competition and unfair business practices under section 17200 of the California Business and Professions Code.

        hiQ then moved for a preliminary injunction against LinkedIn’s actions.

        August 2017

        The U.S. District Court for the Northern District of California granted hiQ’s motion, ruling that LinkedIn must not prevent hiQ from accessing, copying, and using public information on its website, and that while the preliminary injunction is in force LinkedIn must withdraw its legal notices prohibiting hiQ’s use and refrain from sending new ones.

        LinkedIn appealed, but the appellate court upheld the injunction.

        The decision in the LinkedIn case is a landmark.

        LinkedIn claimed that hiQ’s continued crawling of its data violated the CFAA as an unauthorized intrusion into a protected computer system. The district judge considered the key issue to be whether, after LinkedIn issued a warning letter expressly prohibiting hiQ from accessing the data, hiQ’s continued capture of LinkedIn’s public data constituted access to a computer “without authorization” within the meaning of the CFAA.

        First, the judge rejected the two cases LinkedIn cited in support of its position: the Power Ventures case and the Nosal II case. The judge found both distinguishable: in those cases the data was not public but protected by a password verification system, so it could not be crawled without the authorization of the other company.

        Second, the judge wrote in the ruling: “CFAA must be interpreted in its historical context, keeping in mind the purpose of Congress.” That is, the CFAA predates the advent of the Internet and cannot directly answer the problems raised by modern technology.

        The ruling cited the statement of the United States Court of Appeals for the Ninth Circuit in United States v. Nosal (Nosal I):

        “The main purpose of the CFAA promulgated by Congress in 1984 is to address the growing problem of hacking.” It was not meant to make every unauthorized use of computer information system data a crime.

        In other words, if a website could revoke anyone’s authorization at any time and for any reason, and then invoke the CFAA to enforce that decision, a wide range of Internet users would be exposed to criminal and civil liability.

        The ruling then invoked the U.S. Supreme Court’s decision in Packingham v. North Carolina:

        In today’s society, social media sites have become the main source for most people to “get real-time information, seek employment, express and listen to opinions in cyberspace, and explore other areas of human thought and knowledge.”

        The court compared the Internet and social media sites as a whole to the “modern public square” and held that the norms the two share include “openness and accessibility to all visitors.”

        In addition, the defense of this case from the perspective of freedom of expression is also representative.

        hiQ hired Professor Laurence Tribe of Harvard Law School as a consultant. The professor argued that the right to access data and information is a free speech right: data is in essence a kind of speech, and speech is in essence meant to circulate and be shared, so it has public attributes. Crawling public data therefore does not require authorization from the platform or from individuals.

        Of course, the decision also took into account factors beyond the data itself. For example, hiQ’s business relies entirely on the use of LinkedIn’s public data. That use had not harmed LinkedIn, but cutting it off would deal hiQ a devastating blow.

        The reasoning of the ruling rests mainly on the standard for a preliminary injunction, which considers four factors: the likelihood of success on the merits, whether irreparable harm would result, the balance of hardships between the parties, and the public interest. From the perspective of irreparable harm and the balance of hardships, it was reasonable for the court to favor hiQ.

        Furthermore, it cannot be ignored that LinkedIn had tolerated hiQ’s crawling for five years, and that it blocked the crawling just as it announced a service similar to hiQ’s. LinkedIn’s refusal to let hiQ capture data was therefore alleged to be an abuse of market dominance aimed at excluding a competitor.

        Today, research on user privacy in the United States has developed into the theory of “contextual privacy”.

        According to the theory of contextual integrity proposed by Professor Helen Nissenbaum, the key is not to isolate information but to ensure the “contextual integrity” of the information.

        In a specific context, the flow of information should meet people’s expectations (making data public does not mean that users allow third parties to collect and use it for any purpose); specific information flows match specific contextual norms, and information shared in one context should not be shared in an environment that violates that context.

        Therefore, privacy and personal information protection laws must respect the context.

        What’s the situation in China?

        By comparison, although China also values the use of publicly available personal data, its current attitude remains conservative.

        The typical case in China is Sina Weibo v. Maimai, in which the court proposed the “triple authorization” principle of “user authorization + platform authorization + user authorization”.

        The case arose because Maimai exceeded the scope of its cooperation agreement to obtain and use the professional and educational information of Sina Weibo users and, without the consent of Weibo and its users, displayed the correspondence between Maimai users’ mobile phone contacts and Sina Weibo users.

        Weibo believes that Maimai does not fully respect the Developer Agreement, fails to respect users’ right to know and free choice, and reduces Weibo’s competitive advantage.

        After Sina sued, the court of first instance mainly invoked Article 2 of the “Anti-Unfair Competition Law” to judge the violation of the law. The content of the law is:

        “Operators in production and business activities shall follow the principles of voluntariness, equality, fairness, and good faith, and abide by laws and business ethics. Acts of unfair competition as referred to in this Law are acts by which operators, in production and business activities, violate the provisions of this Law, disrupt the order of market competition, and harm the legitimate rights and interests of other operators or consumers. The term ‘operators’ as used in this Law refers to natural persons, legal persons, and unincorporated organizations engaged in the production or operation of goods or the provision of services (goods referred to below include services).”

        The court of second instance upheld the first-instance result and further emphasized that Article 2 of the “Anti-Unfair Competition Law” should be applied to the Internet industry with greater judicial restraint. In addition to the three conditions set out in the retrial case (2009) Minshenzi No. 1065, the unfair competition dispute between Shandong Food Import and Export Corporation and Qingdao Shengke Dacheng Trading Co., Ltd., the following three conditions must also be met for it to apply:

        • The technical measures used in this competitive behavior have indeed harmed the interests of consumers, such as: limiting consumers’ right to choose freely, failing to protect consumers’ right to know, damaging consumers’ right to privacy, etc.;


        • This competitive behavior undermines the open, fair, and just order of market competition in the Internet environment, thereby causing vicious competition or the possibility of it;

        • Competitive behavior that uses new technologies or new business models on the Internet should first be presumed to be justified, and any lack of justification needs to be proved by evidence.

          Finally, the court made it clear that Sina Weibo’s massive user data is an important commercial resource, and that user information is the foundation and core on which social software builds an enterprise’s competitiveness. In implementing its open platform strategy, Sina Weibo provides user information to developer applications conditionally, adhering to the triple authorization principle of “user authorization” + “Sina authorization” + “user authorization”. The purpose is to protect user privacy while maintaining the enterprise’s own core competitive advantage.

          The “triple authorization” principle proposed by the court in Sina Weibo v. Maimai thus rests on protecting the competitive advantage of enterprises and prohibiting the crawling of non-public user data.

          It is worth noting that the US court’s decision in Facebook v. Power Ventures, based on the view that data belongs to both individuals and the platform, similarly established a dual authorization principle: the crawler must obtain authorization from individual users (who control their data and personal pages) and from the platform (which stores this data on its physical servers). After Facebook issued a termination notice, the user’s permission alone was not enough to constitute authorization.

          However, this concerns mainly private data.

          For public data, a data right is by nature a right to share. The basic premise of sharing is openness, and its core is trust. The essence of sharing is altruism, that is, the blending of data rights, data utilization, data protection, and data value.

          It is foreseeable that in the future domestic trial practice will also decide whether to apply or adjust this principle based on the specific case situation.

          Question 2: Original data or derived data?

          Whether the data is original data or derived data is also an important factor affecting the legitimacy of data crawlers.

          Xiong Qianfu proposed to distinguish between original data and derived data, and configure corresponding data rights systems according to different data legal relationships.

          Ownership of the original data belongs to the user, who has the rights to possess, use, benefit from, and dispose of it. Ownership of derived data, which rests on a “secondary” creation of data value, belongs to the “creator”, who likewise has the rights to possess, use, benefit from, and dispose of it.

          According to this logical inference, data on the platform created directly by users or left by user behavior belongs to users, and data processed and created by enterprises belongs to enterprises.

          In the United States, it is not illegal to simply crawl raw data that is publicly available on other enterprise platforms.

          The ruling in the LinkedIn case basically acknowledges that third-party companies can crawl open, original personal data on the enterprise platform with user authorization.

          Facebook v. Power Ventures also shows this.

          The basic facts are as follows. Power Ventures provides a social aggregation service that lets users log in to social platforms such as Facebook and LinkedIn at the same time. Users gave Power Ventures their Facebook login passwords so that it could capture the data in their Facebook accounts. When Facebook learned of this, it sent Power Ventures a cease-and-desist letter, and Power Ventures changed its IP address to continue its access.

          This is a typical case of using other people’s accounts to capture data. Does scraping data from online accounts constitute unlawful use?

          In its 2016 judgment, the Ninth Circuit did not accept the plaintiff’s reasoning on this point, but held that the defendant violated the CFAA by continuing to crawl the plaintiff’s pages after the plaintiff had explicitly withdrawn authorization.

          In short, scraping user data without the platform’s authorization is not by itself sufficient to constitute a violation. It can therefore at least be concluded that, from the perspective of data attributes, crawling original data on an enterprise platform is not illegal in the United States.

          In China, it is against the law to crawl raw data published by other enterprise platforms without the authorization of users and counterparties. The “triple authorization” principle established by Weibo v. Maimai clearly demonstrates this. As for the derived data that is more relevant to the company, it is even more self-evident.

          In China, it is absolutely illegal to scrape derived data without the authorization of the counterparty. It is unclear whether user authorization is required.

          The court decision in the case of Taobao v. Meijing is a typical case of judicial authorities protecting corporate derived data.

          The dispute concerned “Business Staff”, a retail e-commerce data product independently developed and operated by Taobao. It is a derived data product built by processing the trace information left by Taobao users’ browsing, searching, bookmarking, adding items to the cart, transactions, and other activities, and it is mainly used to give Taobao merchants data-driven business references.

          The defendant, Meijing, was alleged to have induced Taobao users who had subscribed to “Business Staff” to download its product and share their sub-accounts, and to have leased “Business Staff” sub-accounts on its platform for a commission.

          At the same time, the company organized its platform users to rent “Business Staff” sub-accounts and provided technical assistance that let them view “Business Staff” data through the lessor’s sub-account, for example by remotely logging in to the lessor’s computer, and profited from this.

          Taobao sued Meijing in the Hangzhou Internet Court, arguing that the company’s behavior substantially substituted for “Business Staff”, directly reduced its orders and sales, and greatly harmed Taobao’s economic interests; and that it maliciously undermined Taobao’s business model, severely disrupted the competitive order of the big data industry, and constituted unfair competition.

          One of the core points of the case was that the court held that Taobao has a property interest in the “Business Staff” data product, not a property right. Taobao argued that the data content provided by “Business Staff” is derived data produced by collecting massive raw data and analyzing and processing it in depth.

          The court agrees.

          The court held that “Business Staff” is a network big data product that differs from original network data. Although its data content also originates from network user information, the network operator has invested a large amount of intellectual labor in it, and after in-depth development and system integration the content finally presented to consumers is independent of network user information and original network data. It is derived data with no direct correspondence to either.

          The court stated that network operators should enjoy independent property interests in the big data products they develop. However, a property right is an absolute right: once granted, it imposes corresponding obligations on an unspecified majority of people, which calls for great caution. Therefore, the court did not go further and confirm a property right of Taobao in the data product.

          The property interests mentioned in the judgment are protected within the framework of the law against unfair competition. An enterprise can obtain legal protection only after its interests have been infringed and cannot obtain relief in advance. Compared with a property right, this kind of after-the-fact remedy is a passive model of protection.

          Other categories

          In some cases, the court directly took the existence of investment in the company’s data as a prerequisite to determine whether to protect it.

          For example, in the case of a technical contract dispute between Beijing Sunshine Data Corporation and Shanghai Bacai Data Information Co., Ltd., the court found that the plaintiff has invested a lot in the financial database of the dispute and should be protected.

          In the unfair competition dispute between Shanghai Ganglian E-commerce Co., Ltd. and Shanghai Zongheng Today Steel E-commerce Co., Ltd. and Shanghai Tuodi Electronic Commerce Co., Ltd., the Shanghai Second Intermediate People’s Court likewise ruled that the plaintiff, having invested a large amount of manpower, material resources, financial resources, and time in collecting and compiling steel price data, has legal rights in that information.

          This approach is very close to establishing a new exclusive property right for a company’s legally formed data, but it is still nominally limited to anti-unfair competition.

          02. The second factor affecting legality: authorization mode

          General prohibitions

          The general prohibitions that platforms adopt against third-party data crawlers mainly include the Robots agreement and ToS prohibitions. The former is not legally enforceable. The latter is: crawling the other company’s data in violation of a ToS prohibition may constitute a breach of contract and give rise to legal liability.

          The Robots agreement (also called the robots protocol or robots exclusion protocol) refers to a designated file, robots.txt, that the website owner creates to specify which directories of the website crawlers are not allowed to crawl, and places in the root directory of the website server. Well-behaved crawlers read the robots.txt file before crawling a website’s pages and do not download the pages and data that are off limits.

          In general, the websites that are crawled will give the Robots protocol instead of directly using technical means to prohibit access from an IP address. However, the Robots agreement is only a gentleman’s agreement and is not legally binding.
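
To illustrate how a well-behaved crawler might honor the Robots agreement described above, here is a minimal Python sketch using the standard library module urllib.robotparser. The robots.txt content, the user agent name example-bot, the example.com URLs, and the helper may_fetch are all hypothetical and chosen only for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content that a site owner might place in the
# root directory of the web server (assumption, not from any real site).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as an iterable of lines, so no
    # network request is needed for this sketch.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in (
        "https://example.com/listings/1",
        "https://example.com/private/data",
    ):
        verdict = "allowed" if may_fetch(ROBOTS_TXT, "example-bot", url) else "disallowed"
        print(url, verdict)
```

In practice a crawler would download the live file from the site's root (for example with RobotFileParser.set_url and read) before requesting any page, and skip every URL for which the check returns False.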

          A ToS prohibition, by contrast, has legal force. ToS stands for Terms of Service, that is, a terms of service agreement. It is similar to the end user license agreement (EULA) of licensed software. The difference between the two is that users bound by a ToS are not using an out-of-the-box software product but a service.

          From word processors and graphic design programs to advanced industry software and statistical software services, users encounter terms of service on all kinds of software. Quite a few network platforms use ToS prohibitions to warn against data crawling.

          For example, Craigslist, a large free online classifieds website, wrote in its ToS, the “Craigslist Terms of Use”: “You agree not to use robots, spiders, scripts, data extraction, crawlers or any automated or manual tools to copy or collect the content of this website”.

          If a platform uses its ToS to prohibit third-party platforms from crawling its data, those third parties have no right to crawl and download the data; doing so may constitute a breach of contract and give rise to legal liability.

          Cease-and-desist letters and IP barriers

          Once the crawled website finds that an IP address is accessing it in violation of the general prohibitions, it sends a cease-and-desist letter (a “stop letter”) and sets up IP barriers to block the relevant addresses. US courts have treated this as revoking the other party’s authorization to continue accessing the website, so a crawler that keeps crawling afterwards acts illegally. However, if the crawled data is public, the platform’s revocation of authorization has no effect.

          Craigslist v. 3Taps is a typical case.

          In this case, after the plaintiff Craigslist found that the defendant 3Taps was accessing its website abnormally, it sent 3Taps a cease-and-desist letter and blocked the relevant IP addresses. After receiving the letter, however, 3Taps used different IP addresses and proxy servers to hide its identity, bypass the IP barriers set up by Craigslist, and continue crawling the data.

          In this regard, the court held that Plaintiff Cr>

          The reason is that, on the one hand, data in the era of big data is of great significance to individuals and businesses, and to society as a whole; on the other hand, domestic and foreign legislation is out of touch with practice to varying degrees, and there is broad space for exploration.

          So far, the academic community has not proposed a convincing scheme for regulating crawler behavior. The US CFAA discussed above and the “Anti-Unfair Competition Law” that Chinese judges cite so frequently still do not address the current legal issues satisfactorily.

          By analyzing existing cases, it can be seen that the data attributes and authorization modes will obviously affect the legality of data crawlers in China and the United States. Based on public interest considerations, the United States believes that crawling publicly available raw data can be performed without authorization, while other types of data will be protected to varying degrees.

          In China, even publicly available raw data requires a “triple authorization” to legally crawl. As for non-public data and derived data, the strictness of legal protection is self-evident.

          In addition, other factors, such as an enterprise’s investment in its data, may also enter the court’s assessment of the legality of data scraping. As for the authorization mode, the widely used Robots agreement has no legal force, while ToS prohibitions do.

          When the data crawler violates these general prohibitions, the counterparty company often sends a stop letter and sets up IP barriers. This is a clear indication of the revocation of authorization in China and the United States. If the crawled data is of a type that requires enterprise authorization, the crawler must stop crawling after receiving such representations, otherwise it will assume legal responsibility.

          Further, if the cooperation is carried out through an Open API or similar methods, authorization is withdrawn when the cooperation ends. Finally, in the United States, a platform that knows of the crawling and does not prevent it may be found to have given implied permission, which lends the crawler legitimacy.

          In general, the protection of personal data in the United States has undergone a relatively long development process, and its attitude is gradually inclined to the public interest, while China’s “triple authorization” model reflects the careful protection of personal interests.

          In the future, we need to comprehensively consider factors such as data attributes, authorization modes, crawling methods, and the use of crawled data to build a legal system that balances the interests of all parties.
