What role do start-ups play on the fast track to the data-driven smart world?

Editor’s note: This article is from the WeChat public account Chenshan Capital (ID: chenshancapital), by Chenshan Capital.

Advances in technology have driven the full digitization of the economy and of everyday life, and the importance attached to data has never been higher. The idea that “data is an asset” is now widely accepted. As Viktor Mayer-Schönberger, co-author of Big Data: A Revolution That Will Transform How We Live, Work, and Think, put it: “Although data is not yet recorded on companies’ balance sheets, it is only a matter of time.”

In the past few years, we have begun to extract value from data in certain application scenarios through data science, machine learning, and artificial intelligence. These technologies are gradually spreading from the early adopters (BAT and startups) to the broader economy. How to store data, manage data, and mine its value has become a question that almost every company needs to think about.

We are on the fast track to a data-driven smart world (smart vehicles, smart businesses, smart products). Before the endgame is reached, technology will be disrupted again and again, and that is where start-ups come into play. We continue to focus on start-ups along the data-intelligence value chain, on what stage data and AI are at today, and on what will change in the future.

Below are the major trends in big data and AI in 2019 as summarized by Matt Turck (partner at FirstMark Capital). He walks through the major developments at the infrastructure, analytics, and application levels; we hope it gives you food for thought.

Original title: Major Trends in the 2019 Data & AI Landscape

Author: Matt Turck. Translator: Chen Pujiang, investment manager at Chenshan Capital.

The main development trends in big data and AI: infrastructure, analytics, and applications

Image source: pixabay

Trends at the infrastructure level

  1. The third wave? From Hadoop to cloud services to Kubernetes

  2. Data Governance, Data Cataloging, Data Lineage: The Importance of Data Management Is Growing Day by Day

  3. The rise of infrastructure dedicated to AI

Data infrastructure has been evolving rapidly for many years, and the pace has, if anything, been accelerating recently. The evolution has passed through three main stages: from Hadoop, to cloud services, to Kubernetes environments.

Hadoop, whose roots go back to Google’s October 2003 file system paper, can be regarded as the “original ancestor” of the big data field. It is a framework for distributed storage and processing of large amounts of data across networks of computers, and it played an absolutely central role in the explosive development of the data ecosystem.
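
As a purely local, single-process illustration of the MapReduce programming model that Hadoop popularized (real Hadoop distributes the map and reduce phases across a cluster and stores the data in HDFS), the classic word count looks like this:

```python
# A minimal local simulation of the MapReduce model:
# map each line to (word, 1) pairs, shuffle by key, then reduce by summing.
from collections import defaultdict

def mapper(line):
    return [(word, 1) for word in line.strip().split()]

def reducer(word, counts):
    return word, sum(counts)

def map_reduce(lines):
    shuffled = defaultdict(list)                    # the "shuffle" phase groups values by key
    for line in lines:
        for key, value in mapper(line):             # the "map" phase runs per input split
            shuffled[key].append(value)
    return [reducer(k, v) for k, v in shuffled.items()]  # the "reduce" phase aggregates per key

print(map_reduce(["big data is big", "data is an asset"]))
```

In a real Hadoop deployment the same mapper and reducer logic runs in parallel across the cluster, which is what made processing web-scale datasets practical.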

However, over the past few years, pronouncing Hadoop dead has become routine among industry observers, and the trend has only accelerated this year as Hadoop vendors have run into all kinds of trouble. At the time of writing, MapR appears to be on the verge of shutting down unless it finds a buyer. Cloudera and Hortonworks, which recently completed a $5.2 billion merger, had a rough day in June, with the stock plunging 40% on disappointing quarterly results. Cloudera has announced a number of cloud and hybrid products, but they are not yet generally available.

Hadoop faces mounting resistance, driven by competition from the cloud platforms. Hadoop was developed when the cloud was far less mature: most data lived on premises, network latency was the bottleneck, and it made sense to put data and compute in the same place. All of that has changed.

Even so, Hadoop is unlikely to disappear any time soon. Its development may slow, but the sheer scale of its deployment across enterprises gives it inertia and staying power for the next few years.

In any case, the shift to the cloud is clearly accelerating. Interestingly, judging from our conversations with Fortune 1000 executives, 2019 seems to be the year the shift really happens. In the past few years there was plenty of talk about the cloud, but actual workloads largely stayed on premises, especially in regulated industries. Many Fortune 1000 executives are now actively moving to cloud computing, in some cases traditional Microsoft shops moving to Azure.

As a result, although the cloud providers are already very large, they continue to grow rapidly. In 2018, AWS posted revenue of $25.7 billion, up 46.9% from $17.5 billion in 2017. Microsoft does not disclose Azure revenue separately, but it grew 73% year-on-year in the quarter ending March 2019. Although this is not a perfect comparison, AWS revenue grew 41% year-on-year in the same quarter.

As cloud adoption deepens, customers are starting to balk at the cost. In boardrooms around the world, executives have suddenly noticed a line item that is no longer small: the cloud bill. Cloud computing does provide agility, but it often comes with a high price tag, especially when customers take their eyes off the meter or cannot accurately predict their computing needs. For AWS customers such as Adobe and Capital One, cloud bills grew more than 60% in a single year from 2017 to 2018, to more than $200 million.

Cost, along with concerns about vendor lock-in, is accelerating the move to hybrid approaches that combine public clouds, private clouds, and on-premises deployments. Faced with many choices, companies will increasingly pick whichever tool best fits each need, optimizing for both performance and economics. As cloud providers differentiate more aggressively, companies are adopting multi-cloud strategies that play to each provider’s strengths. In some cases, the most economical option is to keep some workloads on premises (or even repatriate them), especially for stable, non-dynamic workloads.

Interestingly, the cloud providers are adapting to the reality that enterprise computing is moving into hybrid environments. AWS, for example, offers AWS Outposts, which lets customers run compute and storage on premises while integrating seamlessly with the rest of AWS in the cloud.

In this new multi-cloud and hybrid-cloud era, Kubernetes is undoubtedly the rising star. Kubernetes, an open source project released by Google in 2014 for managing containerized workloads and services, is enjoying the same kind of enthusiasm Hadoop did a few years ago, with 8,000 attendees at KubeCon and a steady stream of blog posts and podcasts. Many analysts believe Red Hat’s strong position in the Kubernetes world was a large part of the rationale for IBM’s massive $34 billion acquisition. The promise of Kubernetes is to help companies run workloads across environments, as enterprises’ hybrid setups come to include data centers, private clouds, and one or more public clouds.

As a framework for managing complex, mixed environments, Kubernetes is also becoming an increasingly attractive option for machine learning. On a single shared infrastructure serving multiple users, Kubernetes gives data scientists the flexibility to use whatever language, machine learning library, or framework they prefer without having to be infrastructure experts, and to train and scale models with relatively fast iteration and strong reproducibility. Kubeflow, a machine learning toolkit built for Kubernetes, is growing rapidly.
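
As a rough sketch of what running ML on Kubernetes can look like in practice (not Kubeflow itself), a containerized training run can be submitted as a Kubernetes Job through the official kubernetes Python client; the image name, command, and namespace below are illustrative assumptions, and a working kubeconfig is required:

```python
# Sketch: launch a containerized model-training run as a Kubernetes Job.
# "registry.example.com/train:latest" is a hypothetical image name.
from kubernetes import client, config

config.load_kube_config()            # use the local kubeconfig (in-cluster config is also possible)
batch_v1 = client.BatchV1Api()

job_manifest = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "train-churn-model"},
    "spec": {
        "template": {
            "spec": {
                "containers": [{
                    "name": "trainer",
                    "image": "registry.example.com/train:latest",
                    "command": ["python", "train.py", "--epochs", "10"],
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},  # request a GPU if available
                }],
                "restartPolicy": "Never",
            }
        },
        "backoffLimit": 2,
    },
}

batch_v1.create_namespaced_job(namespace="default", body=job_manifest)
```

The same pattern extends to serving and to multi-user setups, which is roughly what toolkits like Kubeflow package up behind a friendlier interface.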

Kubernetes for machine learning is still in its infancy, but interestingly, because data scientists may prefer the overall flexibility and control it offers, it may signal a shift away from managed cloud machine learning services. We may be entering a third paradigm shift in data science and machine learning infrastructure: from Hadoop (until roughly 2017), to cloud data services (2017-2019), to a world dominated by Kubernetes and next-generation data warehouses such as Snowflake.

The flip side of this evolution is an increase in complexity, and with it an opportunity for end-to-end platforms that abstract away and simplify much of the underlying cloud infrastructure, making this brave new world accessible to a much broader group of data scientists and analysts.

Serverless is an attempt at simplification from a different angle. In this execution model, users write and deploy code without worrying about the underlying infrastructure: the cloud provider handles all of the back-end services, and the customer pays based on actual usage. Serverless has clearly been an important emerging topic over the past few years, which is why we added it to this year’s Data & AI landscape. However, much work remains to apply the serverless model to machine learning and data science; companies such as Algorithmia and Iguazio/Nuclio are early entrants.
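
As a rough illustration of the serverless execution model applied to ML, an inference endpoint shrinks to a single stateless handler that the provider invokes and bills per request. The sketch below follows the AWS Lambda handler convention, with a stubbed-out model standing in for a real one:

```python
# Sketch of a serverless-style inference handler.
# load_model() is a hypothetical placeholder; in practice the model would be
# pulled from object storage or bundled with the deployment package.
import json

def load_model():
    class _StubModel:
        def predict(self, rows):
            return [0.0 for _ in rows]   # placeholder prediction
    return _StubModel()

_model = None  # cached across warm invocations of the same container

def handler(event, context):
    global _model
    if _model is None:
        _model = load_model()                          # load once per warm container
    features = json.loads(event["body"])["features"]   # request payload from the provider
    prediction = _model.predict([features])[0]
    return {"statusCode": 200, "body": json.dumps({"prediction": float(prediction)})}
```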

Another consequence of the growing mix of data environments is that companies need to increase their efforts to gain control of their data.

Today’s data environments are very complex: some data sits in data warehouses, some in data lakes, some in various other data stores, spread across on-premises deployments, private clouds, and public clouds. How do you find, manage, control, and track that data? Under a variety of related names, including data query, data governance, data cataloging, and data lineage, this whole area is becoming increasingly important and prominent.

  • Querying data across a hybrid environment is a challenge in itself, and the solutions align with the broader trend of separating storage and compute.

  • Data governance is another area quickly becoming a top priority for businesses. The general idea is to manage data and ensure its quality across the entire data lifecycle, covering validity, integrity, availability, consistency, and security. Notably, in early 2019 Collibra raised a $100 million round at a valuation of more than $1 billion.

  • Data catalogs are another increasingly important data-management tool. An effective data catalog is a dictionary of all of an enterprise’s data assets; it helps users (data scientists, data analysts, developers, and business users alike) discover and use data in a self-service way.

Finally, data lineage is perhaps the newest data-management category. Its purpose is to capture the “journey” of data across the enterprise: how data is collected, how it is modified, and how it is shared over its lifecycle. Growth in this area is being driven by many factors, including the rising importance of compliance, privacy, and ethics, as well as the need for reproducibility and transparency in machine learning pipelines and models.
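
Purely as an illustration, a lineage record boils down to where a dataset came from, what produced it, and what depends on it. A minimal home-grown representation might look like the following (the field names are assumptions, not any vendor’s schema):

```python
# Minimal sketch of a data lineage record: which sources a dataset was derived from,
# what transformation produced it, and which downstream consumers depend on it.
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class LineageRecord:
    dataset: str                       # e.g. "warehouse.daily_revenue"
    sources: List[str]                 # upstream tables, files, or streams
    transformation: str                # job, SQL, or pipeline step that produced it
    produced_at: datetime
    consumers: List[str] = field(default_factory=list)  # dashboards, models, exports

record = LineageRecord(
    dataset="warehouse.daily_revenue",
    sources=["lake.raw_orders", "crm.accounts"],
    transformation="spark_job:aggregate_revenue_v3",
    produced_at=datetime(2019, 6, 30),
    consumers=["bi.revenue_dashboard", "ml.churn_features"],
)
```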

The last key trend that has been accelerating this year is the emergence of AI-specific infrastructure.

The need to manage artificial intelligence pipelines and models has led to rapid growth in the MLOps (or AIOps) space. To reflect this new trend, we added two new boxes to this year’s landscape: one called Infrastructure (various early-stage startups, including Algorithmia, Spell, Weights & Biases, etc.) and one called Open Source (a variety of projects, usually also quite early, including Pachyderm, Seldon, Snorkel, MLeap, etc.).

ML engineers need to be able to run experiments and iterate quickly, with access to resources such as GPUs when needed. At our Data Driven NYC events, we have featured some of the early startups providing this kind of infrastructure, such as Spell, Comet, and Paperspace.

With the rise of GPU databases and the birth of a new generation of artificial intelligence chips (Graphcore, Cerebras, etc.), artificial intelligence has had a profound impact on infrastructure. Artificial intelligence is forcing us to rethink the nature of computing.

Trends at the analytics level

  1. Business intelligence (BI) is consolidating

  2. Enterprise AI platforms are a major trend

  3. Horizontal artificial intelligence is still very active

In the business intelligence arena, as mentioned earlier, the obvious trend over the past few months has been massive consolidation, including the acquisitions of Tableau, Looker, Zoomdata, and Clearstory, and the merger of SiSense and Periscope. With so many vendors offering data visualization and self-service analytics, some consolidation in BI was inevitable; every supplier, big or small, is under pressure to diversify and expand. For the cloud acquirers, these new product lines certainly add revenue, but more importantly they add capabilities that help the acquirers’ core platforms generate more revenue.

Will there be further consolidation in BI? Microsoft already has a strong position with Power BI, but once an entire market segment starts consolidating and every player feels compelled to act, the M&A market can take on a dynamic of its own. AWS may need a stronger offering, given that its QuickSight BI product is generally considered to be a bit behind.

While BI consolidates, data science and machine learning platforms continue to rise in popularity. Deploying ML/AI in the enterprise is a huge trend that is still in its early stages, and various players are scrambling to build platforms.

For most companies in this space, the explicit goal is to democratize ML/AI, that is, to let a larger user base and more companies benefit from it. The ongoing talent shortage remains the main bottleneck to ML/AI adoption. Different players, however, pursue different strategies.

One approach is AutoML, which automates the machine learning lifecycle end to end, including some of its most tedious parts. Depending on the product, AutoML can handle feature generation and engineering, algorithm selection, model training, deployment, and monitoring. DataRobot is the AutoML specialist; it raised a $100 million Series D in 2018 (and has reportedly raised more since).
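
The core idea is easy to sketch: automate the search over candidate algorithms and hyperparameters that a data scientist would otherwise run by hand. A minimal version using scikit-learn (illustrative only, and not how DataRobot or any specific product works) might look like this:

```python
# Minimal sketch of the AutoML idea: search over candidate algorithms and
# hyperparameters automatically instead of hand-tuning a single model.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The "clf" step is a placeholder that the search replaces with each candidate model.
pipeline = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])

# Each dict is one candidate algorithm plus its hyperparameter grid.
search_space = [
    {"clf": [LogisticRegression(max_iter=1000)], "clf__C": [0.1, 1.0, 10.0]},
    {"clf": [RandomForestClassifier()], "clf__n_estimators": [100, 300], "clf__max_depth": [None, 10]},
]

search = GridSearchCV(pipeline, search_space, cv=5, scoring="roc_auc")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Commercial AutoML products add the rest of the lifecycle on top of this core loop: automated feature engineering, deployment, and monitoring of the winning model.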

Other companies in the space, such as Dataiku, H2O, and RapidMiner, offer platforms that include AutoML but provide a broader range of capabilities. Dataiku, for example, has raised a $101 million Series C since 2018; its overall idea is to empower the entire data team (data scientists and data analysts alike) and make working with data across the whole lifecycle simple and even fun.

The cloud providers are of course very active as well, with Microsoft’s Azure Machine Learning Studio, Google’s Cloud AutoML, and AWS SageMaker. Despite the providers’ strength, these products are still relatively narrow in scope, often hard to use, and aimed at technically sophisticated users; they remain in their infancy. Amazon’s cloud machine learning platform SageMaker reportedly got off to a slow start in 2018, with only $11 million in sales to the commercial sector. Some cloud providers are also actively partnering with specialized third parties in the field: Microsoft participated in Databricks’ $250 million Series E, which may be a prelude to a future acquisition.

In addition to enterprise AI platforms, horizontal AI (computer vision, NLP, speech, etc.) continues to be incredibly vibrant. The main trends are as follows:

  • Significant improvements in NLP, driven especially by transfer learning (pre-training models on large amounts of data, then transferring and fine-tuning them for a specific problem), which lets models work well with far less task-specific data: e.g. ELMo, ULMFiT, and most importantly Google’s BERT (a minimal fine-tuning sketch appears after this list).

  • The industry has made more efforts to implement artificial intelligence with less data, including one-shot learning.

  • The combination of deep learning and reinforcement learning.

  • Continued progress on generative adversarial networks (GANs).
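
To make the transfer-learning point in the first bullet above concrete, here is a minimal, purely illustrative sketch of fine-tuning a pretrained BERT model for binary sentence classification with a recent version of the Hugging Face transformers library; the two-example “dataset” and the handful of training steps are toy-sized assumptions, not a real training recipe:

```python
# Toy sketch of transfer learning in NLP: start from pretrained BERT weights
# and fine-tune a small classification head on task-specific examples.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

texts = ["the product works great", "support never answered my ticket"]  # toy data
labels = torch.tensor([1, 0])

encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for _ in range(3):                               # a few fine-tuning steps, not a real schedule
    outputs = model(**encoded, labels=labels)    # returns the loss because labels are provided
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

The point of the technique is that the pretrained weights carry most of the linguistic knowledge, so only a modest amount of labeled task data is needed to adapt the model.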

Application-level trends

  1. The enterprise deployment phase of ML/AI has arrived

  2. The rise of enterprise automation and RPA

At this point, we are perhaps three to four years into the industry’s attempt to build ML/AI applications for the enterprise.

There have of course been some clumsy early product attempts (the first generation of chatbots) and plenty of marketing claims that ran far ahead of reality, especially as some companies tried to retrofit ML/AI onto existing products.

However, we are gradually moving into the enterprise deployment phase of ML/AI, from curiosity and experimentation to actual production use. The playbook for the next few years seems clear: take a given problem, figure out whether ML/AI (usually deep learning or one of its variants) can make a difference, and if so, build an AI application that solves it effectively.

This deployment phase will take several forms. Some products will be built and deployed by internal teams using the enterprise AI platforms mentioned above. Others will be full-stack products from various vendors with embedded artificial intelligence capabilities, where the AI portion may be largely invisible to customers. Still others will come from vendors offering a mix of products and services.

Of course, it is still early days. Internal teams typically start with a single use case (such as customer churn prediction) and then expand to other problems. Many start-ups building ML/AI applications are still learning how to get from R&D to full-scale production.

Nonetheless, maturity is coming. Over the past few years, anyone wanting to deploy ML/AI in real-world applications has had to learn a great deal. We are starting to understand better what the technology can and cannot do, and how to divide tasks correctly between machines and people. A lot has been learned from the first generation of AI applications; for example, the next generation of customer-service chatbots offers, from the user’s point of view, a smarter mix of ML/AI, configurability, and transparency.

Looking ahead, with ML/AI becoming more widespread on top of high-performance data stacks, are we seeing the dawn of the fully automated enterprise?

Since the advent of information technology, enterprises have been plagued by information silos: systems and data are scattered across departments and cannot talk to each other (which is what gave rise to a large system-integration services industry), with humans acting as the “glue” between those siloed systems. As data and systems become increasingly integrated, ML/AI can gradually remove humans from certain functions, and it is entirely possible for companies to operate in an increasingly automated, systematic way.

For example, imagine an automated company in which an increase in demand (predicted by ML) automatically triggers larger supplier orders, which are automatically recorded in the financial system (which in turn can automatically calculate and pay commissions and the like); or in which a drop in demand automatically triggers an increase in the corresponding marketing spend, and so on.
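
Purely as an illustration of the idea, the glue logic of such a loop can be tiny; every downstream function in the sketch below is a hypothetical stub standing in for real ERP, finance, and marketing systems:

```python
# Hypothetical sketch of "the automated enterprise": an ML demand forecast
# drives purchasing, finance, and marketing actions with no human in the loop.
# All downstream functions are illustrative stubs, not real system APIs.

def place_supplier_order(sku, quantity):
    print(f"ERP: ordered {quantity} units of {sku}")
    return "PO-0001"

def record_payable(order_id, amount):
    print(f"Finance: recorded payable {order_id} for ${amount:,.2f}")

def increase_marketing_budget(sku, amount):
    print(f"Marketing: added ${amount:,.2f} budget for {sku}")

def run_replenishment_cycle(sku, forecast_demand, on_hand, unit_cost):
    shortfall = forecast_demand - on_hand      # forecast_demand would come from an ML model
    if shortfall > 0:
        order_id = place_supplier_order(sku, shortfall)
        record_payable(order_id, shortfall * unit_cost)
    else:
        # demand is soft: shift budget toward marketing for this SKU instead
        increase_marketing_budget(sku, abs(shortfall) * unit_cost * 0.1)

run_replenishment_cycle("widget-42", forecast_demand=1200, on_hand=800, unit_cost=3.5)
```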

In the future world, companies will not only become fully automated organizations, but will eventually become self-healing and autonomous organizations.

However, we are still a long way from that stage. Today’s reality is mostly about RPA, a very hot area, with leaders such as UiPath and Automation Anywhere growing very fast and raising a lot of money.

RPA is short for robotic process automation (disappointingly, perhaps, no actual robots are involved). It takes workflows that are typically very simple, manual (performed by humans), and repetitive, and replaces them with software. Much RPA happens in back-office functions (for example, invoice processing).
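
To show how modest a typical RPA workflow really is, here is a toy sketch of back-office invoice handling: watch a folder, pull a couple of fields out of each file with fixed rules, and hand them to another system. The folder layout, field patterns, and ERP call are all hypothetical:

```python
# Toy sketch of a rule-based, RPA-style workflow: pick up invoice text files,
# extract fields with fixed rules, and pass them to a (stubbed) ERP system.
import re
from pathlib import Path

INVOICE_DIR = Path("invoices/incoming")     # hypothetical folder layout
PROCESSED_DIR = Path("invoices/processed")

def post_to_erp(record):
    # Stand-in for a real ERP API call.
    print("Posted to ERP:", record)

def extract_fields(text):
    # Fixed, rule-based extraction: this is automation, not intelligence.
    number = re.search(r"Invoice No:\s*(\S+)", text)
    amount = re.search(r"Total:\s*\$?([\d.,]+)", text)
    return {
        "invoice_number": number.group(1) if number else None,
        "amount": float(amount.group(1).replace(",", "")) if amount else None,
    }

def process_invoices():
    INVOICE_DIR.mkdir(parents=True, exist_ok=True)
    PROCESSED_DIR.mkdir(parents=True, exist_ok=True)
    for path in INVOICE_DIR.glob("*.txt"):
        record = extract_fields(path.read_text())
        post_to_erp(record)
        path.rename(PROCESSED_DIR / path.name)   # move the file so it is handled exactly once

if __name__ == "__main__":
    process_invoices()
```

Note that nothing here learns or adapts; if the invoice format changes, the rules have to be rewritten, which is exactly the technical-debt concern raised below.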

RPA is riding the wave of enterprise digital transformation, which has accelerated markedly over the past few years. Some RPA leaders have been around for a long time (UiPath was founded in 2005), but the category suddenly took off once digital transformation became an everyday topic. RPA also offers a compelling ROI, because the cost of an implementation can be compared directly with the cost of humans performing the same tasks. RPA is attractive to the technology services giants as well, since it involves a great deal of implementation work (countless different workflows need to be configured); RPA startups have therefore benefited from strong partnerships with these large services firms.

There are reasons to be skeptical of RPA. Some see it largely as an unwise “band-aid” or a matter of expediency: taking an inefficient workflow designed for humans and simply having a machine execute it. From this perspective, RPA may only be creating the next layer of technical debt. It is also unclear what happens to the automated functionality as the environment changes, other than requiring yet more RPA work to re-wire old tasks to the new environment. At this stage, at least, RPA is more automation than intelligence, more a rule-based solution than artificial intelligence (even though some RPA vendors have played up their AI capabilities in marketing materials).

RPA should be distinguished from intelligent automation, an emerging field centered on ML/AI. Intelligent automation also targets enterprise processes and workflows, but it is data-centric rather than process-centric, and it can ultimately learn, improve, and self-heal.

An example of intelligent automation is automated document processing (ADP), a category of products that use ML/AI to understand documents (forms, invoices, contracts, etc.) at a level comparable to humans.

It will be particularly interesting to watch these areas over the next few years. RPA and intelligent automation are likely to converge, whether through mergers and acquisitions or through organic product development, unless the latter progresses so quickly that it limits demand for the former.