An AI tsunami is on the rise, and the past few months have only amplified it. To survive it and thrive in tomorrow’s economy, organizations big and small must rethink the way they do business. To do this, a radical shift in the way they work with their data is needed. And no, we don’t mean Big Data.
By now, most organizations have gotten their Big Data. And that is a problem. Not because we can’t accommodate Big Data, but because the more data you have, the harder it becomes to connect it and use it. We need to go beyond Big Data, towards Connected Data.
We’ll show how enterprises can use decentralized Knowledge Graphs to vastly increase the connectivity of their data, drawing on hard-won experience of architecting and successfully delivering innovative technical projects for the world’s largest financial organizations.
Large enterprises that want to survive the AI tsunami must undergo a profound transformation in the way they think about their data. It starts by accepting that they need to link a large percentage of ALL their data together into a unified whole.
Achieving this will require a radical rethink of some established ideas about enterprise data integration. The truth is meaningful data doesn't exist in isolation; everything is positioned within the context of everything else.
That is why the future of data is graph shaped … but what are graphs and what is so great about them?
Knowledge Graphs are a really powerful tool, but on their own, they are not enough to transform enterprise data integration. We also need to get our heads around the complex idea of decentralisation. In a decentralised data mesh, the responsibility for data integration is pushed down to the individual applications.
Unbeknownst to most people, a third of all web pages now contain little islands of data that help the search engines build their knowledge graphs. Enterprises do not need to reinvent the wheel to build themselves a Decentralised Enterprise Knowledge Graph. They can just take this battle-hardened web tech and use it behind their firewall to connect their internal data.
In other words, the tools for this job already exist but enterprises are not yet using them internally. In this talk, we’ll share the hard-won experience of how this was done for the world’s largest financial organizations.
[00:00:00.150] - George Anadiotis Hello, everyone, and welcome to the latest edition of Connected Data London's online meetup. Today, we have a theme to our meetup, which is enterprise knowledge graphs and how they can be used towards a knowledge-based economy. We have three talks lined up for you, and actually we're already kind of late for the first one. So I'll try to keep my intro as short as possible and just give the floor to our first guest. Tony, the floor is yours.
[00:00:42.420] Actually, one last thing before I give you the floor: big thanks to our sponsors today, EnterpriseWeb and Franz, who are also presenting next week. Thanks, Tony. Floor is yours, OK?
[00:00:56.510] - Tony Seale Hi, guys. Assuming everyone can see my screen, let me know if there's a problem. So the first part of my message really is very simple, and that is that there is an AI tsunami coming, but most people don't realize it yet. And enterprises are going to have to perform a 180 in the way they think about their data if they're going to survive that AI tsunami. Small tweaks and changes aren't going to cut it.
[00:01:22.190] We're talking about an existential shift in mindset here. And this starts with accepting that the way that large enterprises run their data at the moment is fundamentally broken, in my opinion.
[00:01:35.900] How come your smartphone can recognize your face and understand your speech, but your I.T. department takes over two months to add a couple of fields to your report?
[00:01:45.140] And I think the first thing to realize here is that A.I. is built on top of a data layer. And within the modern enterprise, outside of the tech giants like Google and Facebook, who have got this one sorted,
[00:02:00.470] outside of that, the data layer is broken within the large enterprises. In fact, it's nonexistent. In a large enterprise, they'll have literally thousands of separately siloed pieces of data, with systems connected together in a spaghetti of different technology. There's little standardization around security access models, and there's a lack of the data culture which exists within the tech giants. Now, in many ways, this is not a new observation, because what we're talking about here is the data integration problem.
[00:02:39.560] And the classic solution to the data integration problem is to build yourself a data mart or a data warehouse, which is just a big database where you grab all of your data and perform what's called an ETL process to basically merge the data all into one place. But the data warehouse has several problems. First of all, you've got to define the schema up front, which is not very agile. And then, as the business changes, which it continually does, data warehouses really struggle to keep up with that.
[00:03:13.190] But perhaps more fundamentally, and this is where the change of mindset that I'm proposing comes in, in order to survive the AI wave, enterprises are going to have to change their thinking and start thinking about connecting the total sum of the data assets within their company. And even with a really good data warehouse solution, you're not really talking about connecting more than, let's say, one percent, or even less than that really, of the total data assets of the company. And where I see the future going, that's just not going to cut it.
[00:03:52.730] So then we've got the new kid on the block, which is Hadoop and various kinds of big table technologies. And this is a step forward. All of these things are important steps forward, and I wouldn't want to denigrate any of the amazing effort that's gone on on the road towards data integration. But I think it's fair to say that the promise of the crystal-clear waters of data lakes hasn't really materialized for the enterprises.
[00:04:22.160] They've blindly followed the tech giants into implementing this technology. And in most cases, they've ended up with a kind of data swamp which kind of sits there, not really doing very much. And that's because, although they've grabbed related data, they haven't really tackled the hard part of the problem, which is conforming to schemas, linking and connecting your data, cleaning it up and extracting out the knowledge. And then also I've got news, which maybe is not news to many of the people on this presentation, but to some people will be: the cool kids have moved on.
[00:05:00.820] And in fact, they moved on quite a long time ago.
[00:05:06.300] So Google and Facebook have started building knowledge graphs over the top of the data structures they've got. And the knowledge graphs provide schema and meaning to the data.
[00:05:16.270] Now, I think what we're going to see as knowledge graphs move up the hype cycle, and someone told me the other day that they're already kind of into the trough of disappointment, is that enterprises will increasingly look to adopt an enterprise knowledge graph. But I've got a clear message here that I'd like to convey, which is that I don't think enterprises should just repeat the mistakes of the past and blindly copy what Google and Facebook are doing here. What do I mean by that?
[00:05:44.880] Well, you know, the knowledge graph is a genuine innovation. So for the first time, data relationships are able to be made first class citizens, and that's really, really important. But the situation for the enterprise is different from the Internet giants.
[00:06:01.290] There's the diversity of the data, and then just a kind of lack of data culture, which is deep-seated in the Internet giants but is not there within the enterprise. And that needs to be tackled. And in this sense, I think what we need to do is combine the powerful technology of the enterprise knowledge graph with another topic which is emerging up through the hype cycle, which is decentralization. And it's actually through the combination of these two things, creating a sort of distributed enterprise knowledge graph, that we can potentially unlock this revolution.
[00:06:43.550] So, let's turn to the graph part of this then. Most of you will know what graphs are, but I'd just like to spend a little bit of time to run through them and explain what's so great about them. And the first thing that I think is great about them is their very intuitive nature. So a graph is made of nodes, which we can see as these circles here, and these nodes are connected by what are called edges.
[00:07:10.840] And in terms of how you might kind of draw out your own knowledge about something, then this structure, this graph structure, is probably very intuitive to you. And it's maybe something you'd even kind of doodle on a piece of paper.
[00:07:27.670] But OK, so that's graphs, but what is so great about them in this context?
[00:07:33.760] Well, the first thing I think is that they can capture the fundamentally messy nature of data, but they still retain enough structure for computers to sort of reason over them. And I've started calling it a kind of structured scruffiness. So let's take the two main non-graph data structures that we're all familiar with today. These are tables, like Excel and relational databases, and trees. So the tree within your file system or, if you're technical, XML documents or JSON documents, for example, are tree-like structures. And the flexible nature of graphs means that they can model both tree-like data and tabular data with ease.
[00:08:24.220] They can also contain even more complicated things like cycles. And then this flexible nature means that they can even be used, and are used out on the web, to provide islands of certainty and structure within more complicated data structures like text and images.
[00:08:45.790] So in this sense, the graph doesn't so much replace these other data structures as provide a universal way of representing them and connecting them together, which I guess is why it's particularly right for this data integration problem.
[00:09:03.450] That kind of goes to what is really the next great thing about graphs, and that is that graphs make relationships first-class citizens, just as important as the nodes themselves. And I would suggest that the more time that you spend thinking about data, the more that you realize that, in fact, data is all about the relationships, that the facts don't exist in isolation, but rather they are positioned within the context of everything else.
[00:09:37.000] So maybe spend a moment after this contemplating that. And I guess I'd ask you to stop thinking about data in terms of separated, business-aligned systems and instead concentrate on the relationships between them. Start thinking about your data as a kind of connected whole. Now, some people will say, well, I could do all of this within a different data structure, I could do this within a relational database or graphs or documents or whatever their particular favorite one is.
[00:10:10.000] And whilst this is true, I think it misses the point, because graphs are made for this job; they make it really simple. All you have to do is break your data down into these three-part statements. So: Ben is a person. Bob is a person. Ben's son is Bob. Bob lives in the U.K. You can just keep on adding these little triples ad infinitum, and that's it. You atomize your data into these three-part statements.
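The three-part statements just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the predicate names (`is_a`, `has_son`, `lives_in`) are invented for the example, and the third statement is read here as "Ben's son is Bob", which is an interpretation of the spoken example rather than something the talk spells out.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) triple.
# Predicate names are illustrative assumptions, not a standard vocabulary.
triples = [
    ("Ben", "is_a", "Person"),
    ("Bob", "is_a", "Person"),
    ("Ben", "has_son", "Bob"),   # assumed reading of the spoken example
    ("Bob", "lives_in", "UK"),
]


def build_graph(statements):
    """Index triples as node -> list of (edge label, neighbouring node)."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        graph[subject].append((predicate, obj))
    return graph


def reachable(graph, start):
    """Return every node that can be reached from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(obj for _, obj in graph.get(node, []))
    return seen


graph = build_graph(triples)
print(sorted(reachable(graph, "Ben")))  # ['Ben', 'Bob', 'Person', 'UK']
```

Adding a new fact is just appending another triple; nothing about the existing structure has to change.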
[00:10:39.550] And then, as if by magic, out the other side pops this huge graph. It's not rocket science, and there are some misconceptions about that, really. I think it's actually rather simple and anybody can do this. And that's why I think graphs are one of the main keys for connecting an enterprise's data together into a unified layer. So now I want to turn to the second key, the concept of decentralization.
[00:11:27.020] This concept is a bit harder to get your head around; it's even got a complicated name, decentralization. So let's try and break that down a little bit. And I'm sorry if I get a little bit philosophical here, but I will return to data. Broadly speaking, I think we can break any complex system down in three different ways.
[00:11:52.250] Firstly, we've got disordered systems. So that's a system with no rules or laws. We can think of this as the Wild, Wild West, where you can do whatever you like.
[00:12:02.600] Then secondly, we have highly centralized command-and-control systems, where instructions are issued from the center down to the masses. And you can think of the former USSR here or, in a more scary scenario, Skynet from the Terminator movie.
[00:12:21.740] And then finally, we have decentralized systems, where we have a universally agreed upon set of standards that everyone follows. And then things are organized amongst the nodes themselves, where they compete and collaborate with each other within this kind of common framework. And to illustrate the difference between the centralized model and the decentralized model, I like to tell a story, which is that apparently an official travelled from the former USSR to learn about the free market in London because, despite years of hard work by a really intelligent team, the talented experts had failed to minimise the bread queues, and the Russian people were having to queue a long time for bread. Anyway, whilst travelling around London, looking at the London Stock Exchange, et cetera,
[00:13:16.550] et cetera, and these various instruments of the free market economy, he eventually said, look, what I really want is for you to introduce me to the guy who's in charge of bread supply, because we've been working on this problem for absolutely ages.
[00:13:29.090] And, you know, I'm desperate to meet him and learn his great secret. Take me to him right now. And of course, within London, there was no one in charge of bread supply. And let's think about that for a second. So within the regulated market, the companies collaborated and competed with each other, and a highly optimized bread supply chain simply arose. That's a really important point there. The complexity was just too great for these guys to hold in their heads, and in the decentralized system,
[00:14:05.920] responsibility was instead pushed down into the individual companies, and then an optimized bread supply chain simply rises out of thin air. That is the power of decentralization. So we can turn back to data now. In most large enterprises, I think it's fair to say that they operate in some kind of Wild, Wild West mode, where applications exchange data in a hodgepodge of different formats. From a high-level perspective, or even the low-level perspective as a developer, to be honest, the whole thing is a complete nightmare of different technologies and just basically a disorganized mess. To see this,
[00:14:54.360] I don't know how many of you are technical, but if you've ever seen a kind of information flow of your company, or just one department within your company, you'll know exactly what I'm talking about. So let's turn to the centralized model, the one that we kind of looked at in the initial couple of slides. And to be honest, the centralized model of organization in general, I sort of find something slightly creepy about that.
[00:15:25.790] But when we turn to looking at the centralized model for data integration, then scary efficiency really is the very last of our concerns. My message here is that it doesn't seem to matter what you put in the center, a data warehouse or a data lake or, perhaps controversially, an enterprise knowledge graph: the centralized system just can't seem to connect more than a very small percentage of the overall company's data. And if you think about it, it kind of makes logical sense, because these guys are like the USSR guys trying to plan the bread supply for the whole country.
[00:16:12.510] There's just no way that this central store can pull and organize all the data from thousands upon thousands, like literally somewhere between 50 to 70 thousand, different separate systems. There's just no way that the centralized system can contain all that complexity within itself when each one is often quite highly complex in its own right. Anyway, here's the bottom line. After 20 years or more of dedicated effort, the centralized approach just hasn't been able to link more than a small percentage of the overall company's data.
[00:16:51.870] Meanwhile, the AI revolution is unfolding around us and the clock's ticking. So let's think about a pivot here. Let's apply the decentralized model to data integration. In a decentralized database, applications publish and consume data in a standard structure. The data stays where it is, but then it's exposed in this standardized layer. It's published rather than pulled, and individual applications are responsible for linking to each other. In this way, responsibility for the data integration is pushed down into the individual applications, who, after all, let's face it, are the experts in their own data and already know how they connect to each other in the spaghetti bowl.
[00:17:40.310] So here I like to ask a question: could this model do better than the one percent of connection? Could it come close to connecting all of a company's data together? And then I think, well, imagine that the CEO of a large enterprise said: within the next six months, I want every single one of these applications to publish their data according to this standard form, and then where they connect to each other,
[00:18:07.490] I want you to form links between each other. Go off and do it, each one of you. It's my number one priority. Make it happen within six months. Could the resulting effort bring all of the data into a unified whole? Well, I don't know about six months, but in a way, I think it sort of could. It would be rough and messy at first.
[00:18:27.410] But if you set up what I'm calling the right internal data market, then with a bit of luck and skill, hopefully the motor of decentralization would fire up, just like it did with the bread supply chain, and a clean and connected data layer would arise.
[00:18:49.660] So I leave you to think about that one. OK, so having looked at decentralisation, we can now turn back to the more intuitive concept of graphs, because, of course, this standard structure that we're going to ask each of the applications to publish in is the very same one that we talked about before, these simple three-part statements. So each one of these applications is publishing up the three-part statements, and they're therefore making available just their part of the graph. And then where a data item from one of these applications links to the data in another one of the applications, the applications create links between each other, and therefore we get our decentralized graph.
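As a rough sketch of that idea, the snippet below models two applications that each publish only their own statements, with one of them also publishing the link into the other. The application names, identifiers such as `crm:customer/42`, and the predicate names are all hypothetical, chosen only to make the example concrete.

```python
# Hypothetical applications, each publishing only the statements it owns.
crm_app = {
    ("crm:customer/42", "name", "Acme Ltd"),
    ("crm:customer/42", "located_in", "UK"),
}
trading_app = {
    ("trade:order/7", "instrument", "GBP/USD"),
    # The trading application knows which customer placed the order,
    # so it publishes the cross-application link itself.
    ("trade:order/7", "placed_by", "crm:customer/42"),
}


def merged_graph(*published):
    """The decentralized graph is the union of what each application publishes."""
    graph = set()
    for statements in published:
        graph |= statements
    return graph


def owner(identifier):
    """Application prefix of an identifier, or None for plain values."""
    return identifier.split(":", 1)[0] if ":" in identifier else None


def cross_links(graph):
    """Statements whose subject and object belong to different applications."""
    return [
        (s, p, o)
        for s, p, o in sorted(graph)
        if owner(s) and owner(o) and owner(s) != owner(o)
    ]


graph = merged_graph(crm_app, trading_app)
print(cross_links(graph))
```

No central store copies the data here: the union exists only because each application exposes its own part and the links between parts.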
[00:19:47.420] OK, so we've talked in hypotheticals, but down at a more realistic level: this all sounds very nice, but what would it actually look like to implement something like this? And I can't tell you all of that over this talk now.
[00:20:03.490] But let me give you an intuition. Think about the Web. It's the most successful integration project in human history, I would claim. It connects all sorts of information: you've got products on there, financial instruments, you've got facts, you've got photos, you've got music, people have put up masterpieces. It's almost like the near sum of human knowledge. It's spread out across the entire planet and accessible through a browser window. And in short, I believe that is a proven model for large-scale, complex information integration.
[00:20:43.060] So from a technical perspective, we can view the Web as a decentralized graph of connected documents. And as I've hopefully laid out, it has the potential to revolutionize data integration. So this is the cool thing here, I think: with the Web, all of the building blocks are there already to create this decentralized knowledge graph.
[00:21:15.150] Put more simply, we can use the tools of the web to connect the near sum of the data within an enterprise together so that, just like with web browsing, it can be viewed through a single lens. Let's drill into that. So in the Web technology, we've got some core things: servers, we have HTML, and then we have hyperlinks. And that means that anybody can publish up their document and connect it to someone else's document, regardless of whether that person is living next door or is on the other side of the world.
[00:21:52.550] And what may be news to some of you is that over the last few years, some incredibly clever people have worked really hard to adapt these same servers, URLs and hyperlinks for data.
[00:22:12.290] So to give you an intuition on that: currently, if you click on a link, you get back the Web equivalent of a Word document. And with the tech that these guys have created, you click on a link and get back the Web equivalent of [unclear], and where one links to another, we use a hyperlink to connect them. So at the heart of this idea is the URL: each data item gets its own unique URL, just like each page does on the Web.
[00:22:48.920] Right. And we use the URL of that data item to create a hyperlink.
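A minimal sketch of giving each data item its own URL and using that URL as a hyperlink between data items might look like the following. The host names (`crm.example.internal`, `trading.example.internal`, `schema.example.internal`) and the `placedBy` property are made up for illustration; the talk does not prescribe any particular URL scheme.

```python
from urllib.parse import quote, urlsplit


def mint_url(host, kind, local_id):
    """Give a data item its own URL, built from its owning host, type and local id."""
    return f"https://{host}/{quote(kind)}/{quote(str(local_id))}"


# Hypothetical data items owned by two different applications.
customer = mint_url("crm.example.internal", "customer", 42)
order = mint_url("trading.example.internal", "order", 7)

# A hyperlink between data items is a statement whose object is another item's URL.
link = (order, "https://schema.example.internal/placedBy", customer)


def is_link(statement):
    """True when the object of a statement is itself an http(s) URL."""
    parts = urlsplit(statement[2])
    return parts.scheme in ("http", "https") and bool(parts.netloc)


print(customer)       # https://crm.example.internal/customer/42
print(is_link(link))  # True
```

Because the identifier is a URL, any consumer that holds the link also knows where to go to find out more about the item it points to.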
[00:22:55.650] This is data integration at web scale, and most people don't know that a third, yeah, that's a third of all Web pages now contain these little islands of data, and Google and Facebook are crawling the web and building their knowledge graphs out of these.
[00:23:13.980] So that's in a third of all Web pages. Wait a second: if this works at web scale, why not take this exact same tech and use it within the enterprise firewall? It gives us everything that we need for our enterprise knowledge graphs.
[00:23:43.310] Let's go down to the implementation level for a second: what would this look like? Well, first of all, above each application, you'd give it its own little data website. So that's an HTTP server, and it publishes its key data in the W3C standard RDF, which is nothing different than those little three-part statements that make up our distributed graph.
[00:24:11.090] And they give each data item a unique URL.
[00:24:18.160] Each application publishes its part of the graph, and then they're connecting and linking together with each other through resolvable URLs.
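To make the "data website" idea a little more concrete, here is a minimal sketch using only the Python standard library: a small HTTP server that answers a request for a data item's URL with RDF statements in N-Triples syntax. The paths, property URLs and sample data are hypothetical, the literal escaping is deliberately minimal, and a real deployment would more likely use an RDF library and a production web server.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical data held by one application, keyed by URL path.
# Objects starting with "http" are treated as links, everything else as text.
DATA = {
    "/customer/42": [
        ("http://schema.example.internal/name", "Acme Ltd"),
        ("http://schema.example.internal/account",
         "http://accounts.example.internal/account/9"),
    ],
}


def to_ntriples(subject_url, properties):
    """Serialize one data item as N-Triples, one statement per line."""
    lines = []
    for predicate, obj in properties:
        if obj.startswith("http"):
            rendered = f"<{obj}>"
        else:
            # Minimal escaping only; real data needs the full N-Triples rules.
            escaped = (obj.replace("\\", "\\\\")
                          .replace('"', '\\"')
                          .replace("\n", "\\n"))
            rendered = f'"{escaped}"'
        lines.append(f"<{subject_url}> <{predicate}> {rendered} .")
    return "\n".join(lines) + "\n"


class DataWebsite(BaseHTTPRequestHandler):
    def do_GET(self):
        properties = DATA.get(self.path)
        if properties is None:
            self.send_error(404, "Unknown data item")
            return
        host = self.headers.get("Host", "localhost")
        body = to_ntriples(f"http://{host}{self.path}", properties).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/n-triples")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo output quiet
        pass


def fetch(url):
    """Resolve a data item's URL and return the statements published there."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    server = ThreadingHTTPServer(("127.0.0.1", 0), DataWebsite)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    print(fetch(f"http://127.0.0.1:{server.server_port}/customer/42"))
    server.shutdown()
```

The second statement in the sample data points at a URL on a different host, which is how one application's data website would link into another's.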
[00:24:33.940] This architecture scales like the Web scales: you can keep on adding web servers, and that's exactly what you can do here. The data is simply exposed out in this vast connected network. And [unclear], just as nobody today tries to
[00:24:57.830] download all of the Web into the web browser. A user or an application doesn't need to have, or for that matter doesn't even have the rights to access, all of the data. At a kind of simplistic level, we could view this as: we've got this vast distributed graph made available by the data website publication, and then we've got these small, fast, compact graphs where
[00:25:33.360] the user is basically just getting the subgraph they need. By connecting information, the Web has made humanity more powerful, I think. We can say that not everything about the Web is good, but I think undeniably, at a species level, by creating this rich network of connections, it's made humans faster, it's made us more efficient, and it's made us more adaptable to change. Now, when we look at the data integration problem and we look at the data that is within our large enterprise, the problem can, of course, appear to be insurmountable.
[00:26:25.470] And for me to be coming on here and talking about connecting the near sum of an enterprise's data into a whole can seem ridiculous and somewhat silly. But all I would say to you is that the Web shows that it is possible to connect information at scale; it gives us a model for success. So what do we have to do?
[00:26:48.950] Well, we take the standardized, battle-hardened tech stack that runs the Web and we apply it within the boundary of our enterprise firewall.
[00:26:59.290] The result of that would be a decentralized enterprise knowledge graph that is capable of integrating the near sum of all of your data together into a richly connected network. And connecting your data will make your enterprise faster, more efficient, and more adaptable to change. In short, it'll make it more powerful and much better positioned for the oncoming AI tsunami. And that's me finished there.
- George Anadiotis Good-looking slides. Some people asked: how did you actually produce those slides?
[00:27:57.680] - Tony Seale They are handwritten. I guess it shows how obsessed I've been with this subject that I've lovingly handwritten them.
[00:28:10.270] - George Anadiotis I thought so, but, you know, I just figured that many people were actually curious. So another question that we got was whether you can recommend a technical stack for implementing this vision that you outlined, specifically using open source tools, something that you're familiar with, that you have worked with and can recommend.
[00:28:35.500] - Tony Seale Yeah, so one of the good things is there is a lot of open source here.
[00:28:41.710] So if you were looking to create your sort of data website and your data consumer, you could do a lot worse than look at the initiative that's currently being led by Tim Berners-Lee, which is the Solid initiative. So it's out on the Web.
[00:28:59.290] They're thinking about it in terms of allowing data sharing for individuals out on the Web.
[00:29:10.210] But it provides a pretty good model as a starter there.
[00:29:14.860] Then in pretty much every language that you can name, you will find HTTP servers, and you will also find libraries for working with RDF and manipulating that. Those libraries will include stores for querying the data. But ultimately, if you start pushing the boundaries of that and want to have a lot of information in one of these nodes, you will probably want to turn to a commercial, what they call, triple store at that point, of which there is a vibrant market developing.
[00:29:54.790] Yeah, yeah.
[00:29:55.290] - George Anadiotis I was going to say that this is really a great initiative, even though at this point it's not really production ready. I mean, we have seen some use cases recently presented by commercial entities that are using it. But at this point, it's more proof of concept than production.
[00:30:14.620] Yeah, another question that we got is whether you can very, very quickly outline the difference between a relational database and a graph in terms of the result for an application. So the question is that, well, you can query data using both, and, you know, to the end user it may look the same. I guess the answer would be that, well, it's not as easy to integrate data, and that's the whole point.
[00:30:41.770] - Tony Seale But yeah. Once you execute a query, if it's coming back in tabular form... I mean, generally you get an option: either I'm going to execute a query and bring it back in tabular form, or I'm going to bring back, basically select, a subgraph. If it comes back in graph form, then that's a richer structure that you might do graph visualizations over and that sort of thing. So that's kind of one difference.
[00:31:09.850] But at the table level, not particularly different. But I think you get the point there: for me, the power of the graph database is that it's making connections, the relations between data items, first-class citizens. And when you marry that up with HTTP and a decentralized approach, it gives you the kind of power you need in order to connect your information together.
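The difference between a tabular answer and a subgraph answer can be illustrated with a small sketch over the same set of statements. The data and the function names `select_rows` and `select_subgraph` are hypothetical and do not correspond to any particular query language or product.

```python
# Hypothetical statements; identifiers and predicates are illustrative only.
graph = [
    ("order/7", "placed_by", "customer/42"),
    ("order/8", "placed_by", "customer/42"),
    ("customer/42", "located_in", "UK"),
]


def select_rows(graph, predicate):
    """Tabular answer: one (subject, object) row per matching statement."""
    return [(s, o) for s, p, o in graph if p == predicate]


def select_subgraph(graph, start, depth):
    """Graph answer: every statement within `depth` hops of `start`."""
    frontier, found = {start}, []
    for _ in range(depth):
        hop = [t for t in graph if t[0] in frontier and t not in found]
        found.extend(hop)
        frontier = {obj for _, _, obj in hop}
    return found


print(select_rows(graph, "placed_by"))
print(select_subgraph(graph, "order/7", 2))
```

The rows look like any relational result, while the subgraph keeps the relationships themselves, so it can be visualized or merged with statements from another source.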
[00:31:35.800] - George Anadiotis Mm hmm. So one last question that we can take. Let me see. So what's the probability that the data created by different applications actually communicates the same things with different expressions? I think that's a tricky one.
[00:31:52.260] - Tony Seale It is. But I'd like to speak to it.
[00:31:54.040] So this is where this kind of idea of creating an internal data market comes in. It's by no means the end of the job for the people in the centralized data and architecture teams.
[00:32:13.540] They've got a big job in front of them now. It's just kind of changed. So now we're talking about, OK, we're going to push this data integration problem out to the entire organization and let everybody get involved in it. So we're going to let users kind of start specifying the queries that they want, and we're going to let applications start publishing the data that they hold and start connecting to each other. But if they just publish out exactly what they've got, then all you're doing is publishing a massive hairball that's just going to be as messy.
[00:32:44.830] It will be connected, but it will be a disorganized mess. So you've again run away from the problem if you do that. So the job of the central architecture team becomes a very, very hard one, which is to kind of steward and shepherd this process. So they need to define what's called an upper ontology, which is basically a sort of model. And fundamentally, this is part of the mindset shift that an enterprise goes through: stop thinking about its data in terms of the applications that are holding the data, and put a kind of interface over the top of that.
[00:33:24.110] And that interface is the model, is the common enterprise ontology.
[00:33:28.550] So within these large enterprises, they might have 50 to 70 separate applications, some of those applications with even thousands of different tables in them. But when you kind of boil it down to the type system that's sitting in there, the actual entities that these things are representing, like the trades or the people, et cetera, you're probably not talking about more than sort of five or six hundred concepts. So it's a massive, massive complexity reduction to abstract away the application layer and put in front of that this nice, clean, connected model.
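One way to picture that complexity reduction is a mapping from each application's local table and column names onto a small shared set of concepts. The sketch below is purely illustrative: the concept names, the application names and the local names such as `TBL_TRD` are invented, and a real enterprise ontology would be expressed in a proper ontology language rather than a Python dictionary.

```python
# Hypothetical shared concepts defined by the central architecture team.
ENTERPRISE_CONCEPTS = {"Trade", "Person", "tradedBy"}

# Hypothetical per-application mappings from local names to shared concepts.
MAPPINGS = {
    "trading_app": {"TBL_TRD": "Trade", "TRADER_ID": "tradedBy"},
    "hr_app": {"EMPLOYEE": "Person"},
}


def to_common_model(app, statements):
    """Rewrite an application's local terms into the shared model.

    Returns (mapped, unmapped) so that statements with no agreed concept
    can be reviewed instead of being published as-is.
    """
    mapping = MAPPINGS[app]
    mapped, unmapped = [], []
    for subject, predicate, obj in statements:
        if predicate == "type" and obj in mapping:
            mapped.append((subject, "type", mapping[obj]))
        elif predicate in mapping:
            mapped.append((subject, mapping[predicate], obj))
        else:
            unmapped.append((subject, predicate, obj))
    return mapped, unmapped


local = [
    ("trd-7", "type", "TBL_TRD"),
    ("trd-7", "TRADER_ID", "emp-3"),
    ("trd-7", "BOOK_CD", "X1"),  # no shared concept agreed yet
]
mapped, unmapped = to_common_model("trading_app", local)
print(mapped)
print(unmapped)
```

Keeping the unmapped statements separate gives the central team a concrete list of local terms that still need a home in the shared model.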
[00:34:10.160] But that's the hard work. What the architecture that I have laid out here provides you with is the technical tools in order to do that work. So you've got the graph part, which allows you to do the connectivity, you've got HTTP and you've got the whole Web stack, which gives you endless scalability. So those things aren't going to be a problem. You've got the tools there, but it doesn't do the job for you.
[00:34:34.430] There's not some machine learning algorithm yet that's going to come in and clean and connect up all of your data. No, that job lies in front of you. And hopefully the decentralization process helps, with the correct incentives and the correct regulated market, and the regulation of the market is coming out of the central data team. So they are defining these models. They are creating working groups of interested people and getting applications to publish into these models and conform to this enterprise ontology.
[00:35:05.390] They've got to kind of rate applications and give them star ratings for the quality of the data that they're publishing and encourage people to move towards that. You basically have to create an ecosystem. How long that will take to do, you know, it's a big job, but it's a pathway forward from where we are now to where we need to be.
[00:35:30.200] - George Anadiotis Okay. Thank you. Yes, indeed. Ontology mapping and vocabulary mapping is one of the thorniest problems in this domain, so there's no way you could possibly address it fully now. I think we've already said enough, and we're already a bit over time as well. So we have to wrap things up now. We'll try to address as many of the remaining questions as we can and answer many of those offline. Thank you. Thanks.