
Data Science Mixer

Tune in for data science and cocktails.

Constraints inspire creativity in data science. Vukosi Marivate of the University of Pretoria talks about doing NLP with low-resource languages, and shares how growing multinational collaboration is shaping the future of data science in Africa through innovation. 







Cocktail Conversation


Vukosi and his team have made innovative leaps when faced with limitations, coming up with new strategies for successful NLP projects, even with the challenge of small amounts of data. Have you had a moment when a data science challenge inspired and motivated you towards a new innovation?




Join the conversation by commenting below!




Episode Transcription

VUKOSI: 00:00

I had coffee for breakfast, and that was a bad idea. [laughter]

SUSAN: 00:03

[Oh, well?].

VUKOSI: 00:04

And yeah, yeah, and by the time I had lunch, my body was like, "Wow, food." [laughter]

SUSAN: 00:09

Oh, I think we've all been there. Yeah, I totally understand. Just a straight caffeine shot there. All right. Vukosi Marivate clearly enjoys coffee. As you may know, on Data Science Mixer, we encourage guests to have a fun drink or snack during our chat. And long after that caffeinated breakfast, Vukosi was having--

VUKOSI: 00:31

Yeah. Another cup of coffee to go. [laughter]

SUSAN: 00:35

More coffee.

VUKOSI: 00:35

I typically have my meetings in the afternoon after 1:00 p.m., so to keep things going, I need coffee, so yeah, yes. [laughter]

SUSAN: 00:45

[music] Vukosi needs coffee because you've got to have something to keep you going when you're busy building and training models with very little data, not to mention teaching and building multinational collaborative data science projects.

VUKOSI: 00:59

I'm Dr. Vukosi Marivate. I am the chair of data science at the University of Pretoria, in South Africa. And I primarily work at the intersection of machine learning and natural language processing. But I tend to put it at 60% of my group works in natural language processing, especially lower-resource settings, and 40% work in using machine learning or coming up with new machine learning methods, especially for societal [services?].

SUSAN: 01:27

Oh, fascinating, and I'm really excited to talk about those topics with you. Before we get started on that, do you mind also sharing with us what pronouns you use?

VUKOSI: 01:35

Oh yeah. I guess "he", "him".

SUSAN: 01:38

Vukosi's projects are fascinating and super innovative. He and his collaborators and students have turned the limitations of doing NLP with lower-resource languages into strengths. During our conversation, he tells us all about how Twitter and a TV show got him into data science, how societal and historical factors can affect machine learning models, and how he's helped create a community of practitioners and stakeholders that together will shape the future of data science across Africa. Let's jump in. So could you tell us a little bit about your career and how you got into data science?

VUKOSI: 02:22

All right. Yeah, after high school, I went to Wits University here in Johannesburg, and I was very much like-- the reason I went there was one of the schools, the engineering school there that had on their brochure that they did AI. And so, at that time, I was like, "Oh, okay, let me apply to this program." When I moved into electrical engineering, specifically a subset called information engineering, especially in my third year, I started doing more work in-- or classes in control networks and data, basically. And from there, you catch on like, "Oh, here's machine learning; here's AI." And by my fourth year, I would be a senior year in the US sense, I was really working in those spaces, doing I think my final year project was on the recommendation system. From there, it really was, "Hey, I think this could be a story of my life." On that part, my advisor, he encouraged me to do a master's degree with him, and then after that also to apply for external fellowships. And I ended up being a Fulbrighter in the US, at Rutgers, and doing a PhD in computer science, specifically in reinforcement learning. So I enjoyed that. I did reinforcement learning for my master's degree and my PhD because I was also fascinated in this part of how do you get a machine to learn from only being given a signal that is indirect, right? If you think about typical machine learning, in terms of supervised learning, you've got these examples where it's your input and your output. While with reinforcement learning, you've got a reward or a utility, and then the machine has to learn what it's supposed to actually be doing on that part. And then as I was doing that at Rutgers, it became more and more that the thing that got me to wake up in the morning as I was really interested in, "Hey, how do we actually make this practical for people," right? There is one part where you look at the theoretical underpinnings and you come up with interesting problems. 
But then there was the other part of, okay, somebody who doesn't know reinforcement learning as deeply as I do now or my lab, how do they get to engage with reinforcement learning? And as such, the final kind of topics in my thesis for my PhD, we're looking at that or how you do better evaluation where you don't have access to your environment, identifying how you can set up problems. You also estimate the uncertainty that you'll have in your reward, right? Given or what you will get with your policy and all the-- those are all in the direction of trying to understand what you could actually do for people using RL in the practical sense. And the use cases were things like education and health, where typically you are not going to allow a machine to experiment on humans in a direct way, [laughter] so you're going to have to do it indirectly. And through that, that's when I started learning more and more about this term called "data scientist", right? Because you are having these people who are getting data, having a problem, and trying to use different methodologies to get to solutions on that part. And yeah, I ended up liking it. Once I finished the PhD, I came back to South Africa, and then I was like, "Hey, I think I'm going to spend at least a year pretending to be a data scientist. Then let's see how it goes." And that was 2015. And here we are in 2021, and I'm still in that space. [laughter] And it's still very kind of enjoyable.
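The contrast drawn above, between supervised learning with input-output examples and reinforcement learning with only a reward signal, can be illustrated with a very small sketch. This is not taken from Vukosi's research. It assumes a hypothetical two-option problem (a "bandit") with made-up reward probabilities, and uses the standard epsilon-greedy strategy as a stand-in for a learning method. The learner is never told which option is correct; it only sees rewards.

```python
import random


def run_bandit(arm_probs, steps=2000, epsilon=0.1, seed=0):
    """Learn which arm pays best using only reward feedback (epsilon-greedy)."""
    rng = random.Random(seed)
    estimates = [0.0] * len(arm_probs)  # running mean reward per arm
    pulls = [0] * len(arm_probs)        # how often each arm was tried
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random arm.
            arm = rng.randrange(len(arm_probs))
        else:
            # Exploit: pick the arm with the best current estimate.
            arm = max(range(len(arm_probs)), key=estimates.__getitem__)
        # The only signal is a 0/1 reward; no "correct answer" is ever shown.
        reward = 1.0 if rng.random() < arm_probs[arm] else 0.0
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return estimates, pulls


# Hypothetical reward probabilities, chosen only for illustration.
estimates, pulls = run_bandit([0.2, 0.8])
best_arm = max(range(len(estimates)), key=estimates.__getitem__)
print("best arm:", best_arm, "estimates:", [round(e, 2) for e in estimates])
```

In the education and health settings mentioned above, the exploration step is exactly what cannot be done freely on people, which is why evaluation without access to the environment matters.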

SUSAN: 05:44

Good. Well, the pretending to be a data scientist is going well. [laughter] So good job. That's so funny. Very cool. So the reinforcement learning work, is that something that you're still doing in your current role at the university?

VUKOSI: 05:58

Not directly. So that's not my biggest area anymore. But most of the other work, as I said at the beginning, is really in natural language processing. So in 2015, when I started on this journey of pretending to be a data scientist-- there was this graduate class I took during my PhD where we had worked on networks. In that class, I chose a project of collecting Twitter data just before so many things were closed off by Twitter. So this is 2011, which is like very interesting days before they closed the firehose, at that time. I worked on constructing a network around this South African TV show that was edutainment. At that time, it was very new, where you had the TV show tweeting during the show itself when it's being broadcast and getting people to react and all of those things. I don't think even in the US you had any of that happening in 2011. But in South Africa, they were doing that. They were getting all this engagement. And I thought, "Hey, can you identify kind of who the influential people were during the episodes and how they spread their message?" The show was mostly having to do with HIV and AIDS. But then I was like, "Oh, instead of looking at that network, I want to look at the network from social media and how to identify this propagation and how this propagation gets amplified or it dies down." I think I remember writing a blog post at the time called "Finding Ashton," because, at that time, Ashton Kutcher was like, I think, the Twitter user with the most followers or something [in figures?].

SUSAN: 07:22

That's right. Yeah, I remember this phase, yeah. [laughter]

VUKOSI: 07:23

Back then, it was like, "No, yeah, how do you do that?" But the thing I didn't use, at that time, was the text. I was just using these connections, that this person was tweeting at this person with the hashtag that was connected. And I was like, "I need to learn about that stuff. How do I use the text later?" So in 2015, that was really in earnest when I started my journey through natural language processing, specifically through social media modeling and analysis. And as I kept on going down that route, that's when I started going like, "Oh, but how do you build tools like we're seeing right now with things like GPT-2 but for African languages?" And that's when working in a lower-resource setting became very important for me. And that's where most of my energy on a day-to-day basis is put, towards resolving those situations, and then getting to the point where I think we have tools that are comparable to what we have for languages like English.
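The network analysis described above, using who tweeted at whom rather than the tweet text, can be sketched as a simple mention count. The usernames and edges below are invented for illustration, and ranking by how often a user is mentioned is only one basic notion of influence, not necessarily the measure used in the original class project.

```python
from collections import Counter

# Hypothetical (author, mentioned_user) pairs, as might be extracted from
# tweets carrying a show's hashtag. All names are made up.
mentions = [
    ("thabo", "show_account"),
    ("lerato", "show_account"),
    ("sipho", "lerato"),
    ("thabo", "lerato"),
    ("show_account", "lerato"),
    ("lerato", "sipho"),
]


def rank_by_mentions(edges):
    """Rank users by in-degree: how many times other accounts mention them."""
    in_degree = Counter(target for _, target in edges)
    return in_degree.most_common()


ranking = rank_by_mentions(mentions)
for user, count in ranking:
    print(user, count)
```

Note that this uses only the connections. The tweet text itself is ignored, which is the gap that later led toward natural language processing.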

SUSAN: 08:18

Right. Right. I love the story. I love how your analysis of Twitter and kind of this public health focus from the edutainment setting has then led you toward analyzing text and really digging more deeply into that to find insights there and then to really learning and developing NLP techniques. So talk to us a little bit about that, some of the challenges that are faced in using NLP methods with lower-resource languages.

VUKOSI: 08:42

Yeah, we take it for granted that, "Hey, I can go download the Google [inaudible] that comes from the crawl data and then just connect it to my pipeline and do some transfer learning." But I spent some of this morning working on a paper on Setswana and some models there. And then you go like, "Well, the model--

SUSAN: 09:04

What can happen, right?

VUKOSI: 09:05

--and that we've been building it and tuning it and all of those, and it takes a little bit more time and more thought. You can't just throw data at the problem. We just don't have that data. And even getting the data is not simply, "Oh, is it available?" So there's not as much that's available, but especially working in-- not in the West, and then also not working for a large company or university, you start noticing inequality in a way. So, one, asking for data from people. So let's say companies that have their websites in this language, you start noticing that people will say no and say that, "No. We don't want for you to use our data in this way." But then you see a paper that comes from Facebook or Google, and they say, "Oh, yeah, we just crawl the internet," right? And then you ask them, "But, wait. We have these laws, and I don't want to get sued, and I don't want to get an IP strike on this part." So you have to really go through asking, and I work with my legal at my university a lot on navigating these things. So you can see, there's these literal walls that stand in front of you. Then there's understanding the history of the languages themselves. I'm South African, so I come from a place where for a long period of time, our languages were literally seen as being second rate, our local languages. So there was not much development in the universities. The languages were not even used for university teaching, and they were not developed kind of in that way. Other languages were chosen for the country, given our history of apartheid, to say, "Yes, we're going to develop only these ones."
So now, if you're trying to play catch up, the amount of money that it takes to get back to that point becomes something of, "The money that we have currently as a country or the whole economy, do we spend it on developing these other, let's say, nine other languages that are in the country, which are official, or do we spend it on other things in the country? Hey, we have poverty. We have all these other things." So you can now see, that's a second issue that comes along in there. There was a paper by Masakhane collaborators, [inaudible], Laura, and Jade, and they had gone in and tried to characterize that. Some of the challenges they talked about: maybe [through?] access, these issues about IP. We've talked then about like, do you have money? The other one is like really also, if I do start working in the space, am I doing my own career a good service? Because you're working on something where nobody really is interested in publications in that area and as such, you want to default to, "Oh, I want to build GPT-3.5 instead because it gets this very major access." And then there's also making sure, as we're working in these spaces, that you become as connected to where the language is currently and not just doing things superficially. I'm working on Tswana right now because that's my mother's language. My father speaks Tsonga. I also know that. But there is not as much data in Tsonga as in Tswana, so I thought, "Okay, let me start here and deal with that one while I'm working on that." I think actually, let me see. Yeah, I too worked with a professor of theology at the University of Pretoria, and he gave me this book. It's a book of idioms in Tsonga and English, and it's not digitized [at all?].

SUSAN: 12:34

Can you tell us the title?

VUKOSI: 12:36

Oh, Vutlhari bya Vatsonga. So the wisdom of the Tsonga people, or the Tsonga-Shangana people. But what's interesting is that we, at the moment, don't have a digitized version of this. And that's one of the other things, is that you might have these books, but they're not digitized. So a lot of the research that we do in the group tends to take into account, "Hey, I don't have that much data. How do we deal with that? How do we build augmentation methods? How do we cleverly extend this data, or how do we find innovative ways to get more data? And how do we tune our models to work, even though there's not enough data, and then build on it? And how do we also help other people build up capability in actually getting more language data into their spaces?" So you'll find that, yeah, with a lot of the things we've been doing over the last few years, it's been on that, whether it's releasing-- I think we've got like Python libraries. This Wednesday, when we're recording, we are about to release the Masakhane web tool, which will be like a research-based translation service, almost like your Microsoft Translator, Google Translate, it's just that it will now be solely for African languages--

SUSAN: 13:43

That's beautiful.

VUKOSI: 13:43

--and you'll be able to translate at masakhane.io, because the Masakhane project came along as this big collaborative research project and the first task we took on as a community was on translation, and this was back two years ago. But last year, we then got a grant to my research group from Mozilla Open Source--

VUKOSI: 14:07

--Support to build like the front end, to take the models and make them available. And it is a research project, so we're not prime time yet. So you'll see mistakes. But it also allows for people to give feedback on the translations, and then from there, these can then go back to the researchers to then improve the translation models.

SUSAN: 14:28

And this is something that I thought was so interesting as I was reading through the Masakhane materials and hearing you tell this story is that because of the nature of the problem, it's such a participatory approach where you have so many different people involved in providing data and reviewing your model and providing that feedback. So how does that come about? How have you managed to gather so many people into that collaboration? I just think it's a really interesting approach.

VUKOSI: 14:52

Yeah. Yeah. So this is another story that starts differently. So you go back to 2016. A couple of us African researchers came along and thought like, "Hey, we should have a very big machine learning community. We think it's time. But how do you build kind of a community around machine learning and deep learning in such a way that we could shape it," right? Especially in the future, where it was going to go, instead of getting into this kind of situation again where people are importing technology and they don't really know what's going on inside. And that's where the idea of the Deep Learning Indaba started. We had our first Indaba in 2017 at my alma mater, Wits, and we had this event with like 300 people. The second one was double, 600 people, at Stellenbosch University, and then the third one was in Kenya, in Nairobi, also with 600-something people. But just before the Nairobi one, myself and a couple of people who had been looking more and more into natural language processing, we had already started having workshops at the conference. And for natural language processing specifically it was, "Hey, let's have an unconference before the main Indaba meet." And at that unconference, we split up into groups. Some people went into translation. I led a group on social media, kind of worked on things like language identification, all those things. But at the end of that unconference, it became clear that, "Hey, having something like Masakhane" - at that time, we didn't have a name for it - would be kind of the next step. And then people came along and worked on it. And then by the time they-- so that was like a week before the Indaba, before we all flew to Nairobi for a week. And yeah, even at the Indaba in 2019, there were two days of workshops on natural language processing as part of that week. And then we came back and then Masakhane natural language processing was born.
At that time, it was Masakhane translate because the first task that was taken was that, and then it brought in all these people. So it's still running, in terms of today. On Thursday evening, people meet up. I'm one of the chief investigators, and my group also provides support. And for example, last year we took on this task of building this web interface, on that part. But as you said, different people can assist in different spaces. There's people who are focused on getting new data. There's people who are focused on new tasks that they do on, but then it all is building onto this core, which is Masakhane NLP, on that part. I think one of the greatest things we got to, especially in 2020, and be around like-- I'm on the steering committee for the Lacuna Fund that is now trying to assist with providing money for people to create datasets. That's a big one, participatory paper. Then got the Wikimedia Research Award for 2020 as well. From different facets, we're making an impact on this part. Yeah, it's very interesting to see where we'll be in a few years as we grow, so yeah.

SUSAN: 17:44

Absolutely. It's an amazing project. I've looked through the website and read through a couple of the papers that y'all have produced and it's just really cool to see. I wonder if you could talk a little bit about some of the technical details and innovations that y'all have come up with, thinking about things like not being able to use transfer learning so readily and having to come up with creative ways of choosing your models or other approaches. What are some of the things that you're proud of in terms of those technical details?

VUKOSI: 18:10

Yeah, yeah. So I think two or three years back, we started looking at-- or in my group in this perspective, so yeah, I have to go-- Masakhane is big, so I can't speak for all of it. My group, in this space, started looking at augmentation and seeing, in a low-resource scenario, what does augmentation actually mean? My group was interested in augmentation, especially in low-resource scenarios. For listeners, if you think about augmentation in images, it's an easy way to visualize it and anybody can get that. So you've got an image dataset that has a thousand images. Let's say it's pictures of planes, Boeings. You have Airbuses and all those things. But to make it more robust, one of the things that you do is that you can rotate the pictures, you can copy them, and you can change the color, change some of the properties in the pictures so that now you go from a thousand images to maybe a million. It's just that you've done all these weird kind of different things. We actually have one student in my group, at the moment, who's working on identification of solar panels in South Africa from satellite images, and she does exactly the same thing. You've got the pictures, but then now you do these rotations, these adjustments where you warp them slightly and then you go to nine million images. But you really only had-- your dataset only had a thousand. So that allows that the machine learning algorithm that you built becomes more robust. It will now be more generalizable and you can put it out there. In lower-resource scenarios, this is a boon because now you don't need a billion documents when you're trying to train any natural language processing model. So how do you do augmentation in this case? So, one, is that, hey, you have a dictionary or a thesaurus. You can take a word and then replace it by another word in the thesaurus, right? That would be like, "Oh, it's great. All I need is a thesaurus."
So you need like-- in English, you can get this WordNet document and it has these semantic correlations between words and then you can use it on there. But the challenge with lower-resource languages is that doesn't exist.
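The thesaurus-based augmentation described above can be sketched in a few lines. The tiny thesaurus here is a hypothetical stand-in for a resource like WordNet, and the function name, the 50% replacement chance, and the example sentence are assumptions made for illustration rather than the group's actual implementation.

```python
import random

# Toy thesaurus standing in for a real lexical resource such as WordNet.
# Entries are illustrative only.
THESAURUS = {
    "quick": ["fast", "rapid"],
    "happy": ["glad", "joyful"],
}


def synonym_augment(sentence, thesaurus, n_variants=3, seed=0):
    """Create new sentences by swapping some words for thesaurus synonyms."""
    rng = random.Random(seed)
    words = sentence.split()
    variants = set()
    for _ in range(n_variants * 5):  # a few attempts; duplicates are dropped
        new_words = [
            # Replace a word with a random synonym about half of the time.
            rng.choice(thesaurus[w]) if w in thesaurus and rng.random() < 0.5 else w
            for w in words
        ]
        if new_words != words:
            variants.add(" ".join(new_words))
        if len(variants) >= n_variants:
            break
    return sorted(variants)


variants = synonym_augment("the quick dog is happy", THESAURUS)
for v in variants:
    print(v)
```

The whole approach depends on the thesaurus existing, which is exactly the limitation raised for lower-resource languages.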

SUSAN: 20:03

Right. There isn't a WordNet for every language.

VUKOSI: 20:05

Yeah, yeah. So that is not something that's available that you can just use for replacement on that part. So what we then looked at was to say, "Okay, if we can make very strong word embeddings and the word embeddings give you this semantic similarity, can we then use that as a good way?" So we then showed kind of some good experiments in that space, showing that you can kind of build in a lower-resource setting and use the transfer through augmentation to actually assist you. So initially we had done it in English and we reduced it to a lower-data setting, at least, in social media, in news, to a lower-resource setting, and then showed how that could actually do well. You can get close to what synonym replacement with a full WordNet with the synonyms would be, or these other methods that you can do. It's not completely contextual. So you don't look at the full structure of the sentence. You're just choosing some words and you're replacing them with synonyms. But now you choose the synonyms by using a word embedding on that part. You can improve that, which is some further work we're still trying to do, by then saying, "Oh, I'm going to use a language model like BERT," where you have it fit in that language, and then looking at the full context. So you look at the full sentence. You remove the words, and then you can just use like a masked language model to predict what the word that you've removed was supposed to be. So now you can go from 1 sentence to 10 sentences because you are removing different words. Right. Yeah. And what we ended up doing is then releasing a Python library, textaugment, with our work in that space, and then people are using it. So now you can go in and just do a pip install textaugment and then you have textaugment, and it has all these different augmentation models or approaches that you can use, and then you can also use it in your specific language as long as you give it specific parameters.
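The embedding-based replacement described above can be illustrated without any thesaurus: each known word is swapped for its nearest neighbour under cosine similarity. The three-dimensional vectors below are invented toy values, and the function names are hypothetical. This is a sketch of the general idea only and does not reproduce the textaugment library's API.

```python
import math

# Toy word embeddings with made-up values. Real embeddings would be trained
# on text in the target language and have many more dimensions.
EMBEDDINGS = {
    "good": [0.9, 0.1, 0.0],
    "great": [0.85, 0.15, 0.05],
    "bad": [-0.8, 0.2, 0.1],
    "movie": [0.0, 0.9, 0.3],
    "film": [0.05, 0.88, 0.35],
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_word(word, embeddings):
    """Return the most similar other word in the vocabulary."""
    scored = (
        (cosine(embeddings[word], vec), other)
        for other, vec in embeddings.items()
        if other != word
    )
    return max(scored)[1]


def embedding_augment(sentence, embeddings):
    """Produce one variant per known word by swapping in its nearest neighbour."""
    words = sentence.split()
    variants = []
    for i, w in enumerate(words):
        if w in embeddings:
            swapped = words[:i] + [nearest_word(w, embeddings)] + words[i + 1:]
            variants.append(" ".join(swapped))
    return variants


augmented = embedding_augment("a good movie", EMBEDDINGS)
print(augmented)
```

As noted above, this is not contextual: each word is replaced on its own, without looking at the rest of the sentence. A masked language model would instead predict the removed word from the full sentence.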
There's been some work with my collaborators in there to now have better WordNets for South African languages, for example. So now that was released, I think earlier this year,--

VUKOSI: 22:03

--which is great in translation. I've had then a couple of students also looking at then, "Hey, how do we then use augmentation in a translation setting? Do you do augmentation on the English or do you do it on the target, or when there's a lower-resource language? Where do you do it so that you actually make the translation model a little bit better?" So that's ongoing work. Hopefully, we'll get some good results.

SUSAN: 22:25

That's terrific.

VUKOSI: 22:25

And yeah. As time goes on, we've done a lot in gathering new data. So the project I was talking about was gathering new data, doing the annotation in the local language. Once the paper is out, it will come with the paper, the data, and any of the tools that were required to actually get to what we were able to do. In some ways, we're also very much big on making sure that the tooling is available openly, and the reason is that other people then can go on. We saw this with textaugment, like in determining how many people are using the library. We never planned for that. And--

SUSAN: 23:00

Wow. Wow.

VUKOSI: 23:01

And on my end and on my collaborators' end, and at the CSIR, which is our national lab, people talk about, "Hey, we're doing machine learning or we're doing natural language processing and we should release the data. We should release the software that we do." It's one thing to talk about it. It's a different thing to do it. And then on that part, and through going through these exercises for the last three years, now in the research group it's almost like a given. Students and members know, as you're wrapping it up, you make sure there's the data statement and you fill it in. If there's going to be notebooks that have to be out, make sure that they are available on the GitHub, on our group GitHub, and then they're documented on there. And then once the paper gets accepted, those things go with it. But to get to there, you have to kind of be doing that continuously, even with the Masakhane web tool, which is just the front end, and then there's infrastructure that will allow us to serve these JoeyNMT translation models. For the Masakhane community, we basically released the tool open-source about two weeks ago. "And now here's the website. But you can actually run this yourself. Here's all of the code. And here's how you add the languages. Oh, your language is not there, and you want to work on it, here's where you go to Masakhane and you can train your own model, and then add it to there. And once the score becomes over some threshold, then you can ask us to add it onto the translate tool, then it becomes available for everybody else to be able to try out and see." And you can even run an API. The system has an API that you can then serve yourself if you want and say, "Hey, I'm serving an Igbo translation service on my website, and then here's the API." All of that is available. But from the beginning, we were intentional that that's how we design things.

SUSAN: 24:51

Right, right. Yeah, it's really interesting. I certainly noticed as I was looking through some of your work, just this really strong emphasis on sharing and collaborating and making sure that those resources are available, and that is the path forward for the particular challenging problems that you're facing. So, yeah, super important. So in the vein of challenging problems that you're facing, there's some other really interesting work that you've done just around data for social good, kind of those social impact kinds of projects. Do you have one of those that you'd like to share with us?

VUKOSI: 25:21

Yeah, sure. There's been a couple. Things change over time. I think we took some time looking at kind of public education data and trying to look if you could use interpretable machine learning to identify factors that lead to good performance and that are done with South African data from a high school perspective and then also Sierra Leone data with one of my students. That was very interesting because, yes, again, you take some things for granted if you're in different places. South Africa comparatively has a very good national statistics office and some data is very good. It's easy to identify, "Hey, I'm looking for population data. I'm looking for health data. I'm looking for education data. How do I connect these three?" Or, "I saw in pandas you're doing some joins and things like that," but you can do it. And my student worked on things like that. But if you go to other countries, this becomes a little bit harder because do you have a good population data? Do you have good health data? Do you have good education data? Do you have good socio-economic data? And then when you run the interpretable models, you can then identify, oh, factor like, hey, a school you know has a very good cafeteria, tends to correlate very well with the performance being on there. We're not talking about causation at the moment, but then we're trying to then go and say, like, "This could be very useful for policymaking." And now, the thing is, how do you get to policymaking? How do you then collaborate with people in that space, in government, for them to understand these things? But we've done a lot of work on that. There's been a couple of papers of what it means to-- do you have the right data? Do you have the right people that you're talking to and got the paper? We've got the machine learning model. But then, in this case, that it then leads to people changing the way they do decision-making. The biggest project yet has been on COVID-19. 
It doesn't involve any modeling, but it involved us as a research group making sure that we created a data repository for South Africa, then made available COVID data for free and in an open way, right? Because even today, unfortunately, again, since the pandemic hit South Africa, we are still the only open repository that's there. So you can create your own pipeline for modeling where you just then basically point to a CSV on our repository and the rest of your pipeline can-- and the reason that we did that and focused specifically on that was we think there's people who have done this all their lives. They do modeling of pandemics and diseases. They just need to get the right data, and we need to package it in a way that has as little friction as possible. And we've worked on that for the last year and a bit, and that's brought on lots of other collaborations, people working again in a distributed way, making sure everything is available. And you can see then whether it's visualizations that are available to decision-makers or people who are doing modeling in their institutes and then using it to push for decision-making, or then the papers that come out from other groups that then say, "Oh, we use this data set." It also obviously makes it more reproducible, right? Because people then can say, "Oh, I'm going to use exactly your model on this data and then compare it against my model on the same data." In a way, also, it's given us more eyes. It's very nice that every few weeks you get an email saying, "Oh, I found an error on this line," or "Involve the data in this way, can you fix it?" And then I go in and then it gets fixed.
But yeah, that's been something of a revelation because for years, from 2015 or so, I've been in conversations with decision-makers in South Africa or people in more the political realm, saying like, "One of the challenges we're going to have in this space when you're thinking about machine learning or AI being used in general, is do we have the right infrastructure to build these tools on top of? And if you don't have that, you can't talk about AI. You can't talk about machine learning." And COVID, in some ways, internationally, showed the brittleness of the underlying systems. It doesn't matter what you can build on top of that, because the data infrastructure is so bad. And, yeah, it was like, "Oh, finally, now I can say that people understand." [laughter] Because people go like, "No, no, no. It's magic. AI, we'll solve it." [laughter] And you're like, "No."
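The idea above, that a modeling pipeline can simply point at a CSV in an open repository, might look like the following minimal sketch. The column names and numbers are invented for illustration and are not real case counts or the repository's actual schema. An in-memory string stands in for the hosted file so the example runs offline; in practice the same function could read from a downloaded copy of the CSV.

```python
import csv
import io

# Stand-in for a CSV file published in an open data repository.
# Column names and values are hypothetical, not real case data.
CSV_TEXT = """date,cumulative_cases
day-1,1
day-2,1
day-3,2
day-4,3
day-5,7
"""


def load_cumulative(source):
    """Read (date, cumulative count) rows from any file-like CSV source."""
    reader = csv.DictReader(source)
    return [(row["date"], int(row["cumulative_cases"])) for row in reader]


def daily_new_cases(series):
    """Convert cumulative counts into per-day increases."""
    daily = []
    previous = 0
    for date, total in series:
        daily.append((date, total - previous))
        previous = total
    return daily


series = load_cumulative(io.StringIO(CSV_TEXT))
daily = daily_new_cases(series)
for date, new_cases in daily:
    print(date, new_cases)
```

Because everyone reads the same file, two groups can run different models on identical inputs, which is the reproducibility benefit described above. It also means a reported error gets fixed once, at the source.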

SUSAN: 29:29

There's this one ingredient we have to have.

VUKOSI: 29:31

Yeah, these things where something is like policies and procedures that you have to have beforehand that allow you to build data products. So if you're a data scientist, this is sometimes a blind spot that we have because we think, "Oh, I have this tooling. I have my training in statistics. I have my training in computer science. I'm going to solve this problem." And you're like, "No." [laughter] When you get into the real world, we have to deal with this. So a lot of our research group at the University of Pretoria deals with these things as part of their PhDs, as part of their master's, as part of their research. It's you go in, and say like, "I might do this, but then I'm actually directly working with a stakeholder, right? I am not working in a vacuum where it really reduces me and everything to just a nice CSV file. I get a lot of funding on it and this is stuff that--" he's like, "No, no, no. In order to do this, it might take me a few more months than normal. I'm actually going to interact with them there." And then by doing that you get to really understand the problem in a deeper way. And you also are likely to come up with models or solutions that are really connected to people saying, "I want to pick this up and use it."

SUSAN: 30:40

Yeah, it's so interesting. I mean, I've talked to a number of data scientists for the podcast and other purposes. And of course, folks are like, "Oh, I have to clean data. Oh, it's just so terrible." But to think about well, no, actually you have to go get your data in the first place and build it from the ground up. I mean, it's a whole nother setup. So it's really interesting to hear about this issue of just getting it in the first place that you're contending with.

VUKOSI: 31:04

Yeah. And then in a lot of situations, typically what happens, I teach in the master's program for data science at the University of Pretoria. And one class I teach is our data science capstone, which takes all of the courses that the students might have done as core and puts them into a use case. And the use case is always with a partner who we believe might benefit from data science, and they also want to take this data science journey but they don't come from the CS department. They come from somewhere else in the university. And for a lot of the students, what they notice is that the partner will write the description and say, "This is what I would like to be done. This is the data that I have." But once they have their first few meetings with the partner, they notice that what's written on the paper and what actually they have is different, and then the students will gripe. They'll say, "My data is not as-- I thought I was going to get this." And this is exactly why we do this.

SUSAN: 31:58

Yeah, for sure. Great learning experience. [laughter]

VUKOSI: 32:03

Right, because the reality you're going to face wherever you go is that you're lucky that we have data. We definitely have data. But then once the reality actually goes in, now it's like, "Oh, how do I put together the machinery to get this into my format? Once I record that, is the question that we're asking that we need an answer to for the decision-maker actually possible from this data? Or are we going to have to find more data to connect this in? Or is it completely like a really bad specification of this problem," right? And those questions come in of like, "Yeah, before you go in and you throw any algorithm at this, let's actually spend time understanding the partner, understanding the data, understanding the assumptions that everybody was making before we go and spend hours and hours trying to do modeling."

SUSAN: 32:58

Absolutely. Yeah, the basic foundations have to be there before you go to that next step, for sure. Great experience for students. So you've kind of alluded to this already in some of the things you've discussed. But one of the things that I thought was really interesting in one of your papers was this issue of integrating societal factors, was the way you put it, into machine learning models and thinking about how those impact the actual structure and outcome of the models. Can you speak a little bit to that, what it would mean to incorporate societal factors into model building?

VUKOSI: 33:33

Yeah, thanks for that question. I think some of it we've talked about in just that how you specify the problem, it cannot be in a vacuum. I like sometimes collaborating with our informatics and information science professors because they take into account a lot of how does the system actually exist in the world? At the same time, there's also-- especially in thinking about ethics and bias, you need to kind of think about the more unintended consequences of your models, right? So this is important because you do sit in these conversations. As I said, the information power is asymmetrical. So people think you know machine learning and AI. So you're a magician. So you come in, you sit in a room, and they say, "Here's our problem." They nicely kind of write it down. And they say, "See here, you add your AI," and then [poof?] [laughter].

SUSAN: 34:26

And a sprinkle of AI and [all things could be--?].

VUKOSI: 34:27

Yeah, yeah, yeah, yeah. But then we have a solution, and then it's like, "Wait. But then we need to--" you, even as a designer or as a modeler have to start asking the right questions of saying, "Okay, what actually happens on a typical day? What's actually going on in the place where this thing is going to be used?" And typically then that requires you to actually immerse yourself in this. I like this thing of like journalists when sometimes they're doing these long stories, they'll say, "Hey, I want to do a ride-along and I'm going to spend a day or a week immersing myself in this space so that I can really understand what the heck is going on." And through that, you typically then find where this picture of-- and then you just pour AI into the cup and then everything works, that there might be unintended things that might happen here because you do that, and that's where then you adjust, right? Then you say, "Okay. No, no, no, no, no, no. We can do A and B, but then C, from what I have noted, this might be a problem. And can we talk a little bit more about C?" Let's say it's something like it's an automated decision that people will trust. And we know this. We've got enough work in psychology that looks at this that humans tend to-- something like to trust machines, even if the machine's spitting out rubbish, right? So we need to then say, "If this automated decision happens, it shouldn't be taken as that. Then how do we stop people from thinking that this is a 'do this' instead of it might be a recommendation. I have some uncertainty about this, and it might mean that you either add a layer that doesn't show that decision directly, but then allows it." That then requires a different skill set. Who are the people who can design whatever, the mechanism in order to actually present that result in such a way that people won't take it on?
And that's when you need to-- that you end up being in this, "Hey, I need multiple people on this team, not just the AI engineering." That's necessary to be able to build those pieces because if this was taken as it is, hey, half of this neighborhood would get citations for something that likely was a quirk because somewhere in the system, somebody was recording things because their system is from 1995 and the only way it used to get data was in this way. And they had to then say, "Oh, yeah, we have this new--" I'm just making up a use case here. "We have this new tax that we charge, but we couldn't write it in. So what we did was we added it in this weird way. A human, if they read it, would understand it." But then if a machine reads it, they'll just be like, "Oh, this person owes a lot of money." [laughter] And as such, they must get penalties from the municipality, kind of in a way. And you won't know that. You won't know that unless you immerse yourself in the space for a little bit, for some time. And even then, it is very superficial. You are just starting to scratch the surface because each of these systems that become data come from their own history, their own background, their own biases. And if you just automate that, you're going to make mistakes.

SUSAN: 37:31

Yeah, humans certainly still have their place, right?

VUKOSI: 37:34

Yeah. Well, we design the systems. And we also figure out ways to-- what is it? Have ingenuity in getting them to do things that they were not designed for.

SUSAN: 37:43

Yeah, absolutely. What are some of the things that you're excited about right now, in terms of technical innovations that you're working on, a project that you have coming up, something else that you're going to do in the near future that you're looking forward to?

VUKOSI: 37:57

It's been a crazy journey. We thought 2020 was a crazy year; I think 2021 will be even greater. So I don't even know what 2022 is going to be like. So there's work in NER, translation, obviously that's going on. I think a couple of the groups have gotten funding from Lacuna Fund, so I'm interested in seeing what comes out of that and making data available. And I think it's building that critical mass that there's a lot of researchers across the continent, and also they're collaborating globally with people. Really, we're getting to the point where the potential is unlimited, on that part. I don't want people to see us in my group and say, "Oh, those are like one of the few people who are doing it." We really want it to be as many people as possible just because it obviously raises the bar and keeps on raising it. In terms of how we all work together and collaborate, we end up choosing slices that we can take on at a specific time. But with something like Masakhane, it's got this kind of amplifying and magnifying opportunity that then every time I go in and I see some of the papers that are coming out of the collaborators and I wonder, "Wow. I never would have thought about this two years ago." For my group, yeah, we've been growing. I hope in the next few months and years, more PhD graduates, and then they go out and do awesome, awesome things where we're playing very hard to make sure that we're getting our work out there. We're sharing as much as possible, and then looking at other ways that we can take these innovations almost to market so that there might be now a few more. I think with the Mozilla Award, it was great to be able to then think of a system as opposed to just that we built the model, but now there's a model system. Now the next thing is how do you make systems then that might be semi-commercial, right?

SUSAN: 39:41

For sure. For sure.

VUKOSI: 39:41

And available to people, and then the resource question becomes less of an issue, because then you can get money from availing a service, and then from there gets more research as to then work on that service. So those are things that might be coming along. I think, for this year, the IndabaXs are coming for Deep Learning Indaba because I don't think we're going to have a physical meeting for our conference. But the smaller IndabaXs, which are more localized workshops, are coming. I think we're funding 23 this year, and we have, I think, three new countries that are going to be part of this across the continent. So if you go to deeplearningindaba.com and you go to IndabaX, throughout the year, there's going to be events on that. If you're interested in really connecting to the African continent and machine learning, that's, I think, the place to meet people. And yeah, the rest of the year is trying to get work done. This project has just been years in the making. I've started also seeing if we can get news, local-news data and then annotate it. And I think now we finally have it, four years later, and then now, I'm just trying to get the paper written so that I can finish it. It's a four-year journey. So, yeah, I think that's pretty much it.

SUSAN: 40:48

Yeah. Well, that's a great sampling of exciting things going on. So very cool. So one question that we always ask on the podcast, and I'll ask it to you now. This is our alternative hypothesis segment. The question is, what is something that people often think is true about data science or about being a data scientist, but that you have found to be incorrect?

VUKOSI: 41:09

That everybody knows what the term data science means. Yeah. And that it's actually the same as machine learning and AI. One of the things when I'm talking to technocrats and policymakers, I always have four slides, in the beginning, explaining all of those things.

SUSAN: 41:29

That's great. [laughter]

VUKOSI: 41:30

Yeah, yeah. And then, so that by the time you're discussing the rest of the stuff, like we've talked about now, they kind of understand what it is. People will nod and say, "Oh, yeah, data science." And then you go off and you say stuff and then you think like, "Yeah, they understand." Then you're like, "No. They have no idea what you do." [laughter]

SUSAN: 41:49

Yeah. I had a conversation recently with somebody who was like, "Data science, okay. Yeah, actually, I don't really know what that is." And they just sort of admitted it sheepishly, like, "Okay, I guess I have to ask for an actual definition." [laughter]

VUKOSI: 42:02

I used to use a lot of this because now I have a position as chair of data science. So I use the whole pretending data scientist thing before this position to allow people to disarm that part, to say, "Ask that question. And then I'll give you a definition that I believe is closer to what I do so that you can understand it in our interactions." Because I remember I interned once at Newton Inc., in New York, which is an edtech startup, which I think still exists, but not in the same way that it was. This was in 2013, on that part. And when I finished my PhD, I did this thing while standing around and talking to old mentors, and one of them was my data science manager at Newton. And then I said, "What is the biggest thing that you would give me as advice? I'm planning to go back to South Africa. I want to do data science. I want to build my own teams and things like that." And he said, "Language, in that you must come to a point where the things that you say to each other when most of the time you're talking to people who are not data scientists, is that the definitions are the same. Because you can't go down that road and do a lot of things for months or years and then find out that you actually did not have a common understanding, and all of that work was a waste."

SUSAN: 43:23

[music] Right. Yeah, that's a great point. And again, it gets back to those basic foundations right beneath all of this work. Thanks for listening to our Data Science Mixer chat with Vukosi Marivate. Join us on the Alteryx Community for this week's cocktail conversation to share your thoughts. Vukosi and I talked about how he and his team have made innovative leaps when faced with limitations, coming up with new strategies for successful NLP projects, even with the challenge of small amounts of data. Have you had a moment when a data science challenge inspired and motivated you towards some new innovation? Share that exciting moment with us. Post your thoughts in a comment directly on the episode page at community.alteryx.com/podcast or post on social media with the hashtag Data Science Mixer and tag Alteryx. Cheers.




This episode of Data Science Mixer was produced by Susan Currie Sivek (@SusanCS) and Maddie Johannsen (@MaddieJ).
Special thanks to Ian Stonehouse for the theme music track, and @TaraM for our album artwork.