Data Science Mixer

Tune in for data science and cocktails.
MaddieJ
Alteryx Community Team

Data science, satellite imagery, and ... penguins? Dr. Heather Lynch, professor at Stony Brook University, joins us to share how she uses data science to study penguins and other species in Antarctica, with surprising connections to business and other fields. 

 

 



Cocktail Conversation

 

What's your favorite example of how images have provided special insights in your own data science? Or alternatively, what's a way you'd like to use images and computer vision in the future? Maybe you have a creative source of images in mind or an innovative analytic method you're cooking up that you'd like to share.

 

Join the conversation by commenting below!

 



Transcript

 

Episode Transcription

SUSAN: 00:01

[music] Here is a list of words and phrases that appear in this Data Science Mixer interview. Penguin guano, quantitative finance, whales, remote sensing, studio art, transfer learning, TV show, pilot episode. Welcome to Data Science Mixer, a podcast featuring top experts in lively and informative conversations that will change the way you do data science. I’m Susan Currie Sivek, Senior Data Science Journalist for the Alteryx Community. So how are all those words and ideas connected? That wasn’t just me doing free association, I promise. Today’s guest is Dr. Heather Lynch, and her fascinating work in ecology and data science actually does involve everything I just listed. If you like animals, if you are interested in creative uses of computer vision and publicly shared photos, if you’re curious about how satellite imagery can be used to study distant phenomena, or if you just want to be among the first to hear about the perfect TV show for the data science audience, this episode is for you. I promise it will get you thinking in surprising new ways about familiar ideas. And yes, there are penguins, too. Let’s meet Heather and get right into it.

HEATHER: 01:21

So my name is Heather Lynch. I’m the IACS Endowed Chair for Ecology and Evolution at Stony Brook University, which is a huge mouthful. And I use the pronouns she and her.

SUSAN: 01:31

Awesome. And what is IACS?

HEATHER: 01:34

Good question. It’s the Institute for Advanced Computational Sciences here at Stony Brook. So it’s an interdisciplinary institute that sort of munches together computer scientists and linguists and ecologists and material scientists. Anybody that uses computers and does computational research. So it’s kind of perfect for me since I don’t fit neatly in any one box. So it’s perfect for that kind of interdisciplinary work.

SUSAN: 02:00

Awesome. Yeah. I would say that’s probably a characteristic of most of our guests on Data Science Mixer that we don’t fit neatly into boxes. So we love that around here. Very cool. Great. Well, thank you so much for that info. So could you tell us a little bit about what your main research focus is right now? I know you’re working on all kinds of awesome things and maybe highlighting some of the main data science aspects of that work.

HEATHER: 02:23

Sure. So the big question that we work on in my lab is we are trying to figure out how many animals there are in the Antarctic, how their populations are changing over time, and why that is. And so we primarily focus on penguins. So Antarctic penguins are our bread and butter, but we work on other seabirds. We work on seals and whales. And this seems very far removed from data science, except that, first of all, essentially, we’re just dealing with counts. We get a whole bunch of data on how many penguins are at a colony over time. And these data are patchy and they are very limited. So some sites may have two counts over four decades. Some sites might be right next to a research station and we have very good data. And so when I became involved in this as a postdoc, this was perfect because these data sets are something that only a mother could love, in a sense. I kind of joke that it needed somebody who saw this as a feature, not a bug of this research, which is that it needed somebody with some real quantitative skills, which is what I was able to bring to the table. So if this penguin data were easier to work with, I wouldn’t get to work with it, so I’m kind of glad that it’s very difficult.

HEATHER: 03:34

So that’s one angle is just, not simple but pretty straightforward data science as to how are these populations changing? And you’re trying to put together some really scattered information. But the other big branch of our work is actually using satellites to survey these animals. And that gets us into computer vision and how we can build sort of AI tools to search for whales and seals and penguins from space. And how do we munch that together with the data that we have from our fieldwork? And so there’s this whole other element of it, which is applying all the tools of computer vision to this problem of how many penguins are there in Antarctica. So there are a couple of different aspects here that really touch on data science. And as a result, the graduate students in my lab come from five different graduate programs. So we have people in the lab that come from applied math, or they come from computer science or electrical engineering or marine science or ecology and evolution. And they’re all working together on these problems to figure out what’s happening to Antarctica’s penguins and what might climate change bring in the future.

SUSAN: 04:40

Wow. That’s fantastic. It’s kind of, again, that theme of not fitting into the box there, but needing all of those different skill sets and different backgrounds to deal with these tricky questions. That’s terrific. And I’m sure that a lot of the folks who are listening to this, who are maybe in business and other areas are identifying with some of the things you’re saying about sparse data, difficult to deal with data, but they don’t, unfortunately, get to study penguins with it, maybe. [laughter]

HEATHER: 05:01

It’s funny. If we think of all the STEM areas as being like a surface, like an energy surface, it’s really flat for me. And it’s like there are no areas of science that I really like better than other areas of science. And I went to a science and technology magnet school. And so I focused on chemistry. And it was natural that when I went to college, I would major in chemical engineering. And then I really fell in love with physics. Again, sort of sliding around this energy surface. So I transferred from the engineering school into the arts and sciences, so I could major in physics. And I finished in physics and I went to graduate school for physics. And sort of three years into my PhD realized that, from my perspective, and this is ever more true now, the environment was sort of falling apart. And it felt really important to work on those problems. And there was clearly a niche for people that came with strong quantitative skills. So I made the very hard decision to leave the physics program. As anyone who’s been in graduate school can imagine, it’s very hard to just walk away from your department and right into another department. So I made a very rare lateral transfer. I went from being a fourth-year graduate student in physics to a fourth-year graduate student in biology with essentially no biology under my wing at all. That I managed to con the graduate school into that, I’m forever grateful.

HEATHER: 06:25

But I went ahead and put together a PhD which leveraged my strengths, which is probably just good advice to anybody. So it was a very narrowly tailored question about the interaction between insect outbreaks and forest fires, but it was using data that already existed. It needed some math thrown at it. But as it turns out, there’s a real niche for people who like the numbers part. Whose interest is in the ugly, messy data that makes other people want to pull their hair out. If that kind of challenge appeals to you, then it turns out there’s a lot of really exciting research questions. And it was really that I was coming from that experience of working with really messy data that it wasn’t so daunting to take on this penguin work that I continue to this day because dealing with messy data became my hallmark. So I now run a big research lab, and I have graduate students that very much come in as seabird biologists. They are sort of native seabird biologists. But that’s sort of never been the angle that I took. And so I kind of joke that when we’re in the field, we count penguins. We do it one by one. It’s not rocket science. We literally just count them. So my students or sometimes I’ll be with collaborators that’ll be like, “Oh, look, the chick is hatching out of the nest.” And I’m like, “9, 10, 11, 12.”

SUSAN: 07:48

Another one to count. [laughter]

HEATHER: 07:49

I can stay very focused. Another one to count. It’s wonderful and amazing, of course. But the part of it that I like the most is coming back with a big hard drive of data and seeing what we have. I like being out in the field, but people do that better than I can. It’s not my comparative advantage. And at the same time, really getting to dive into the numbers. When we get back, it’s like Christmas every year.

SUSAN: 08:14

Oh, yes. Yeah. I think many of us feel that. It’s like, “Oh, fresh data!” Very exciting.

HEATHER: 08:18

I know. You never know what you’ll find. I’ll share an anecdote that I’m sure many of your listeners can relate to. When I originally took on this project as a postdoc, the data were being kept in an Excel. And so it was this big Excel spreadsheet of thousands of rows of colony names and counts and column after column of data. And the year that I inherited this Excel spreadsheet, I discovered that the year before somebody had sorted one of the columns independent of all of the other columns in the spreadsheet. That question that you get, “Do you want to expand the columns?” And they said, “No.” And so, I set about having to do this forensic reconstruction of this database, which took the better part of three years, going back to the original field notebooks and literally asking people, “Oh, you were at the site. It was Christmas Eve. Maybe that helps. You said it was raining. Do you remember?” And so after I spent three years reconstructing this database and ultimately moving it into something more appropriate, I decided that I had to do this for the rest of my career because I dedicated so much time to setting things right again. But it remains sort of the nightmare of the data scientist: that a column will become scrambled relative to the others, and you’ll have to start from scratch.
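A minimal sketch of the spreadsheet failure described above, using only the Python standard library. The colony names and counts are hypothetical, and the helper names are invented for illustration. It shows how sorting one column on its own breaks the pairing between a colony and its count, while sorting whole rows preserves it, and how an order-independent comparison of the rows can flag the damage.

```python
# Hypothetical colony names and counts, for illustration only.
colonies = ["Site A", "Site B", "Site C"]
counts = [120, 45, 300]


def sort_one_column(names, values):
    """Sort the values alone (the mistake): names stay put, pairing is lost."""
    return list(zip(names, sorted(values)))


def sort_whole_rows(names, values):
    """Sort complete rows by value: each name travels with its own count."""
    return sorted(zip(names, values), key=lambda row: row[1])


def row_fingerprint(rows):
    """Order-independent fingerprint of the (name, count) pairs."""
    return frozenset(rows)


original = list(zip(colonies, counts))
scrambled = sort_one_column(colonies, counts)
safe = sort_whole_rows(colonies, counts)

# The safe sort keeps the same set of pairs; the scrambled one does not.
print(row_fingerprint(safe) == row_fingerprint(original))
print(row_fingerprint(scrambled) == row_fingerprint(original))
```

Comparing a fingerprint of the rows before and after any reordering is one cheap check that could sit alongside version control. It is an illustration, not a description of the lab's actual safeguards.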

SUSAN: 09:43

Oh, yes. That is a nightmare. Oh my gosh. I’m sure we’ve got people breaking out in cold sweats right now, just hearing that. [laughter] One fateful click.

HEATHER: 09:52

I know. I will say, rest assured now everything is fully version controlled and there’s checks and balances to every last little change so that that mistake will never be made again. But it was an important lesson early in my career. And it wasn’t even my mistake, but just having to pick up the pieces, I think, taught me a lot about what it means to do reproducible research.

SUSAN: 10:12

Absolutely. Well, it’s so interesting to hear the background on this. And your penguin research is actually how I heard about your work, to begin with, attending a webinar where you were giving a talk on it. And I’d love to hear a little bit more about that. The project where you were looking at penguins’ nesting areas and using drone and satellite imagery to help identify where they were and how many, of course, and so forth.

HEATHER: 10:36

Yeah. So one of the things is that when I was doing my PhD work, I was looking at insect outbreaks. And what I realized or whatever, I probably wasn’t the first to have this thought, but that we could see insect outbreaks from satellite imagery because they turned the trees bright red. And that was one of the ways that I was able to track insect outbreaks. Now I go to study penguins and from my perspective, I think I’ve left all this remote sensing behind. But it turns out that we can see penguin guano, which is the excrement that the penguins leave right in the vicinity of their nest. We can see that from space, too. So just like we can’t see insects, individual insects from imagery, but we can see the effect of insects, we can’t see individual penguins in imagery, but we can see the outline of their colonies. And penguins want to be exactly one pecking distance away from their neighbor, which means that we can estimate the density of their nesting. So if we have an area, we can estimate how many individuals were nesting in that area. And so we don’t have to be able to see individual penguins in order to get a good estimate of their abundance.
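The area-times-density reasoning above can be written as a short calculation. This is a sketch under stated assumptions: the square-grid nest layout, the 0.8 m spacing, and the 5,000 square metre guano area are all hypothetical values chosen for illustration, not figures from the interview. In practice the density would presumably be calibrated against field counts.

```python
def nests_per_square_metre(spacing_m):
    """Density if nests sit on a square grid, one spacing apart (an assumption)."""
    return 1.0 / (spacing_m ** 2)


def estimate_nests(colony_area_m2, spacing_m):
    """Abundance estimate = mapped guano area x assumed nest density."""
    return colony_area_m2 * nests_per_square_metre(spacing_m)


# Hypothetical numbers: a 5,000 m^2 guano stain and 0.8 m between nests.
estimate = estimate_nests(5000.0, 0.8)
print(round(estimate))
```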

HEATHER: 11:38

And so once you figure out that you can use satellites to look for penguins, the whole world opens up in terms of your ability to monitor them. Because no longer are you tied to those very few places on the continent that you can actually get to by plane or ship, you can actually just start scanning the Antarctic continent for penguin guano, and you can monitor all of Antarctica’s penguins over time. And so that really opened things up for us. But for a long time, almost a decade, we were manually literally drawing polygons around guano. Manual annotation of all this imagery. And that doesn’t scale because there are just only so many people on the planet that know what guano looks like in imagery. You can’t Mechanical Turk your way out of this problem. So you have to find some way of training a computer to do it. And that’s what we’ve been working on for quite some time. It turns out to be a surprisingly difficult computer vision problem, but we haven’t given up. We’ve made some progress.

HEATHER: 12:37

But drones is another big piece of the puzzle. So increasingly, of course, drones are now exploding in their use in sort of environmental applications and ecology. But when we get down there, the ability that drones give us to survey very large areas in a short period of time is pretty unprecedented. And so when we’re trying to survey a colony that might have a half a million penguins, a drone will allow us to do that in a couple of hours. And if we do that counting one by one, the way that we always have in the past, it might require four or five days. So we are sort of linking together what we’re individually counting one by one with what we can do from drones with what we can do from satellites. And we’re kind of eventually working our way towards a monitoring system that actually allows us to sort of have an early warning system if these populations start to change unexpectedly.

HEATHER: 13:32

I sort of dream of the day that we will have sort of penguin forecasts the way that we do have traffic forecasts. So I’m on Long Island. I would not drive into Manhattan without looking at my phone and seeing where the traffic slowdowns are. It’s doing this predictive algorithm to figure out what the traffic might be. It’s conceivable that we’ll get there with monitoring penguins, for example, where we will know exactly-- we’ll have streaming data sets for how many penguins there are at all of the hundreds of colonies and what our models say is forecasted for the next year. And if it’s wildly different, then we can investigate further. But it will allow us to focus on the causes of these changes that we see rather than just focusing on the mechanics of counting them, which is what we’ve been doing. And not me alone, obviously, but collaboration of people over the globe, 40 or 50 years of counting penguins. And that has sort of sucked all the energy out of the community because that’s been the focus. And now, we can say, “Okay. We don’t have to worry so much about the counting. We can really focus all our energy on understanding the causes of those changes and then hopefully we can obviously prevent the declines that we’re quite worried about.”

SUSAN: 14:48

Sure. Absolutely. Can you tell me a little more about the forecasting and what you see as the goal of prediction in that process?

HEATHER: 14:57

Yeah. That’s a great question. In an ideal world, we would have a forecasting algorithm that would allow us to say maybe that there was forecasted changes to the climate or the weather on very short timescales that would be detrimental to penguins and that we could just turn krill fishing, which is where we’re fishing the krill out of the ocean that the penguins depend on, that we could turn that down and that we could have a more responsive management of krill. Because if the penguins need krill to survive, we want to take only as much krill as we can before we have an impact on the penguins. And forecasts are one way that we can try and predict how much krill we need to leave in the oceans for the penguins. Right now, it turns out that penguin population dynamics are unbelievably difficult to forecast accurately. I was on a call last week and I said, only sort of partially kidding, “I’ve never gotten anything to correlate with anything in my entire career.” Because we set a very high bar for ourselves that we want to be able to say, “How many penguins are there going to be next year?” And actually, do a good job [with?] that.

HEATHER: 16:07

And the problem is that the penguin populations are fluctuating wildly for reasons that we don’t understand. And while we understand their long-term trends, we would like to understand these short-term fluctuations. And so the analogy that I give to some people that might be doing a similar kind of work in the financial sector is we can average over all those fluctuations by looking at larger regions, which would be like an index fund. So we can say, “Well, how are the population dynamics changing over the entire Antarctic Peninsula?” And we can average out a lot of the noise and we can see the signal. We can see, “Oh, their populations are declining or their populations are increasing.” But the detailed fluctuations of those individual populations, it’s highly stochastic. It’s very difficult to forecast, which doesn’t mean we don’t keep trying. We did, at some point-- this was several years ago. We were having difficulty, as we continue to do, making good short-term forecasts. And we said, “Well, maybe we’re just bad modelers. That’s possible. So why don’t we put these data together and make a data science competition out of it? And can we engage the data science community?”
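The index-fund analogy can be illustrated with a small simulation. This is a sketch with made-up parameters (a shared 2% annual decline, large independent noise at each site, 100 colonies, 40 years) rather than real penguin data. It shows that the year-to-year spread of the regional average is much smaller than the spread at any single colony, which is why the regional trend is visible even when individual sites look erratic.

```python
import random
import statistics


def simulate_colony(years, trend, noise_sd, rng):
    """One colony's yearly growth rates: shared trend plus independent noise."""
    return [trend + rng.gauss(0.0, noise_sd) for _ in range(years)]


def regional_average(colonies):
    """Average growth rate across colonies for each year (the 'index fund')."""
    return [statistics.mean(year) for year in zip(*colonies)]


rng = random.Random(0)           # fixed seed so the sketch is repeatable
trend, noise_sd = -0.02, 0.30    # hypothetical: 2% decline, large site noise
colonies = [simulate_colony(40, trend, noise_sd, rng) for _ in range(100)]

single_sd = statistics.stdev(colonies[0])
regional_sd = statistics.stdev(regional_average(colonies))
print(regional_sd < single_sd)
```

The simulation assumes the noise at each site is independent. Any environmental driver shared across colonies would not average out this way.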

SUSAN: 17:16

Yeah. Super cool.

HEATHER: 17:17

Yeah. And so it’s like, “Maybe all of our knowledge about penguins actually isn’t helping us make good short-term forecasts.” So we put together this data science competition. We had over 660 models, I believe, submitted. We had prize money for the best forecast. We actually had a prize for an out-of-sample forecast. So it would be, “How many penguins are there next year?” So there was a year lag to the prize money because we would have to go down and count. And of the 660 submitted models, our model that we had worked on came in, I think, third or fourth. So on the one hand, it means that we’re not actually terrible modelers. It means that our poor predictive performance is not blown away by some data scientists who can look at this with fresh eyes. But on the other hand, all of our domain knowledge about penguins isn’t helping. Standard forecasting time series techniques, one of them used the forecast package that I think was developed by Facebook. Standard time series techniques actually do a pretty decent job and understanding something about penguins actually doesn’t add much to the puzzle when we’re talking about a one-year out forecast or a two- or three-year out forecast. So it was both encouraging and humbling at the same time. But it was great to see the data science community assemble around a data set that we live and breathe and wake up every day and think about. And to get a whole bunch of other people really excited about that was a lot of fun.

SUSAN: 18:40

Yeah. Absolutely. Yeah. That was something else that I’d wanted to ask you about and how the results had come out for you. How was the one-year out prediction? Has that come to fruition yet?

HEATHER: 18:51

Yeah. So we had a winner. I can’t remember which of our-- we ended up writing a paper on this and the winning models were co-authors on that paper. So we did have a winner. I will say we learned some of the challenges of running a data science competition on this kind of time series data. And one is that it’s-- we have very small datasets. So there’s a lot of focus on big data. If you’ve got 1 million records or 2 million records. There’s relatively little focus in the data science community that I see on smaller data sets. So there wasn’t as much interest. It took us some time to find a host for this data science competition because it didn’t look like the kind of big data competitions that they’re used to hosting. And this was what would be a huge ecological data set, but a very small data science dataset. So that was one challenge. And the other is when you’re running these data science competitions, we had to scrub our own websites of some of the data that we wanted to use for testing because there are people that will screenshot the internet. So in some sense, if that fact has been on the internet in the past, there are ways to game the system so that your model will predict that number. And if we had known that ahead of time, we could have withheld that. I don’t think it was a problem, but we were like, “Of course. The Wayback Machine will tell people how many penguins there were at this site because we had published that online before.” So there were some things that when other ecologists are asking for advice that I give them and one of them is making sure that you keep enough in your back pocket in terms of a good test of these models that hasn’t already appeared online in some format.
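A minimal sketch of the out-of-sample setup described above: the most recent year is held back as the test, and a forecast is scored only against that withheld count. The records are hypothetical, and the naive "same as last year" forecast is a stand-in baseline, not one of the competition's models.

```python
def split_by_year(records, first_test_year):
    """Train on years before the cutoff, test on the cutoff year onward."""
    train = [r for r in records if r[0] < first_test_year]
    test = [r for r in records if r[0] >= first_test_year]
    return train, test


def naive_forecast(train):
    """Baseline: predict that next year's count equals the last observed count."""
    return max(train, key=lambda r: r[0])[1]


def absolute_error(prediction, actual):
    """Size of the miss, ignoring direction."""
    return abs(prediction - actual)


# Hypothetical (year, nest count) records for one site.
records = [(2015, 1200), (2016, 1150), (2017, 1300), (2018, 1100), (2019, 1180)]
train, test = split_by_year(records, 2019)
error = absolute_error(naive_forecast(train), test[0][1])
print(error)
```

The split only protects the test if the held-back counts have never been published, which is the point made above about keeping data in your back pocket.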

SUSAN: 20:33

Makes a lot of sense. I just think the whole thing is such a creative and fun idea, opening it up to the entire community and seeing what people come up with and what might be of use. So it’s a really cool approach.

HEATHER: 20:44

Yep. And just the range of expertise of the people that contributed. We had someone that worked for a water company. We had a physicist from Oxford. We had people across the entire range. And I think it speaks to how data science, in many ways, there’s a common language, even though people in their day jobs are working on very sort of disparate applications.

SUSAN: 21:04

Absolutely. And I noticed too, I think you have a citizen science kind of project as well affiliated with the penguin research that people could participate in?

HEATHER: 21:13

Yeah. Absolutely. So we have the section of our website called Be A Penguin Detective. And so what we do is we teach people on the website what to look for when they’re looking for penguin guano. And they can just go to Google Earth and load in the file that shows where all the penguin colonies are that we know about. And if they find one that we don’t know about, then we’ll track it down. So it’s funny you should ask because just yesterday I was going through all the leads, I sort of think of it as the tip line, and writing back to people that had written us from all over the world and responding to them. So in some cases, I can look at that and say, “Okay. It looks like penguin guano, but it’s actually algae that grows in the snow.” In other cases, though, there’s a colony, for example, that I think is a new unknown Gentoo penguin colony that we need to go investigate. So that’s really exciting. I think they really did find-- there were actually two people that wrote me about that location. Independently found that.

SUSAN: 22:10

That’s so cool.

HEATHER: 22:11

We’ve had citizen scientists spying new emperor colonies. We had a woman who was recovering from knee surgery who spent, I think, five weeks looking for penguins and helped me understand how emperor penguin colonies were moving. When the sea ice moves, the emperor penguin colonies move with the sea ice. And there’s some really interesting dynamics there. So I’ve met people from all over the world that I’ve communicated with and entire classrooms. I went to a school closer to New York City that the fifth grade had dedicated the whole year to this kind of penguin project. And so then I could go and talk to them about that, and they were really into it because they’d spent so much time looking for penguins in satellite imagery. So it was a lot of fun. [music]

SUSAN: 22:51

That is amazing. I love that. We’ll hear more about penguins and image recognition challenges that Heather’s team has faced in a moment. But first, I wanted to give you a quick heads-up about some other cool podcast stuff you’ll want to check out. So you might have heard in our previous episodes that Alteryx has released open-source Python packages for data science that you should definitely check out. Our Alteryx open-source team recently popped up on a couple of other podcasts. And if you’re curious about their work, you should definitely check these out. I mean, you obviously like podcasts, right? And I’m guessing you’re into data science and probably Python too. So over on the Real Python podcast, they had a great discussion of an article on how to troubleshoot memory problems in Python that was written by one of the Alteryx team. That article was published on the Alteryx Community as well. But if you want to hear all about it, plus get some other useful tips from the Real Python folks, definitely check out episode 68 of the Real Python Podcast.

SUSAN: 23:51

Two Alteryx folks, Angela Lin and Jeremy Shih, also did a deep dive into EvalML, the automated machine learning package from Alteryx open source, on a recent episode of Podcast.__init__. They explained how EvalML lets data scientists spend more time on the more complex and valuable parts of their work. It also helps folks who aren’t experts get going with machine learning more quickly. If you’re AutoML curious, definitely check this one out. That’s in episode 329 of Podcast.__init__. We’ll get both of those linked in the show notes for you as well. Be sure to check out these podcasts and learn more about Alteryx open-source tools that could make your data science projects more efficient and effective. And now, let’s get back to wildlife and the potential for art to play a role in training models, and TV pilots, and all the other amazing things we haven’t yet explored with Heather. So on this note of kind of bringing in the wisdom of the crowd and the larger community, I was also really interested in another project that you worked on recently. The paper that I looked at was Social Sensors for Wildlife. Bringing in tourist photos and imagery into your analyses. Can you tell us a little bit about that?

HEATHER: 25:02

Absolutely.

SUSAN: 25:03

I thought that was super cool.

HEATHER: 25:04

Sure. So we work primarily off commercial cruise ships in the Antarctic. And most people don’t think of Antarctica as being a hotbed of tourism, but pre-COVID, there were on the order of 68,000 people that went to Antarctica on vacation. So we’re talking 40 or 50 different vessels carrying passengers down to the Antarctic. And that’s how we access the Antarctic. We work for cruise ships. So you would not believe, or maybe you would, but just the amount of camera technology that tourists to the Antarctic are carrying with them and they have technology that we can’t afford. And we’re so busy catching or counting penguins that we couldn’t possibly take the kind of photos that they do. So the tourists in Antarctica are capturing all sorts of interesting behavior. But in particular, we’re interested in their ability to capture seals. So there’s a couple of different projects that were related to this. In one, we were building a catalog of seals that would allow us to track seals through time. They’re individually identifiable from the patches on their stomachs essentially. And so one tourist might post a photograph of a seal on, let’s say, November 7th, at one location, and we could match that to a photograph that somebody else had captured somewhere else four months later or the next year. And so like they do with whale tails, we can track individuals through time and see how they’re moving around the Antarctic.

HEATHER: 26:37

Another project that we had also using photographs was trying to map out the distribution of animals through time. So people post pictures of different seal species on the internet. We can pull the metadata off that and say, “Okay. Well, that was at that location. We have a latitude, a longitude. We have a date.” And we can build up a picture of how these different species are moving throughout the Antarctic over time. So there’s a number of different ways that we can use citizen scientists. And I think what was really exciting about these projects here is that it didn’t require people to sign up in some sense to say-- we did have some people that said, “Hey, let me focus on taking pictures of these seals.” But if people post them online, we can actually pull that metadata off the photographs from a much larger collection of people than would necessarily volunteer to help with a research study. So there’s a lot of sort of passive information that’s posted to the internet that we can use to answer some of these interesting ecological questions that’s much bigger in scope than with a more traditional citizen science project, where you have a team of volunteers that are specifically working on that project with you.
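Turning photo metadata into a sighting record might look something like the sketch below. Camera metadata commonly stores a position as degrees, minutes, and seconds plus a hemisphere letter, so the conversion to signed decimal degrees is the key step. The dictionary keys, the example coordinates, and the function names are hypothetical. A real pipeline would first extract these fields from the image file with an image library, which is not shown here.

```python
def dms_to_decimal(degrees, minutes, seconds, ref):
    """Convert degrees/minutes/seconds plus a hemisphere letter to signed decimal degrees."""
    value = degrees + minutes / 60.0 + seconds / 3600.0
    return -value if ref in ("S", "W") else value


def sighting_from_metadata(meta):
    """Build a (date, lat, lon) record from a dict of already-extracted photo metadata."""
    lat = dms_to_decimal(*meta["lat_dms"], meta["lat_ref"])
    lon = dms_to_decimal(*meta["lon_dms"], meta["lon_ref"])
    return meta["date"], lat, lon


# Hypothetical metadata; the dict keys are illustrative, not a real library's schema.
photo = {
    "date": "2019-11-07",
    "lat_dms": (64, 49, 30.0), "lat_ref": "S",
    "lon_dms": (62, 51, 0.0), "lon_ref": "W",
}
date, lat, lon = sighting_from_metadata(photo)
print(date, lat, lon)
```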

SUSAN: 27:51

Right. Right. Yeah. The phrase I put in my notes here was passive eco-social sensors. Thought that was a nice way of putting it.

HEATHER: 27:59

Yeah. It’s funny. We struggled with the title, I mean, for weeks and weeks. And the idea of-- [laughter] the idea of a social sensor was something that has been used in other contexts and so we added the eco part onto that. But I think a growing awareness that as we know throughout all domains of life, that people with cameras that is a very powerful tool. And so we can use that to capture pictures of a straw up a sea turtle’s nose or trash on a beach. I think those are very powerful, but there’s actually some sort of quote-unquote, “Hard-hitting science questions,” that we can answer as well because people are so, I think we use the phrase, camera phone ubiquity. In the age of camera phone ubiquity, they’re everywhere. And so I think we’re only just starting to figure out how to tap into all of those data. So there’s both this huge power and then also we’re sensitive to the fact there are privacy concerns as well.

SUSAN: 28:52

Sure. Absolutely. I was interested too, in that paper, y’all talked a little bit about some of the challenges of using transfer learning. And I believe it was ResNet that you were working with in your analysis of those photos. Can you talk a little bit about what those challenges were and how you dealt with that?

HEATHER: 29:07

Yeah. So this kind of gets to our attempts to build computer vision models to find penguin guano in high-resolution satellite imagery. That is a very hard problem because we have very small training datasets. And this is sort of the theme of all of our work, which is, if we had 20,000 cat videos or 200,000 cat videos, we could train a model to find cats. But if what we’re looking at are a couple of dozen penguin colonies, or even worse, whales - and we can talk about that; whales present other challenges - we need to find some ways of bootstrapping those models. And so one way that we have thought to do that is from taking what we know about finding guano in other sensors that are much lower resolution and figuring out how to take what we call sort of weak training data and apply that to train models for the high-resolution data.

HEATHER: 29:59

The other strategy that we’ve used is to take advantage of the fact that penguin colonies don’t move. In fact, a lot of things don’t move. Buildings don’t move. Geological landforms don’t move. So what we can say is, “Okay. If we have an image that we’ve annotated in 2010, the penguin guano in 2011, it’s not going to be identical, but it’s going to be very similar.” So how do we use what we learned about the location of the guano in 2010 to help us classify the 2011 image? So that would almost be easy, except that when you’re a satellite and you’re staring down at the earth, you are taking a two-dimensional image of what is a three-dimension-- you’ve projected a three-dimensional landscape onto two dimensions. And so you have to model the terrain because you have to account for the mountains and valleys that are on the ground actually warping this image. And so depending on the angle that the satellite is taking a picture of the earth, your models are going to-- you are having to do this kind of an abstract warping of the image.

HEATHER: 31:04

And so when you’re looking at multiple images - say a 2010 image, a 2011 image, a 2012 image - they all have slightly different warping, and so they don’t align pixel to pixel. And so that was the computer vision challenge, is how do you avail yourself of that information without boxing yourself in? Because you know that there are these artifacts that are imposed by the terrain. And so the analogy that I would use is in a medical imaging context, let’s say you’re trying to train a classifier to find a kidney. The kidney is in roughly the same spot every time, but every patient’s kidney is going to be in a slightly different place and it’s going to be mushed in a slightly different way. And so how do you take what you’ve learned about where Mike’s kidney is to figure out where Susan’s kidney is. So it’s a similar kind of problem. And so we’ve been very interested in that kind of transfer learning where you’re taking data that’s informative and a human would be able to look at Mike’s kidney, a doctor could look at Mike’s kidney, and be like, “Oh, I know where kidneys are,” and you would have no problem finding it in Susan. But actually, getting a computer to do that kind of close enough approximation is much harder.

SUSAN: 32:14

Yeah. Yeah. That does sound very tricky. So are you doing some sort of transformation to the image to accommodate that, or?

HEATHER: 32:22

So what we’ve essentially done is that we take the-- let’s say we’re trying to use the 2010 image to help us classify the 2011 image. One way to do that is to, essentially, blur it out. So to say, “Okay. Well, we know the guano is approximately in this location.” And so that gets sort of folded into the model for 2011, where we’re not being very specific about where it was in 2010, but you’re like, “It’s vaguely around here.” But another trick that we can play is that we can enforce constraints to say, “Well, I’m not going to tell you where the guano was in 2011, but I know that it has approximately this total area.” And so instead of being actually focused on the geographic location, you could just pull out the area and say, “You’re looking for an object which is about 8,000 square meters.” And so it can penalize essentially shapes that aren’t in the vicinity of 8,000 meters. So that’s another game that we can play. But I would say that we’re still in the first quarter, maybe, to use the [crosstalk] analogy. And so we have strategies that help, but at this point, we can’t walk away and just hand over the annotation to the computers. And I think this really does come about because we have very small training datasets.
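[Editor's sketch: the two tricks described above, blurring last year's annotation into a soft location prior and penalizing predictions whose total area is far from an expected value, can be illustrated in a few lines of plain Python. This is not the lab's actual model. The function names, the box-blur choice, the 6x6 scene, and the pixel areas are all assumptions made for illustration.]

```python
def box_blur(mask, radius):
    """Average each cell over a (2*radius+1)^2 window, clipped at the edges."""
    rows, cols = len(mask), len(mask[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total, count = 0.0, 0
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    total += mask[rr][cc]
                    count += 1
            out[r][c] = total / count
    return out


def area_penalty(pred_mask, pixel_area_m2, target_area_m2):
    """Squared relative error between predicted and expected guano area."""
    pred_area = sum(sum(row) for row in pred_mask) * pixel_area_m2
    return ((pred_area - target_area_m2) / target_area_m2) ** 2


# Hypothetical 2010 annotation: a 2x2 guano patch in a 6x6 scene.
mask_2010 = [[0] * 6 for _ in range(6)]
for r in (2, 3):
    for c in (2, 3):
        mask_2010[r][c] = 1

# Soft prior for 2011: "it's vaguely around here".
prior_2011 = box_blur(mask_2010, radius=1)
```

[In a real pipeline the blurred prior might enter as an extra input channel or a weighting on the loss, and the area penalty would be added to the training objective. Both of those choices are assumptions here, not something stated in the interview.]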

HEATHER: 33:35

I had mentioned the whale problem. The whale is the extreme epitome of this. So whales are not impossible to find in satellite imagery, but it’s a vast ocean, and so it’s very hard to build a big training data set. So what we had to do was take drone imagery and downsample the drone imagery to simulate what a 50-centimeter satellite image might look like. So taking an even higher-resolution image and downsampling it to train those models. And so we have developed CNNs to detect whales in satellite imagery. And whales are like the holy grail because whales are both highly charismatic and they are of conservation concern. And yet they are brutally difficult to survey by boat because it’s such a vast ocean. Where satellites are sort of uniquely perfect for this because, in theory, they could just scan the oceans and look for whales that were at the surface of the ocean. And there’s some statistical modeling to get from there to an abundance estimate, but at that point, you have the data that you would need.
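[Editor's sketch: simulating coarser satellite pixels from finer drone pixels can be approximated by averaging non-overlapping blocks. The 5 cm drone resolution, the factor of 10, and the checkerboard test image below are assumptions for illustration. A real sensor simulation would likely also model the sensor's blur and noise, which this sketch ignores.]

```python
def downsample(image, factor):
    """Average non-overlapping factor x factor blocks of a 2D grayscale image."""
    rows, cols = len(image), len(image[0])
    if rows % factor or cols % factor:
        raise ValueError("image dimensions must be divisible by factor")
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            block = [image[rr][cc]
                     for rr in range(r, r + factor)
                     for cc in range(c, c + factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# Hypothetical: 5 cm drone pixels -> 50 cm "satellite" pixels is a factor of 10.
# The toy drone image is a checkerboard of 10x10-pixel squares.
drone = [[float((r // 10 + c // 10) % 2) for c in range(20)] for r in range(20)]
satellite_like = downsample(drone, factor=10)
```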

HEATHER: 34:38

So one idea that I had, and I haven’t acted on this because things sort of fell apart with COVID, but I was planning this project to bring in a studio artist to create photo-realistic images of what whales would look like in satellite imagery. I think an artist could look at the few examples that we have and imagine instead of doing all the traditional data augmentation schemes with rotation and reflection and all that, that they could literally just wholesale generate new training datasets. Paint them. And the question I had technically is, could we do data augmentation through studio art essentially and improve our algorithms for finding whales? So I had this idea of this art exhibit that I wanted to do, which would be to teach people about computer vision through the lens of art. So thinking about, what would a red, green, blue-- what are spectral bands? You know what I mean? So kind of starting from a photography lens of like well, you have the red and the green and the blue and kind of explaining all of these ideas for the layperson. There was kind of a blank spot in my exhibit, where I was trying to explain backpropagation in an easy way, which wasn’t--

SUSAN: 35:54

That’s a tricky one.

HEATHER: 35:54

I had some ideas for how to do this. But in any case, the idea would be to sort of walk them through the key ideas of computer vision. And then you’d have this whole exhibit on these paintings that were of whales in satellite imagery. So anyways, I’d been in the midst of planning that when COVID hit. And then all the art spaces-- I had one in mind, in particular. Suddenly, we were looking at, “Well, maybe you could apply to do this in 2022.” Then, 2023. And things just got pushed back. So I wrote a TV pilot instead, which was a whole other artistic endeavor. I had to put it on the back burner. But I’m still interested in this idea as to whether artists just through their power of creative imagination could be helpful for some of these computer vision problems for rare targets. I think it’s a very interesting problem.

SUSAN: 36:38

That is fascinating. I love that idea. And again, it’s kind of one of those thinking outside the box, combining the boxes ideas that’s really cool. So you mentioned a TV pilot. Is that something that you can talk about at all or is that top secret?

HEATHER: 36:50

Sure. No. No. So the North Fork TV Film Festival has a Sloan script science competition or Sloan science script competition to sort of support the development of pilots that would be related to science and technology themes. And for a couple of years, I was one of the judges, one of many judges, of this contest. And especially with COVID, people are sitting around with a bit of extra time on their hands. And I thought, “Well, I should try writing a TV pilot that has a science or technology theme.” So I did. I set about learning about screenwriting, and I audited a screenwriting course here at Stony Brook. And I had this idea: the pilot centers on a professor who builds a predictive model - this will be very in line with your audience - that she doesn’t understand, right? Because no one understands how their big black box models work, no matter how well they work. And so the idea is that the series would explore these ideas of interpretable AI and some of the challenges of understanding AI. So in any case, she builds this model that actually works exceptionally well. So in the pilot, it predicts a Presidential assassination. She’s terrified now because she doesn’t know how the model works. But in any case, she will fall into the bad crowd as it were. And she will start selling her model’s predictions and laundering that money through her university’s advancements. [crosstalk].

SUSAN: 38:08

I’m sorry. As a former faculty member, that just cracks me up. I love it. [laughter]

HEATHER: 38:13

I would also like to sort of highlight, I think, the role of money in a modern research university. [crosstalk]. Not my own university, of course, but the generic university might be willing to look askance at sketchy money that’s sort of coming in through donors to build new fancy research buildings. And this particular professor, she really cares about her students. And I think her intentions are good. She wants to create the kind of learning environment that she thinks her students deserve. But at the same time, she falls in with kind of a tough crowd here. So there’s sort of international intrigue through these AI models. But I think, first of all, there are very few, none, dramas that focus on a university. So you’ve got your medical dramas, you’ve got your legal dramas, your cop dramas, but nothing set at an academic research university, so. There’s plenty of drama there. There’s plenty of drama. And I think AI, it surprises me that that hasn’t entered the modern conversation, I guess, in terms of dramatic television because it’s both sort of awesomely powerful and sort of awesomely scary. And so in one of the episodes that I sort of sketched out for the future, she is battling these sort of biased hiring algorithms in her university because HR is using AI to screen applicants for these various faculty positions. And so it seems like a vehicle to get at some of these ethical issues that I think people in AI worry a lot about, but outside, maybe not enough.

SUSAN: 39:48

Yes. Yeah. Absolutely. It sounds both entertaining and educational in that way for sure.

HEATHER: 39:54

I would hope so. So subsequent to that, I started writing little short screenplays that we could use in the classroom. So I’m very intrigued by this idea of using screenplays as a way of-- I teach statistics. So how do we-- could you use a courtroom drama scene to talk about some statistical inferential issues? Because a courtroom is all about making inference from sketchy data and how do you decide significant evidence of guilt? So I think that-- so I’m kind of exploring this idea of a book that would be kind of a series of things that could be read in a sort of very participatory way in the classroom, but would lead to a bigger conversation about these issues.

SUSAN: 40:30

That’s awesome. Now I want to watch your show and take your class and read the book. So sign me up for all of it. That’s very cool.

HEATHER: 40:37

Well, no, I think the joke was that if anyone picks up my TV pilot, I’m going to have to quit my job because it’s hard to be a faculty member and write a piece of fiction set at a university without it looking like it was inspired close to home, so. But I think that’s the benefit of having a tenured position, I guess, is you can think very broadly about how to incorporate other areas. Particularly in teaching. How can you use screenwriting for teaching? How can you use art for teaching?

SUSAN: 41:06

Absolutely. Really neat stuff. So you’ve mentioned some things that you have been working on and some of the potential directions for those. Other things that we haven’t talked about with your work that you’re excited about in the near or distant future that you’d like to touch on?

HEATHER: 41:21

Well, wait. I’m just sort of excited about everything, which this is how I end up, I guess, so overcommitted. But one of the things that I have been very excited about and continue to be is trying to learn as much from the world of quantitative finance as I can. So when I was on sabbatical, I audited a number of courses from the quantitative finance program here at Stony Brook. And trying to ask the question, “What are--” there are people that have a lot of money on the line to make good short-term forecasts and what should we be doing in ecology that we’re not doing? And I did this sort of deep dive into quantitative finance, not enough to become rich and quit my tenured job, but enough to realize that I think we’re actually on the right track. Which in terms of the skills that we teach our students, in terms of the time series forecasting techniques that we’re using, I don’t think that there are suites of tools that we’re just ignorant of out in quantitative finance. I think we’re using all the same tools. The problems are very hard, and we have less data to work with. But I kind of came back to ecology feeling good about our approaches. And in some sense, I think we’re on the right track.

HEATHER: 42:33

So I’m very interested in this idea of looking at portfolio risk, which is an idea from quantitative finance. So when we think about, let’s say, all the Adelie penguin colonies in Antarctica, how do we think of all of those colonies as a portfolio that is all contributing to the risk of extinction, say rather than looking at individual colonies one at a time? And so that’s something that I have a proposal out now to work on that idea. So that is sort of one area of the quantitative finance realm that I think that we can bring back to ecology, and it’s something that other ecologists have been thinking a lot about is given that we’re working with very volatile time series, how do we think about risk in a more structured way?
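[Editor's sketch: one textbook way to see why a "portfolio" view differs from a colony-by-colony view is that the variance of a sum equals the sum of all pairwise covariances, so colonies that fluctuate in sync are collectively riskier than colonies that offset one another. This illustrates that general identity only. It is not the method in the proposal mentioned above, and the growth-rate numbers are made up.]

```python
from statistics import mean


def covariance(xs, ys):
    """Sample covariance of two equal-length series."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def portfolio_variance(series):
    """Variance of the summed series: the sum of all pairwise covariances."""
    return sum(covariance(a, b) for a in series for b in series)


# Hypothetical yearly growth rates for two colonies.
colony_a = [0.1, -0.2, 0.3, -0.1]
in_sync = [colony_a, [0.1, -0.2, 0.3, -0.1]]        # colonies move together
offsetting = [colony_a, [-0.1, 0.2, -0.3, 0.1]]     # colonies cancel out

risk_in_sync = portfolio_variance(in_sync)
risk_offsetting = portfolio_variance(offsetting)
```

[Here the in-sync pair has four times the variance of a single colony, while the offsetting pair has essentially none, even though each individual colony is equally volatile in both cases.]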

SUSAN: 43:14

That’s so interesting. I know the finance folks who are listening to this are going to be really interested and excited to hear about how their field is contributing to what you’re doing, too. That’s neat.

HEATHER: 43:23

Absolutely. And there are people who have worked in both finance and sort of quantitative ecology very successfully. So there’s a tradition of people dividing their time in that way. And I certainly have students who have maybe worked in firms in quantitative finance who have come back to graduate school that are interested in these ecological applications. So I think it’s an exciting area. And yeah, one that I think will pay dividends - no pun intended - over the next decade or so.

SUSAN: 43:52

No pun intended, but happily received, yes.

HEATHER: 43:55

Yeah. Yeah. Exactly. I’ll laugh at my own joke, for sure.

SUSAN: 43:57

That’s a good one. I like it. So we have a question that we ask everybody who comes on the podcast, and we call this the alternative hypothesis segment. So I’ll ask you this question as well. It is, what is something that people often think is true about data science or about being a data scientist in your particular area that you have found to not be true?

HEATHER: 44:19

Oh, boy, that’s a good question. So it’s funny, actually. This kind of comes back to, what is data science? And I wonder how much of what’s called data science used to be called statistics and I kind of wonder where statistics fits into this canon? Because I teach statistics and I think very traditional theoretical statistics. And I see data science as being this really hot area. And somehow it feels like statistics is the forgotten twin that got left at home when everyone else went to the party. [laughter] And so I guess from my perspective, I think that I’m surprised how little statistics there is in data science in a lot of these data science fields. And so, I don’t know, I don’t think that’s keeping anyone out of data science. I don’t think anyone’s looking at data science and saying, “I could never be a data scientist because I don’t like math.” I think people plow ahead. But actually, I don’t think that you have to be a mathematical theorist to be really successful in data science. I think there’s a lot of scope for people who really like to program and who are very good algorithmic thinkers but couldn’t necessarily integrate their way out of a box. So I think some people might think that they’re not capable of doing it because they may have run up against some real challenges in mathematics. And certainly, mathematics is a part of data science, but there’s a lot that has databases and all of those aspects where I think that people could be very successful, even without that theoretical background they might fear that they need, if that makes sense.

SUSAN: 45:53

Right. Yeah. Yeah. Absolutely. What do you tell your statistics students about data science?

HEATHER: 45:58

Well, that’s where the jobs are. [laughter] Students take my class because it’s a required element, and it is the broccoli that they need to eat in order to advance in our program. But I do wish sometimes actually that there was more statistical theory in these data science programs, let me put it that way. I think that sometimes I worry that we are training data scientists who may go out to reinvent the wheel because they’re unaware of the statistical underpinnings - the sort of statistical theory - behind what they’re doing, if that makes sense, because it can be very phenomenological and it sometimes feels a little bit devoid of the statistical theory that I’d like to see it connected to. While I think that the sort of math-free nature of a lot of data science is an opportunity, I think it can come at a cost because I risk that we’re reinventing the wheel.

HEATHER: 46:52

And I’ll just give an example a little bit with physics. So physicists don’t have strong statistical backgrounds by and large. My husband and I were married for, I know, 15 years before we realized what he calls photon statistics is just a Poisson distribution. So it’s like physicists also, I think in many cases have reinvented the wheel because they lack the statistical foundation to know that photons follow a Poisson distribution. And so I do worry that data scientists are also in some sense reinventing the wheel because they’ve forgotten hundreds of years of important statistical theory that underpins much of what they’re doing. But that’s a very self-serving desire since those are the kind of classes that I teach. And of course, I would think that they’re important, so. [laughter]
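[Editor's note: for readers who want the formula behind the remark above, the Poisson distribution gives the probability of counting k events when the expected count is λ. The symbols N, k, and λ are our notation, not the speaker's.]

```latex
P(N = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}, \qquad k = 0, 1, 2, \dots
\qquad \mathbb{E}[N] = \operatorname{Var}(N) = \lambda
```

[Because the variance equals the mean, the typical fluctuation in a count scales as the square root of λ, which is the behavior physicists usually describe as shot noise.]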

SUSAN: 47:39

Well, it’s interesting though. And for people who are practicing data scientists currently, would you have suggestions for those who want to maybe build their knowledge of the statistical theories?

HEATHER: 47:48

And I don’t know that there are a ton of online courses that would touch on this in part because there’s not a huge demand of people wanting more stats or probability density functions in their life. But I think that if there is an opportunity to go back and to take some of those classes and to think about theorems and limiting inequalities and all this stuff, I think that the time spent will be worthwhile. Yeah. And so even though I had left physics, and you’d think that I would have all the math that I would need, I had to go to the statistics department as a graduate student. And so all the statistics I took had this more theoretical flavor to it. Then, it was amazing how actually none of that is covered in the undergraduate physics curriculum, unless that has changed in the 20 years since. I don’t think it has. Yeah. But the fact of the matter is that a lot of my students go on to data science careers. I am blessed that they have those opportunities because the faculty job market is limited and all faculty worry a lot about graduating students that have nowhere to go. And that is not true now. So I have students that work at Google. I have students that work at Amazon. These students have great, interesting, intellectually rigorous careers that are waiting for them. And that makes my job a lot easier because I feel really good either giving them a leg up into a world that wants and needs them, which is not the case that a lot of other academic fields-- that’s not the world that they live in, so.

SUSAN: 49:18

Interesting. So I’m just curious. Are you heading back to Antarctica anytime soon? Or is that on hold for the moment?

HEATHER: 49:24

Well, I can’t imagine anything less appealing to many people than getting on a cruise ship and going very far away from medical help than right now. I think the last field season was a complete loss. This field season, I might have one student able to get down on a yacht. But I think the Antarctic cruise industry that we rely on to get to the Antarctic is going to be hurt for a very long time. And even the US, the national programs were largely shut down last year because of the concern about bringing COVID to the stations during those personnel swaps. So it’s a tough time to do Antarctic fieldwork. And I will say, I’m glad that I have this other wing of my work, which is on the data science software sort of engineering, that end of it. Because otherwise, we’d be twiddling our thumbs. And so we’ve got plenty to keep us busy, but it might be a while before we’re able to resume in earnest our penguin counting. So luckily, the satellites continue to take pictures. It kind of highlights why the satellites are so valuable. But that kind of boots-on-the-ground work, it might be a few years before things are back to normal.

SUSAN: 50:25

Yeah. Yeah. Hopefully not too long. [music] Well, Heather, thank you so much for taking the time to talk with us today. Really appreciate it. It’s been really fun to hear about all of your work, and I’m excited to follow it in the future.

HEATHER: 50:35

Oh, well, thank you so much. I really appreciate the opportunity.

SUSAN: 50:40

Thanks for listening to our Data Science Mixer chat with Heather Lynch. Join us on the Alteryx Community for this week’s Cocktail Conversation to share your thoughts. We don’t all get to study penguins and whales, unfortunately, but many of you are likely using different computer vision applications in your work. What’s your favorite example of how images have provided special insights in your own data science projects, or alternatively, what’s a way you’d like to use images and computer vision in the future? Maybe you have a creative source of images in mind or an innovative analytic method you’re cooking up that you’d like to share. Share your thoughts and ideas by leaving a comment directly on the episode page at community.alteryx.com/podcast or post on social media with the #DataScienceMixer and tag Alteryx. Cheers.


This episode of Data Science Mixer was produced by Susan Currie Sivek (@SusanCS) and Maddie Johannsen (@MaddieJ).
Special thanks to Ian Stonehouse for the theme music track, and @TaraM for our album artwork.