The project has just passed its halfway mark, and we are cruising through our data collection phase. This summer will be very busy for the project and critical to its success.
The Boston Museum of Science study continues to chug along smoothly. The Research Assistant works two shifts per week with the MoS staff out on the Museum floor, collecting data from children viewing the 2D vs. 3D slides. Our goal is to reach 400 test subjects. At the current rate, we should reach that in July. However, it is possible that recruitment will pick up in the summer as the Museum becomes busier.
Once we have the data, I’ll begin crunching it quickly. I’ll work closely with one of my Co-PIs, a professor at the University of California, Santa Cruz, to prepare a set of mini papers and presentations for educational research conferences next year. Most of those have submission deadlines late this summer, so we won’t have much time to play with the data. It will be all work. Luckily, my family and I are moving to an apartment next to a beach on Lake Michigan in Chicago this summer, so I hope to set up shop with the laptop and an umbrella so as not to miss the summer sunshine.
The Adler study is a little behind schedule, but it should still finish on time. Our narrator just finished recording her tracks, and the production team is putting the final touches on the films. We plan to begin showing the two films at the Adler Planetarium in their Space Visualization Lab during the second week of June. Stop by if you want to participate! The plan is to do three screenings per day, five days a week. If we can get 20 people per screening, we should reach our data collection goal for this project in mid-July. We are scheduled to work until the end of August, which gives us some room for error (sick days, slow days, equipment trouble, etc.). If we do reach the goal early, we have a month to collect extra data and/or run an entirely new, bonus study.
Basically, we should be done with all data collection by the end of the summer. We have to be, really, since our data collection funding runs out then. That gives us an entire year to analyze results and write papers. For me, that is the fun part. Right now we plan on at least three papers for major science education research journals. I expect we’ll end up with many more, and that doesn’t count conference presentations. At this point, no one else has studied this question, yet development of 3D in classrooms and informal settings continues at a fast pace. There continues to be a need for this information.
A screen shot from a draft of the film about the Milky Way galaxy.
This is a report on the 2nd study of this project, taking place at the Adler Planetarium.
The first film, about Type Ia supernovae, is just about in the can. It runs about 7-8 minutes. I think it is a little dense on content, but we slowed down the narration to compensate, and some small test screenings have gone well. The second film, about galaxy morphology, is currently being written. The plan is to storyboard early next month and have the film ready for pilot testing in April. If all goes well, both films will be showing to the public in the summer, and we’ll be collecting data the entire time.
We have looked at the pilot data from the first film’s test sessions. For technical reasons we could only show the 3D version, so we don’t have any 2D vs. 3D comparison data. In the data we do have, we don’t find any relationship between correct answers and the spatial cognition scores (nor any effect of gender or age). But that’s not our core question, which is whether those scores are related to the difference in correct answers between 2D and 3D. Plus, our sample size was tiny (N=33). The pilot was mainly about testing the software, procedures and the test items.
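For readers who like to see the plan in code, here is a minimal sketch (in Python) of how that core question could eventually be tested: a regression with an interaction between viewing condition and spatial score. The data file and column names below are hypothetical placeholders, not our actual analysis pipeline.

```python
# Minimal sketch of the core analysis; file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

# One row per participant:
#   condition     -- "2D" or "3D" (randomly assigned)
#   spatial_score -- spatial cognition pre-test score
#   correct       -- number of post-test items answered correctly
df = pd.read_csv("pilot_data.csv")

# The interaction term asks: does the 2D vs. 3D effect on correct answers
# depend on the participant's spatial ability?
model = smf.ols("correct ~ C(condition) * spatial_score", data=df).fit()
print(model.summary())
```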
We had four test items: two multiple choice items, each followed by an “Explain your choice:” text box, and two “draw and label” questions. We’ve opted to take the “explain” answers and use them to derive 4-5 answer options, which will replace the prior multiple choice options. The reason is that it will lower the time it takes to complete the pre- and post-test. Right now each test takes about 15 minutes, meaning 30 minutes of testing in total. That’s unacceptable for an audience that is attending the planetarium for fun! We need to get it down to 8-10 minutes. As a rule, I hate multiple choice tests. I’ll save that rant for another time. But one way to make them slightly more palatable is to use the open-ended (“explain your answer”) data to create multiple choice options that reflect authentic thinking and are not artificially generated by an outside person. So I think we have a good compromise here.
As for the drawing questions, we decided to drop one of them. Looking at the data, there is a greater difference between the pre- and post-test answers on one of the items than on the other, which implies it may be more sensitive to differences between groups, so we kept that one. Here is a sample pre- and post-test drawing made by one of the pilot participants:
The item asked them to draw two white dwarfs merging. The first drawing shows two stars with a black hole in the middle. It is interesting that the person’s prior knowledge led them to think there is a black hole between two merging stars (which is a very complicated concept to decipher). The post-test drawing still has the black hole, but now has lines to indicate an explosion, or greater luminosity. It also has arrows showing momentum. Are those lines orbital paths of two stars? Or do they represent the surfaces of two stars? The answer matters because it changes the interpretation of the arrows, which could indicate orbital motion or a spinning sphere or disc. This is why drawing questions are so tough to score. They can often reveal much more nuanced understanding by the participant, but scoring them requires sense-making by the scorer (a.k.a. “grader”), thus introducing a source of noise. In education research, I fall into the “mixed methods” camp, which holds that the best research uses both qualitative and quantitative methods. That is why this study has both traditional test questions and these drawing tasks.
The recruitment challenges we reported on recently with the Museum of Science (MoS) have been largely addressed, thanks to near-Herculean efforts by their staff. They gave us a new location on the Museum floor that is open to the general public. Since then, recruitment has been much easier and the ages of the participants have increased.
I also have a glimpse at some (very) early results. Recall that in this study children aged 5-12 are shown pictures of scientific concepts and objects. The pictures are chosen randomly to be in either 2D or 3D/stereoscopic. They are then asked two questions about it. One is a specific question about a spatial property of the object (Ex: Which is closer to the camera, the tree on the left or the right?). The second is a question about the implications of the first question (Ex: If you were in a hurry to run to a tree, which would you run towards?).
Below is the graph of results of two questions we have about a picture of a bee. The first question is about the shape of the bee’s tongue. The second is a question about how the bee uses the tongue to sip nectar from different shaped flowers.
The results show no difference in the number of “correct” answers to the first question. However, there is a difference in the number of “correct” answers to the second question, the one about implications. This could mean that 3D/stereoscopy doesn’t have much effect on immediate image processing but has a larger effect on interpretation of the image. The sample behind this graph is large enough to make the result statistically significant at the .01 level. However, the sample is still only about 1/6 the size we hope our final sample will be, so this result could easily change. Also, it’s only one of 15 similar images we are testing. So this by no means qualifies as a result on its own. But it’s a hint of what could be, and an illustration of how we intend to use this data.
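A side note for the statistically inclined: a difference like this, between the proportion of correct answers in the 2D group and in the 3D group, can be checked with a simple chi-square test. Here is a minimal sketch in Python; the counts are made-up placeholders, not our data.

```python
# Illustrative only: the counts below are placeholders, not project data.
from scipy.stats import chi2_contingency

# Rows: 2D group, 3D group; columns: correct, incorrect (one question)
table = [[30, 40],   # hypothetical: 2D group, 30 correct / 40 incorrect
         [45, 25]]   # hypothetical: 3D group, 45 correct / 25 incorrect

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")
```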
Data collection continues at the Boston MoS through the spring. We hope to have it wrapped up this May, and then we can get our hands dirty with the analysis. We’ll probably present some type of early result at the American Astronomical Society meeting in Indianapolis in June, and hopefully at either the National Association for Research in Science Teaching (NARST) or American Educational Research Association (AERA) meetings in spring 2014.
The spatial cognition test that is part of the Museum of Science study is based on the Purdue Visual Rotations Test. It is a 20-item test, of which we randomly chose 5 for this study (we don’t have time to administer all 20). So far, we have run the test on 17 children. Below are the results:
Item #1: 47% got it correct
Item #2: 18% got it correct
Item #3: 0% got it correct
Item #4: 18% got it correct
Item #5: 29% got it correct
Each item on the test has five possible answers, so if the children guess on an item they have a 20% chance of getting it correct. You can see that the results for items 2 and 4 are close to what chance alone would produce; we don’t need a statistical test to see that. But items 1, 3 and 5 are offset from 20% by different amounts. So can we trust those results as legit? That is, is each result due to an actual measurement of the child’s spatial ability and not due to chance? 47% is pretty high, so my gut says that’s a real result. 29% is not that far from 20%, so I’m not so sure about that one. And the 0% result from item #3 is somewhere in the middle.
So I wanted to run a statistical test called a t-test. It’s one of the most basic statistical tests you can run and is often taught in the first week of Stats 101. The t-test assumes that your data follows a pattern commonly seen in nature known as the Normal Distribution, so to run it you first need to check whether your data meets this qualification. Below is a plot of the responses to items on the test. The curved line is an idealized version of the Normal Distribution.
Our data follows the distribution relatively well, so I ran the test. Results are often reported in terms of p-values. The most common threshold used is p=.05, which roughly means that, if chance alone were at work, you would see a result like this only about 5% of the time. In the social sciences, a value of .05 or below is often reported as “significant”.
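For the curious, here is roughly what that kind of per-item test looks like in code. This is a sketch only: the responses below are invented (1 = correct, 0 = incorrect), and the test simply asks whether the proportion correct differs from the 20% chance rate.

```python
# Sketch of a one-sample t-test against chance; the responses are invented.
import numpy as np
from scipy.stats import ttest_1samp

# One item's responses from 17 children: 1 = correct, 0 = incorrect
responses = np.array([1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

# Does the proportion correct differ from the 20% expected by guessing?
t_stat, p_value = ttest_1samp(responses, popmean=0.20)
print(f"proportion correct = {responses.mean():.2f}, p = {p_value:.3f}")
```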
Our tests resulted in the following values:
Item #1: p=.002
Item #2: p=.083
Item #3: n/a because no one got a correct answer!
Item #4: p=.083
Item #5: p=.020
So this tells me that items 1 and 5 may indeed be measuring something in the children. Items 2 and 4 cannot be differentiated from chance, so right now those items are not very helpful. And item #3 was likely just too difficult for this age group.
This is what pilot testing is all about. I’ve decided to toss out items 2-4 and replace them with new ones from the pool of items in the Purdue test item bank. I took the new ones from the front of the item bank, which presumably means they will be easier. We’ll analyze the data again after about 15 more children have taken the test. Eventually, we hope to have items that work for all the children. Then we start collecting real data.
Caveat: In real life *much* more goes into the choice of items than just sensitivity to the population. I’m focusing only on one aspect here to act as an illustration of how the pilot test process works. Also, a sample size of 17 is hardly good enough to run a t-test on real data. That is, I’d never be able to publish such a result! :) But, again, for our purposes it works fine as an example of how this process works.
I spent the last week of September at the Museum of Science (MoS) in Boston. The goal of the trip was to set up and pilot test our stereoscopic kiosk for our second study, which is about how children perceive spatial information in 2D vs. 3D images. The week was productive, but not as much as I had anticipated. The biggest challenge appeared the day I walked in the door: our research location had been changed!
I wrote this grant two years ago. It took about 9 months to be reviewed and awarded by the NSF. The project officially began October 1, 2011, and this last year has been spent building things. During that time, the rest of the world wasn’t going to stop and wait for us. When we wrote the proposal, the MoS had a space where I could do research on the museum floor with children aged 7-12. However, that space has since been demolished! The MoS has begun a major renovation to install a new permanent, cutting-edge exhibit called the Hall of Life. So now the only space they have for me is in their Discovery Center, an area restricted to children up to age 8!
The Discovery Center is a great, fun place with helpful, smart staff and enthusiastic parents. The problem is that the children are much younger than those my study was designed for. All of my test questions were written for children with some experience in science. Indeed, the entire design of the test was predicated on the child being able to read. So this was, and still is, a big problem.
There is no other place in the museum for us, so we tried to adjust the test to make it age appropriate. We simplified some of the test questions. We also changed the test protocol so that our research assistant now reads the questions to the child and points to the screen, instead of the child doing it on their own.
But one thing we cannot change is the spatial cognition pre-test. Recall that this is a critical part of our research, used to measure the prior spatial ability of the child. This is important because our ultimate research question is how spatial ability is related to how children perceive 2D vs. 3D images. I think our spatial test is going to be too hard for most of these children and give us a floor effect in the data. There is no easy solution, as there are no spatial visualization tests for young children that do not involve a human proctor and a significant amount of time (our test session time is limited to 15 minutes by museum rules). For now, we’re using this test with the plan to look at the data and see how sensitive it actually is. I’ll have a report in a few days.
This is just one example of how the age difference, while it may seem small, has major implications for this study. There are lots of other ways in which our fundamental research questions have had to change due to this limitation. Our “semester” at the MoS ends in December. At that point we’ll look at all our data and decide whether we can continue with this plan or whether we need to make other adjustments.
It’s an example of what happens in a real-world research experiment. Unlike laboratory exercises, we face many additional sources of complication, so we have to be flexible in our design. The upside is that our experiment is happening in a somewhat more authentic environment than a lab, so we’ll be able to say a little more about the results. And, ultimately, the goal is to have results that are applicable to real life. But the lab vs. in situ debate will have to wait for another blog post.
Last week we pilot tested the iPad assessment software and items for the first film at the Adler Planetarium. We offered free guest admission (a $12 value) in return for watching the film and taking the pre- and post-tests. We did three showings of the film for a total audience of around 45.
The film is about 75% completed. Stock footage is used in a few places and some 2D/3D modeling work still needs to be done. Some of the scenes are taking over a week to render on their computers!
Overall, the process piloted well. People were more comfortable with iPads than I had expected (we only had to help a couple of people), and no one complained about the length or complexity of the assessment items. The coordination of taking both tests and uploading the data was virtually flawless. We learned some small things (such as how stacking iPads with Smart Covers can leave them all in an “on” state, draining the batteries), but nothing major. I think, from a user perspective, they are ready for prime-time deployment as assessment devices for the general public.
I have not looked at the data yet, other than to confirm it is there, so I don’t have any item statistics. I’ll do that in a few weeks and will post a summary of results. However, I can already tell that we will need to time the tests. Some people flew through them in a few minutes; others took 15 minutes and never moved beyond the 6-item spatial cognition pre-test!
The most difficult part of recruitment was that school had already started for many families, so attendance at the planetarium wasn’t very high. Most people, when invited, were interested, and we received only one response that we’d classify as “rude”. :) The most common reason people declined to participate was time (they were only going to be at the planetarium for a limited time); the second was language (lots of international tourists).
During the final showing, the film’s audio fell out of sync and we had to pause a couple of times to fix it. That was frustrating, but it’s exactly why we pilot test these things. The audience was supportive, and some even said they found it interesting to watch the process and be a part of it.
Here is a photo of a few people taking the pre-test.
It’s been two months since the last update. Since then I’ve changed jobs and moved from Boston to Chicago. My new position is Manager of Research and Evaluation at the Museum of Science and Industry, Chicago. I’m still running this project, though, as a contractor to the AAVSO. Moving a family 2,000 miles slowed some of the work over the summer, but it didn’t stop it. Our app has been submitted to Apple for publication in the App Store. Our data collection shifts at the Boston Museum of Science have been scheduled. And we are two weeks away from pilot testing our first film at the Adler Planetarium with the public. Data collection for both studies begins in October! There is much more besides, so reports should be coming fast and furious over the next few months.
Our 2nd study will consist of a test given to people before and after they watch a 2D or 3D film about supernovae or galaxy evolution. The test will be given on an iPad; we chose that technology (as opposed to traditional paper on clipboards) for a number of practical reasons.
We hired a contractor, Clockwork Active Media, to code the software and they are great to work with. They are very professional, talented and full of initiative. I highly recommend them for any of your research instrumentation needs.
Below is a mockup of some of the items. We plan to pilot test them this August at the Adler Planetarium in Chicago. The final software will be released under an open source license (probably via SourceForge) in a year or two. So if you are interested in using iPads in your research, let us know! We can likely give you an early copy of the code if needed.
I ran across an interesting statistic in the book Scorecasting. The authors dedicate a chapter to whether there is a home/away effect in sports (there is, but not for the reasons you think). They mention that the percentage of free throws made in basketball does not change between teams playing at home and on the road. Across the roughly 20,000 games analyzed, the free-throw success rate was 75.9% for both home and away teams.
That surprised me. I’m not a huge fan of basketball, but I do recall seeing scenes like this where fans go crazy trying to distract free-throw shooters:
I can understand how these tricks might not work very well, but to not work at all? With a sample size of many thousands of games, your statistical power is enormous; you can usually detect even tiny effects. Yet the success rate is exactly the same, down to a tenth of a percentage point. The fans have no impact on the shooter’s ability to deliver the basket. Why not?
It got me thinking about 3D. When we look at shots like those above, we are seeing them in 2D. There are very few pictorial cues to differentiate the fans and their props from the basket. In real life, though, the basketball player has the benefit of depth perception to separate the basket from the fans. The television/computer screen thus amplifies the distraction by flattening everything into one plane.
I’d like to watch a basketball game in 3D to test this theory. Or better yet, get tickets to the Garden. Sadly, tickets for two probably cost more than a nice 3D TV these days. I may have to stick with the glasses.
Spring is the travel season for science education researchers. The American Educational Research Association (AERA), National Association for Research in Science Teaching (NARST) and National Science Teachers Association (NSTA) meetings all usually fall in March or April. Also, the NSF’s Informal Science Education (ISE) division has a semi-annual conference that ISE PIs are expected to attend, and it is also typically in March. So not much actual project development work gets done. I was gone two of the last four weeks, and the other two were spent making up for lost work in the day job.
Still, I did visit Worcester Polytechnic Institute to give a talk on the project to their Learning Sciences department. And today I gave a virtual talk to a spatial cognition class at Penn State that focused on this project. Part of running these big, grant-funded projects is “flying the flag”: giving talks, presentations, posters, etc. to let the greater community know what you are doing. My graduate advisor used to say that research only turns into knowledge when it is shared.
I was talking to one of my Co-PIs today about the iPad instrument we are using for the pre- and post-tests around the films. We are discussing item formats. We all know that multiple choice items are the worst type of test item ever; they are very, very coarse measures of knowledge, at best. But not everyone is comfortable using an iPad to write out a few sentences or a paragraph. She suggested we have people answer the questions with their voice.
Brilliant! My head was stuck in the paper-and-pencil past and I was not using the full capabilities of the iPad. I think what we’ll do is give each person the option to type an answer using the on-screen keyboard or to “click here to speak your answer”. Of course, that means someone will have to transcribe the responses. But one step at a time. :)
And so ends the most boring post in the project. I promise things will get interesting soon as we begin developing real products and testing stuff, i.e. data!
Last week I had an interesting learning experience visiting the Space Visualization Lab at the Adler Planetarium. Four of us on the Two Eyes, 3D project met there to discuss the first film and methodology. I have never been involved in the development of a planetarium/science visualization show before, so I was very interested in the process.
It began before the meeting, when I assembled the stereoscopic design principles we plan to use. I also wrote a first draft of the script for an 8-minute show on Type Ia supernovae. I had no idea where to start, so the script was truly a stab in the dark.
We began by talking about the principles. I was happy to hear from some very distinguished and experienced producers/directors that some of the principles are very much common sense in the industry. That doesn’t mean that everyone follows them! One example given was the recent movie The Owls of Ga’Hoole, whose director said he sought to break all the 3D rules when making the film. Still, there were many points where fidelity to the principles meant making changes to the script.
After discussing the principles, we went over the script. We decided to narrow the focus from Type Ia supernovae in general to SN 2011fe in particular. It is the brightest supernova in over 20 years and, as such, has provided a ton of data to astronomers trying to figure out exactly what causes these explosions. There is an interesting narrative behind SN 2011fe that we can exploit. Our only major concern is that, since it is topical, the science could change quickly over the next few years as the data is analyzed and published. So we’re taking a risk by aiming at a moving target!
I went back to the hotel room in the evening and rewrote about 70% of the script on my iPad. The next morning we went over the new script, changed a few things (such as the opening) and storyboarded it. Below are some images from our excellent storyboard artist, Julieta Aguilera. The next step is to polish the script this week and send it back. Then the Adler folks begin production. My next step is to direct a recording session for our scratch-track narration this May. We’ll reconvene in mid-July to do some pilot testing of the research methodology and focus group testing of the film. The premiere is tentatively set for next November or December.
Regarding the methodology, we also came up with a draft assessment instrument design and blocked out how we will recruit participants and administer the assessments.
So there is still a lot of work ahead, but we got a lot done. I was worried about some aspects of the creative process, having never done anything like this before, but it went well and we accomplished more than expected. Now I have to get back to working on study #1, which I don’t think I’ve described yet. That will be the next post…
Below is a video of Dr. Mark SubbaRao, one of the Co-PIs on the project and the Director/Producer of the films we are making. He’s talking about gravity and galaxies using a simulation he created. Our second film will be on this exact topic.
I’m heading to the Adler Planetarium this weekend to work on scripts, treatments and storyboards for our two films. We’ll also be checking out the space and brainstorming over some research methodology questions.
In preparation, I read 83 evidence-based papers on using stereoscopy to convey information. The goal was to distill them into a set of design principles for our films. Remember, these are educational films about galaxies and supernovae. Entertainment is important, but only insofar as it assists learning. So we are less interested in “awe” than in thoughtfulness.
I’ve come up with four principles that we’ll work with. It’s definitely a work-in-progress and I expect them to change by the time the project is completed. More papers will be found, discussions with practitioners and theorists will bring up new ideas, and of course our data will have the final word. But it’s a good start.
I can’t share them publicly here because, frankly, I put a lot of work into this and feel some ownership of the I.P. I’d hate for someone to take the work and publish it as their own. However, I’m willing to share them via e-mail with people who are working on like-minded stuff. Just drop me a note.
I’ll report from Chi-town in a week or so.
This paper was recently published by Sarah Ting, Tele Tan, Geoff West, Andrew Squelch, and Jonathan Foster, most of whom are with Curtin University in Australia. Their goal was “…to quantify and compare the human brain’s response to 2D versus 3D images using EEG technology.” EEG stands for electroencephalography, a process of monitoring electrical activity in the brain (using equipment like that in the picture below). I don’t have a background in EEG, so I cannot comment on their procedures. However, their research design and results are interesting.
They showed a person a series of identical images (such as a series of squares) with a nonstandard image (such as a circle) occasionally mixed in. This “oddball” is well known in the field to generate a specific electrical response in the brain. For this study, instead of squares they used stereoscopic cubes, and instead of circles they used stereoscopic spheres. Finally, they created three versions of each cube and sphere based on how occluded they were: one version showed the entire object, another had a cloud-like obstruction blocking 30% of the object, and the last had a cloud-like obstruction blocking 60% of it. Eleven subjects were tested with both 2D and stereoscopic versions of the images, and each was given some practice time at the start to become familiar with the stereoscopic system.
They found that the amplitude of the EEG signal they were monitoring differed between the 2D and stereoscopic stimuli, and that the difference was related to the level of occlusion. They also found that at 0% occlusion there was no difference in response time between the appearance of the stimulus and the measured signal (in fact, they suggest in the conclusion that the stereoscopic response may have been faster). At 30% occlusion there was still no difference. However, at 60% occlusion there was a difference between 2D and stereoscopic images, with the stereoscopic response being more delayed.
This made me happy because it supports the findings in my first paper: stereoscopy increases cognitive load, and the amount of the increase is related to the complexity of the stereoscopic image. In fact, that is how this study came across my desk. They cite the paper and say that “[its] findings are consistent with this study…” and that “…[our] inference is consistent with [their] findings.” The inference being that the human mind, when seeing things on a flat screen, is conditioned to interpret them in 2D. So even if an image is stereoscopic, the brain converts it back to 2D before analyzing it.
That paper was my first real education research paper, so seeing the results cited and confirmed is exciting for this young researcher. An old hand may be looking at this and laughing, but a first is still a first!
Their research was reported as a pilot study, so I’m very excited to see what this team does next.
EEG display (Wikimedia Commons)