So one of the items on the slide show test at the MoS Boston is a picture of a hurricane from the International Space Station. Some of the children saw the picture in 2D and some saw it in 3D (both groups wore glasses). After they looked at it, we removed it from the screen and asked them to draw what they saw.
We had two people code the drawings (see last post for details). We gave them two rubrics: one about how the child drew the eye of the hurricane and the other about how they drew the overall shape.
Our analysis, so far (i.e. it’s very early and could change), shows that there is no difference in how the children drew the eye and there is a large difference in how they drew the shape of the hurricane. The latter is mainly through an increased number of children who drew it as a spiral as opposed to a donut. One possible meaning of this could be that 3D, in this experimental condition, increased the ability to denote shape but not structure. This surprises me, and goes against my hypothesis, because structure is more dependent on spatial depth than shape is. Nevertheless, the data is what it is - and the difference seems pretty robust at this point.
Next, I need to check the data and analysis to confirm the result. Then I will start to see if the result is related to any other factors we recorded (covariates) such as prior spatial ability, gender, age, etc.
The slide we showed.
So we are deep into the data analysis phase now. All the data has been processed, stored, backed up, etc. ad infinitum. Now we start looking for signals and stories.
Recall that for both of our studies we had participants answer multiple-choice questions and also draw pictures. The multiple-choice questions will be analyzed as quantitative data, where numbers represent the answers given in a somewhat objective manner. The drawing questions will first be analyzed as qualitative data, which is subjective in nature.
For example, we can ask someone “Is the Sun a star?” with the choice to answer “Yes” or “No”. We can assign a “1” to “Yes” (the correct answer) and a “0” to “No”. Once we set that rule, no matter who analyzes the data, they know what a 1 and a 0 mean. 1 means they got it right, 0 means they got it wrong.
With our qualitative data, we have drawings. For one set of drawings we asked them to draw the Milky Way galaxy and label the location of the Sun. In order to determine if the person placed the Sun in the correct location we have to judge a number of things about the drawing. We first have to judge whether they drew the galaxy accurately enough to allow us to denote a location. Then, if so, we have to judge whether the location they gave was correct or not. We have a general idea of where the Sun is in our galaxy. But that position, as shown on a drawing, can change based on the viewing angle of the galaxy.
We are interested in this question because it is something talked about in the 3D film they watch. Also, it is a very spatial concept to understand – because we only see the galaxy from within. So how do we know where the Sun is in relation to the rest of the galaxy? It requires some 3D thinking about how a spiral galaxy would look from the inside, something the film attempts to convey using stereoscopy.
Here are some pictures of the galaxy with the Sun drawn or labeled on it by three adult study participants:
For differing reasons for each of those drawings, someone could argue that it is correct and someone could argue that it is wrong. How can we analyze data when we can’t even agree if the answer is right or not?
In education research, the typical procedure is to have multiple people rate (grade) a set of test drawings using a common set of instructions (rubric). You then compare the results and see if they are the same – this is called inter-rater reliability. If they are the same, then you do that for all the rest of the drawings. If they are not the same, then you talk about why each person rated it differently, adjust the rubric to be more precise, and start again with a new set of data. Once the inter-rater reliability reaches a certain level, you stop the iterative process and begin rating the full, real data set. The point that you stop at can change based on the person and project – in general it depends on the research question. You almost never get 100% reliability. In general, I use a statistic called kappa (which is the agreement between raters, corrected for the agreement expected by chance) and if it is .7 or .8 or higher, we’re good.
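For readers who want to see the arithmetic behind kappa, here is a minimal sketch in Python. It assumes two raters and uses Cohen's version of the statistic, (observed agreement - chance agreement) / (1 - chance agreement). The function name and the ten example codes are made up for illustration; they are not our rubric or our data.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who coded the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same code by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: for each code, multiply the two raters' usage rates, then sum.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_chance = sum(counts_a[code] * counts_b[code] for code in counts_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical code for every item
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical codes for ten drawings (illustration only, not study data).
rater_1 = ["correct", "correct", "incorrect", "incorrect", "correct",
           "incorrect", "correct", "correct", "incorrect", "correct"]
rater_2 = ["correct", "correct", "incorrect", "correct", "correct",
           "incorrect", "correct", "incorrect", "incorrect", "correct"]

kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

In this made-up example the raters agree on 8 of 10 drawings, but about half of that agreement would be expected by chance, so kappa comes out well below the raw 80% figure. That gap is the reason to report kappa instead of simple percent agreement.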
That is the stage we are at now. We are almost done finishing six rubrics for the two studies. We have about 2,300 drawings that need to be analyzed. So we’ll create almost 15,000 ratings when we’re done. It will be an amazing data set. But it will take a long time to analyze. We’ll talk about that next…
So this has been a pretty wild summer. Data collection on both projects is at full blast. For the Boston MoS study, we currently have around 275 study participants. For the Adler Planetarium study, we have about 876. We are well on pace for the Adler study and should meet our goal of 1000 by the end of August (we will likely exceed it!).
But we probably won’t reach our goal of 400 for the Boston MoS study. This is despite the fact that we extended data collection more than 3 months longer than originally planned and the fact that the MoS is graciously donating some of their own intern time to the project. Recruitment is going fine, it just takes a long time to process a family through the kiosk and we have limited hours on the floor. However, this will likely be OK. Looking at the stats, it looks like this will be enough to answer our core research questions about 3D vs. 2D representations.
I have spent a lot of time with the data in the last month, with some help from experts who are also part of the project, and results are starting to form up. Our original plan was to devote year 3 to analysis of the data (which begins in October). I think we’ll probably have some results for this study by the end of December and hope to submit a paper by then. The rest of the year will be spent on the second study at the Adler.
All summer we had hoped to submit a paper to an education conference called AERA. However, we just didn’t have enough data or time by their July 22 deadline (I think their deadlines are way too early – the conference is not until April). We were going to submit to another conference – NARST (which is my favorite anyway) on August 15. However, I need to present a work-related paper there and they only allow one primary-authored publication. So that won’t “work”. Now we’re looking at the ICLS conference, which is due in November. So the timing is right. And that’s also one of my favorite conferences to attend.
This is what happens when you are rounding 3rd on a research project (it’s the dog days of August and pennant races are afoot - allow me my baseball metaphors!). Now deadlines mix with data to create a weird matrix of what-has-to-be-done-when. At one point I was burning the midnight oil trying to make this work. Then it struck me – I have an entire year to do the analysis and publication for this project. Why am I in such a hurry? Let’s slow down, do it right and actually enjoy it. The best part of a project is the first time you see the data loaded into SPSS/your stats software of choice. Let’s make it last.
By the end of this month data collection should be done with both projects. That will be a huge stress relief. There is a lot of busywork and (for a worry wart like me) stress that goes into running data collection on two studies simultaneously and remotely. At work I manage 4 researchers in the same type of work – but I’m there every day to interact with them. Being far away is a little nerve-racking. Am I being too hands-off? Or am I micromanaging? Do they want more input from me or do they want me to back off? Am I being too much of a cheerleader or too pessimistic? One thing is almost definite – I’m overthinking it. :)
Once the data is in my hands (and the hands of my Co-PIs) then the fun begins. But the hard part is by no means over. I’ll explain why next…
The Adler study has begun active data collection. The study uses software run on iPads. The software was expertly produced by Clockwork Active Media Systems and funded by the National Science Foundation grant. We have decided to release the Objective-C source code under the GNU Affero General Public License, an open source license. You can download the software and learn more via its GitHub page:
This raises the question: Why would you want to use software that is so uniquely designed for this project? As a whole, you probably don’t. However, there are pieces of the code that are somewhat unique that someone may want to use for their own software and/or study. So here is a description of how the software works with some pointers at interesting pieces.
The program opens with an admin page for the researcher. This is where they tell the software which of the two films the audience is about to see and whether it will be in 2D or 3D. After that, it moves on to the test – which is customized based on which film they are watching.
The first test page asks some demographic questions. There is nothing interesting here. Then the user is given 5 spatial cognition tasks taken from the Purdue Visualization of Rotations Test. Note the tests themselves are NOT in the public domain. However, our experience is that the author is extremely helpful and was willing to let us use the test with minimum fuss. In return, we will share our results with him.
After those test items are done, the user is given a test item in a format that begins with a question and then lets them click on one of four images to answer it. Then they are asked about their confidence in the answer. That is followed by two multiple choice text questions. The final question is a drawing task, where the user draws with their fingers.
When that is done, the app pauses for five minutes so the user can watch the film. This happens to prevent people from skipping ahead. When the film is over, the app is ready for the post-test. This test is exactly the same as the one before, except without the demographic and spatial cognition questions.
When they are done, the software uploads the data to a server (also available in the source code). The server parses it into an XML file that the researcher downloads later for analysis. If the data upload does not succeed for any reason, the border of the screen turns red, thereby informing the researcher. It will automatically try again after a few minutes and eventually turn blue upon success.
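To make the upload-and-retry behavior concrete, here is a rough sketch of the logic in Python. This is not the project's code (the app itself is Objective-C and the server is PHP). The function name, the three-minute wait, the attempt limit and the simulated flaky server are all assumptions made for illustration.

```python
import time

def upload_with_retry(upload, payload, retry_seconds=180, max_attempts=5,
                      sleep=time.sleep):
    """Try an upload until it succeeds; return (border_color, attempts_used).

    `upload` is any callable that raises OSError (for example ConnectionError)
    on failure. The retry interval and attempt limit are assumed values.
    """
    border = "none"
    for attempt in range(1, max_attempts + 1):
        try:
            upload(payload)
        except OSError:
            border = "red"  # tell the researcher the upload has not gone through
            if attempt < max_attempts:
                sleep(retry_seconds)  # wait a few minutes, then try again
        else:
            return "blue", attempt  # success: border turns blue
    return border, max_attempts

# Simulated server that is unreachable for the first two attempts.
attempts_seen = []

def flaky_upload(payload):
    attempts_seen.append(payload)
    if len(attempts_seen) < 3:
        raise ConnectionError("server unreachable")

# Pass a no-op sleep so the demonstration does not actually wait.
result = upload_with_retry(flaky_upload, "<session/>", sleep=lambda seconds: None)
print(result)
```

Passing the upload and sleep functions in as arguments keeps the retry logic separate from the network code, which makes it easy to test without a real server.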
While all this is happening, the software is recording the time of completion for each task and item. It is also recording the accelerometers in X,Y,Z coordinates based on time. We will analyze that data to see if there is a relationship between how the user held the device and the results.
So, some interesting things that may be of use to other researchers/coders are: 1. The client/server relationship 2. The accelerometer archiving and 3. The drawing GUI (which I think is very intuitive and simple – nice job, Clockwork!). Of course, the entire framework may be of use to anyone who wants to do a pre/post test of any type of short-timed intervention.
Once the Adler study is finished, we will also release the two films into the public domain using a Creative Commons license. Expect that to occur in the Fall.
A screen shot of the drawing task:
The intermission screen:
The multiple choice image format:
The source repository contains the full Xcode project for Two Eyes, 3D, as well as the PHP web services for uploading quiz data, listing it, and updating the quiz remotely.
A word about Clockwork: They were terrific to work with. I’m relatively new to the world of professionally coded software and it was a treat to work with them. They were patient, professional, courteous, organized and very, very smart. I highly recommend them to any researchers needing expert coding help from a professional group.
The second film is almost done. The crack team at the Adler Space Visualization Lab have done a great job. These two films will be shown at the Adler throughout the summer on almost every weekday – free to visitors. We also plan to eventually place them on YouTube (in both 2D and 3D form) as well. And they will be released into the Creative Commons so almost anyone can use/edit/play with them as they wish.
In about two weeks official data collection will begin at the Adler. Last night I tested the last version of the iPad testing software, created by a similarly crack team at Clockwork Active Media Systems. That too will be released using an open source license and likely posted to SourceForge in the next month or so.
Also in two weeks our RA at the Boston Museum of Science will be resigning to spend summer with her two boys. We are very grateful for her help this year. If anyone is looking for a part-time research assistant in the Boston area for the next year, please contact me and I’ll put you in touch. I cannot recommend her enough. A new intern from the Museum will be taking over data collection duties, which we expect to run through the end of July. Right now we have about 275 of the 400 test responses we planned for.
This summer is really where the wheels meet the road. Lots of data collection and various items to be juggled. But when it is done, the data will be here and the fun part begins – analysis and results.
This is a screen shot from the 2nd film, which is about the shape of the Milky Way galaxy. This is from a scene that describes a Native American story of the Milky Way being a path of animals across the sky.
The following shot is from a scene that describes how some early models of the galaxy involved spheres of stars to explain the band of light we see in the night sky. This will be shown stereoscopically, which we hope will help explain how spheres can be oriented to appear as bands of light when seen from far away.
The project has just passed its half-way mark. We are cruising through our data collection phase. This summer will be very busy for the project and critical towards its success.
The Boston Museum of Science study continues to chug along smoothly. The Research Assistant works two shifts per week with the MoS staff out on the Museum floor collecting data from children viewing the 2D vs. 3D slides. Our goal is to reach 400 test subjects. At the current rate, we should reach that in July. However, it is possible that recruitment will pick up in the summer as the Museum becomes busier.
Once we have the data, I’ll begin crunching it quickly. I’ll work closely with one of my Co-PIs, a professor at University of California, Santa Cruz, to prepare a set of mini papers and presentations to educational research conferences next year. Most of those have submission deadlines late this summer, so we won’t have much time to play with the data. It will be all work. Luckily, my family and I are moving to an apartment next to a beach on Lake Michigan in Chicago this summer. So I hope to set up shop with the laptop and an umbrella, so as not to miss the summer sunshine.
The Adler study is a little behind schedule, but it should also end up finishing on time in the end. Our narrator just finished recording her tracks and the production team is putting the final touches on the films. We plan to begin showing the two films at the Adler Planetarium in their Space Visualization Lab during the 2nd week of June. Stop by if you want to participate! The plan is to do 3 screenings per day, five days a week. If we are able to get 20 people per screening then we should reach our data collection goal for this project in mid July. We are scheduled to work until the end of August. So it gives us some room for error (sick days, slow days, equipment trouble, etc.). If we do happen to reach the goal early, then we have a month to get extra data and/or run an entirely new, bonus study.
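The scheduling arithmetic above is easy to play with in a few lines of Python. This is only a back-of-the-envelope sketch: the goal of 1000 participants is taken as an assumption here, and the function name is made up for illustration.

```python
def weeks_to_goal(goal, screenings_per_day, days_per_week, people_per_screening):
    """Weeks of screenings needed to reach a participant goal at a steady rate."""
    per_week = screenings_per_day * days_per_week * people_per_screening
    return goal / per_week

GOAL = 1000  # assumed participant goal for the Adler study

full_houses = weeks_to_goal(GOAL, 3, 5, 20)   # the plan: 20 people per screening
thin_crowds = weeks_to_goal(GOAL, 3, 5, 15)   # a slower, hypothetical turnout

print(f"20 per screening: {full_houses:.1f} weeks")
print(f"15 per screening: {thin_crowds:.1f} weeks")
```

Changing the people-per-screening number shows how quickly the finish date slides when turnout drops, which is why the extra weeks before the end of August matter.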
Basically, we should be done with all data collection by the end of the summer. We have to be really, since our data collection funding runs out then. That gives us an entire year to analyze results and write papers. For me, that is the fun part. Right now we plan on at least three papers for major science education research journals. I expect we’ll end up with many more, and that doesn’t count conference presentations. At this point, no one has studied this – yet development in 3D in the classroom and informal settings continues at a fast pace. There continues to be a need for this information.
A screen shot from a draft of the film about the Milky Way galaxy.
This is a report on the 2nd study of this project, taking place at the Adler Planetarium.
The first film, about Type Ia supernovae, is just about in the can. It runs about 7-8 minutes in length. I think it is a little dense on the content, but we slowed down the narration to compensate and some small test screenings have gone well. The second film, about galaxy morphology, is currently being written. The plan is to storyboard early next month and develop the film for pilot testing in April. If all goes well, both films will be airing to the public in the summer - and we’ll be collecting data the entire time.
We have looked at the pilot data of the first film and test sessions. Because of technical reasons, we could only show the 3D film so we don’t have any 2D vs. 3D comparison data. Of the data we have, we don’t find any relationships with correct answers and the spatial cognition scores (nor for gender or age). But that’s not our core question, which is whether those scores are related to the difference in correct answers between 2D and 3D. Plus, our sample size was tiny (N=33). The pilot was mainly about testing the software, procedures and the test items.
We had four test items - two multiple choice, followed by “Explain Your Choice:” text boxes and two “draw and label” questions. We’ve opted to take the “explain” answers and use them to derive 4-5 options for answers, which will replace the prior multiple choice options. The reason for this is that it will lower the time it takes to take the pre- and post-test. Right now each test takes about 15 minutes, meaning 30 minutes for the whole day. That’s unacceptable for an audience that is attending the planetarium for fun! We need to get it down to 8-10 minutes. As a rule, I hate multiple choice tests. I’ll save that rant for another time. But one way to make them slightly more palatable is to use open-ended (“explain your answer”) data to create multiple choice options that reflect authentic thinking and are not artificially generated by an outside person. So I think we have a good compromise here.
As for the drawing questions, we decided to drop one of them. Looking at the data, it seems that there is a greater difference between the pre- and post-test answers on one of the items than the other. This implies it may be more sensitive to differences in groups, so we decided to keep it. Here is a sample pre- and post-test drawing made by one of the pilot participants:
The item asked them to draw two white dwarfs merging. The first drawing shows two stars with a black hole in the middle. It is interesting that the person’s prior knowledge made them think there is a black hole in the middle of two merging stars (which is a very complicated concept to decipher). The post test still has the black hole, but now has lines to indicate an explosion, or greater luminosity. It also has arrows showing momentum. Are those lines orbital paths of two stars? Or do they represent the surface of two stars? The answer is important as it changes the interpretation of the arrows, which could indicate orbital motion or a spinning sphere or disc. This is why drawing questions are so tough to score. They can often reveal much more nuanced understanding by the participant, but scoring them requires sense making by the scorer (a.k.a. “grader”), thus introducing a source of noise. In education research, I fall into the “mixed methods” camp - which states that the best research uses both qualitative and quantitative methods. Hence why this study has both traditional test questions and these drawing tasks.
The recruitment challenges we reported on recently with the Museum of Science (MoS) have been largely addressed, thanks to near-Herculean efforts by their staff. They gave us a new location on the Museum floor that is open to the general public. Since then, recruitment has been much easier and the ages of the participants have increased.
I also have a glimpse at some (very) early results. Recall that in this study children aged 5-12 are shown pictures of scientific concepts and objects. The pictures are chosen randomly to be in either 2D or 3D/stereoscopic. They are then asked two questions about it. One is a specific question about a spatial property of the object (Ex: Which is closer to the camera, the tree on the left or the right?). The second is a question about the implications of the first question (Ex: If you were in a hurry to run to a tree, which would you run towards?).
Below is the graph of results of two questions we have about a picture of a bee. The first question is about the shape of the bee’s tongue. The second is a question about how the bee uses the tongue to sip nectar from different shaped flowers.
The results show that there is no difference in the number of “correct” answers to the first question. However, there is a difference in the number of “correct” answers to the second question – which is the one about implications. This could mean that 3D/stereoscopy doesn’t have much effect on immediate image processing but does have a larger effect on interpretation of the image. The sample size of this graph is enough to make this result statistically significant at the .01 level. However, the sample is still only about 1/6 as large as what we hope our final sample size will be. So this result could easily change. Also, it’s only one of 15 similar images we are testing. So this by no means would qualify as a result on its own. But it’s a hint of what could be and is an illustration of how we intend to use this data.
Data collection continues at the Boston MoS through the spring. We hope to have it wrapped up this May and then can get our hands dirty with the analysis. We’ll probably present some type of early result to the American Astronomical Society meeting in Indianapolis in June and hopefully at either the National Association of Research in Science Teaching (NARST) or American Educational Research Association (AERA) meetings in spring 2014.
The spatial cognition test as part of the Museum of Science study is based on the Purdue Visualization of Rotations Test. It is a 20 item test, of which we randomly chose 5 for this study (we don’t have time to do all 20). So far, we have run the test on 17 children. Below are the results:
Item #1: 47% got it correct
Item #2: 18% got it correct
Item #3: 0% got it correct
Item #4: 18% got it correct
Item #5: 29% got it correct
Each item on the test has five possible answers. So if the children guess at each item they have a 20% chance of getting it correct. You can see that the results for items 2 and 4 are close to what chance alone would give. We don’t need a statistical test to see that. But items 1, 3 and 5 are offset from 20% by different amounts. So can we trust those results as being legit? That is, is the result due to actual measurement of the spatial ability of the child and not due to chance? 47% is pretty high, so my gut says that’s a real result. 29% is not that far from 20%, so I’m not so sure about that. And the 0% result from item #3 is in the middle.
So I wanted to run a statistical test called a t-test. It’s one of the most basic statistical operations one can run and is often taught in the first week of Stats-101. The t-test assumes that our data follows a pattern seen commonly in nature known as the Normal Distribution. To run it, you need to first check your data to see if it meets this qualification. Below is a plot of the responses to items on the test. The curved line is an idealized version of the Normal Distribution.
Our data follows the distribution relatively well. So I ran the test. Results are often reported in terms of p-values. The most common threshold used is p=.05. That means that if only chance were at work, a result at least this extreme would turn up about 5% of the time. Statistically, in the social sciences a value of .05 or below is often reported as “significant”.
Our tests resulted in the following values:
Item #1: p=.002
Item #2: p=.083
Item #3: n/a because no one got a correct answer!
Item #4: p=.083
Item #5: p=.020
So this tells me that items 1 and 5 may indeed be measuring something in the children. Items 2 and 4 cannot be differentiated from chance, so right now those items are not very helpful. Item #3 was likely too difficult for this test.
This is what pilot testing is all about. I’ve decided to toss out items 2-4 and replace them with new ones from the pool of items in the Purdue test item bank. I took the new ones from the front of the item bank, which conceivably means they will be easier. We’ll analyze the data again after about 15 more children have taken the test. Eventually, we hope to have items that work for all children. Then we start to collect real data.
Caveat: In real life *much* more goes into the choice of items than just sensitivity to the population. I’m focusing only on one aspect here to act as an illustration of how the pilot test process works. Also, a sample size of 17 is hardly good enough to run a t-test on real data. That is, I’d never be able to publish such a result! :) But, again, for our purposes it works fine as an example of how this process works.
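For a sample this small, one alternative way to ask "is this different from guessing?" is an exact binomial test against the 20% guessing rate. The sketch below is not the analysis I ran above, and its p-values will not match the t-test values because it is a different test. The counts are inferred from the percentages reported above for 17 children (47% is about 8 of 17, 18% is about 3, 29% is about 5), and the function names are made up for illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k correct answers out of n if each is a guess."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def p_at_least(k, n, p):
    """One-sided p-value: chance of k or more correct by guessing alone."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))

def p_at_most(k, n, p):
    """One-sided p-value: chance of k or fewer correct by guessing alone."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))

N_CHILDREN = 17
CHANCE = 0.20  # five answer options per item

# Correct counts inferred from the reported percentages (an assumption).
correct_counts = {1: 8, 2: 3, 3: 0, 4: 3, 5: 5}

for item, k in correct_counts.items():
    above = p_at_least(k, N_CHILDREN, CHANCE)
    below = p_at_most(k, N_CHILDREN, CHANCE)
    print(f"Item #{item}: {k}/{N_CHILDREN} correct, "
          f"P(at least this many) = {above:.3f}, "
          f"P(at most this many) = {below:.3f}")
```

The upper-tail number answers "are the children doing better than guessing?" and the lower-tail number answers "are they doing worse than guessing?", which is the relevant question for an item like #3 where nobody got it right.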
I spent the last week in September at the Museum of Science (MoS) in Boston. The goal of that trip was to set up and pilot test our stereoscopic kiosk for use for our second study. That study is about how children perceive spatial information in 2D vs. 3D images. The week was productive, but not as much as I anticipated. The biggest challenge occurred the day I walked in the door - our research location had been changed!
I wrote this grant two years ago. It took about 9 months to be judged and awarded by the NSF. The project officially began October 1, 2011 and this last year has been spent building things. During that time the rest of the world wasn’t going to stop and wait for us. When we wrote the proposal, the MoS had a space where I could do research on the museum floor with children 7-12. However, that space has since been destroyed! The MoS has begun a major renovation to install a new permanent, cutting edge exhibit called the Hall of Life. So now the only space they have for me is in their Discovery Center, a place that is restricted to children up to age 8!
The Discovery Center is a great, fun place with helpful, smart staff and enthusiastic parents. The problem is the children are much younger than what my study was designed for. All of my test questions were written for children with some experience in science. Indeed, the entire design of the test was predicated on the ability of the child to be able to read. So this was, and still is, a big problem.
There is no other place in the museum for us. So we tried to adjust the test to make it age appropriate. We simplified some of the test questions. We changed the test protocol so that our research assistant is now reading the questions to the child and pointing to the screen, instead of the child doing it on their own.
But one thing we cannot change is the spatial cognition pre-test. Recall this is a critical part of our research that is used to measure prior spatial ability of the child. This is important because our ultimate research question is how spatial ability is related to how children perceive 2D vs 3D images. I think our spatial test is going to be too hard for most children and give us a floor effect in the data. There is no easy solution as there are no spatial visualization tests for young children that do not involve a human proctor and a significant amount of time (our test session time is limited to 15 minutes by museum rules). For now, we’re using this test with the plan to look at the data and see how sensitive the test will be. I’ll have a report in a few days.
This is just one example of how the age difference, while minor, has major implications for this study. There are lots of other ways in which our fundamental research questions had to be changed due to this limitation. Our “semester” at the MoS ends in December. At that point we’ll look at all our data and decide whether we can continue with this plan or whether we need to make other adjustments.
It’s an example of what happens in a real world research experiment. Unlike laboratory exercises, we have many additional sources of complications. We have to be flexible and less rigid in our design. The upside is our experiment is happening in a slightly more authentic environment (than a lab), so we’ll be able to say a little more about the results. And, ultimately, the goal is to have results that are applicable to real life. But the lab vs. in situ debate will have to wait for another blog post.
Last week we pilot tested the iPad assessment software and items for the first film at the Adler Planetarium. We offered free guest admission ($12 value) in return for watching the film and taking the pre- and post-tests. We did 3 showings of the film for a total audience of around 45.
The film is about 75% completed. Stock footage is used in a few places and some 2D/3D modeling work still needs to be done. Some of the scenes are taking over a week to render on their computers!
Overall, the process piloted well. People were more used to iPads than I had expected (we only had to help a couple of people) and no one complained about the length or complexity of the assessment items. The coordination of taking both tests and uploading the data was virtually flawless. We learned some small things (such as how stacking iPads with smart covers can make them all remain in an “on” state - thus draining batteries), but nothing major. I think, from a user perspective, they are ready for prime time deployment as assessment devices for the general public.
I have not looked at the data yet, other than to confirm it is there. So I don’t have any item statistics yet. I’ll do that in a few weeks and will post a summary of results. However, I can already tell that we will need to time the tests. Some people flew through it in a few minutes. Some others took 15 minutes and never moved beyond the 6-item spatial cognition pre-test!
The most difficult part of recruitment was that school has started for many so attendance at the planetarium wasn’t very high. Most people, when invited, were interested and we received only one response that we’d classify as “rude”. :) The most common reason people declined to participate was time (they were only going to be at the planetarium for a limited time) and the second reason was language (lots of international tourist visitors).
During the final showing the audio of the film messed up and we had to pause a couple of times to resync it. That was frustrating, but it’s the reason we pilot test these things. The audience was supportive and some even said they found it interesting to watch the process and be a part of it.
Here is a photo of a few people taking the pre-test.
It’s been two months since the last update. Since then I’ve changed jobs and moved from Boston to Chicago. My new position is Manager of Research and Evaluation at the Museum of Science and Industry, Chicago. I’m still running this project, though, as a contractor to the AAVSO. Moving a family 2000 miles slowed some of the work over the summer, but it didn’t stop it. Our app has been submitted to Apple for publication in their App Store. Our data collection shifts at the Boston Museum of Science have been scheduled. Also, we are two weeks away from pilot testing our first film at the Adler Planetarium with the public. Data collection for both studies begins in October! And there is much more, so there should be lots of reports coming fast and furious in the next few months.
Our 2nd study is going to consist of a test given to people before and after they watch a 2D or 3D film about supernovae or galaxy evolution. The test will be given on an iPad. We chose that technology (as opposed to traditional paper on clipboards) because:
We hired a contractor, Clockwork Active Media, to code the software and they are great to work with. They are very professional, talented and full of initiative. I highly recommend them for any of your research instrumentation needs.
Below is a mockup of some of the items. We plan to pilot test them this August at the Adler Planetarium in Chicago. The final software will be released via an open source license (probably via SourceForge) in a year or two. So if you are interested in using iPads in your research, let us know! We can likely give you an early copy of the code if needed.