Archive for December, 2011
- Wrote blog post about performance analysis work (results, what I did, etc.)
- Ran a couple of new experiments (played with a distributed mongrel cluster setup) and finished up experiment-repeats.
- Updated result logs on github.
- Did a screencast showing a load test in action.
- Some code review
- None (I’m done for the term)
- Fixed the inability of assignment graders and groups to download/upload CSV files. (Issues #583 & #584)
- Implement requested code changes for CSV multiple encoding fix.
- Break up the CSV multiple encoding fix into separate feature branches for easier reviewing and merging.
- Fixed up “add new” graders code on Severin’s advice (waiting for review: 1137)
- Worked out 3/4 of the remaining group bugs (waiting for review: 1151)
- Get those last 2 bugfixes done, reviewed, and shipped
- Could use some advice on an issue with review 1151, not sure where the problem lies.
- Finished up work on issue 528
- Posted the issue referring to remote_functions on github
- Get the final commit for issue 528 done
- Add the tests Severin was referring to in the review for issue 528
- Could use some advice on how to do the tests that were requested on issue 528
Here is a screencast of a running MarkUs load test (ogv). Enjoy!
This term I’ve been working on analyzing MarkUs’ performance under load. The goal was to investigate whether MarkUs’ performance decreases significantly under certain circumstances and, if it does, to find out what causes the problem. In particular, I simulated a scenario where students work alone on an assignment. Moreover, I looked at the extreme case where no Subversion repositories existed for any student.
At this point I can say that yes, there can be a performance problem. When too many students try to submit files via MarkUs at the same time, performance deteriorates. A user would notice this as very long response times. In extreme cases, response times of 20 seconds for a single request (not hit) have been observed. The root cause of the drastic increase in response times seems to be IO related. Where are these IO requests coming from? I’m still not 100% certain, but it looks like they are related to the creation of Subversion repositories. MarkUs stores students’ submissions in Subversion repositories. If students work alone on an assignment and no repository exists yet when a student logs in for the first time, MarkUs creates a Subversion repository for that student when the student interface URL for this assignment is first visited. Also note that every submission results in calls to Subversion via the Subversion Ruby bindings from within MarkUs.
In general, the higher the ratio of simultaneous requests to the number of mongrels (or Passenger workers), the slower the overall response times. Response times vary only slightly when the number of students in a course changes.
As mentioned earlier, these are results of simulating a scenario where students work alone on an assignment and have never logged in previously. Student requests have been simulated by running the post_submissions.sh script on client machines. Scripts which I’ve been using are available in this review and should be available in the official MarkUs repository at some point later. Only PostgreSQL has been used as the DB backend. I don’t believe changing this to MySQL will yield much different results. Apache httpd has been used as the front-end webserver reverse-proxying to individual Mongrel servers. For some experiments Phusion Passenger has been used instead of Mongrel. The difference in performance between the two deployment platforms was fairly insignificant for the experiments performed (considering that Passenger uses 6 Ruby workers by default and comparing it to a similar setup with 6 Mongrel servers). For this analysis MarkUs version 0.10.0 has been used (on Rails 2). I don’t anticipate huge differences between a Rails 2 and Rails 3 based MarkUs. Details about the lab setup I’ve been using were described in the first blog post of this series.
To get a more detailed view of what was going on on the MarkUs server machine while each experiment was running, the following tools were used: top, iotop, iostat, oprofile, request-log-analyzer and some hand-crafted scripts. OProfile data was inconclusive (or I was perhaps using it wrong). Example profiling output is available here. Top reported load averages of 3-18 and up to 50% (avg ~30%) IO wait with 20-60% user CPU utilization. Lower IO wait percentages were observed towards the end of each experiment and when 12 mongrels were used. iotop reported Linux’s Ext4 journaling daemon as the top IO consumer, closely followed by Ruby’s logger writing to the production.log file of each Mongrel.
Here is the list of performed experiments:
| Exp. # | # Stud. | # Mon. | # Cli. | # R.p.Cli | Sim. cc. Subm. |
|---|---|---|---|---|---|
| P1 | 800 | equiv of 6 | 8 | 4 | 32 |
| P2 | 832 | equiv of 6 | 8 | 8 | 64 |
| P3 | 800 | equiv of 6 | 8 | 4 | 32 |
| P4 | 800 | equiv of 6 | 8 | 4 | 32 |
Exp. # is the experiment identifier, # Stud. is the number of students, # Mon. is the number of Mongrel servers, # Cli. is the number of client machines used (where the post_submissions.sh script was executed # R.p.Cli times), # R.p.Cli is the number of post_submissions.sh calls per client machine (i.e. one client machine simulated up to 18 students) and Sim. cc. Subm. is the number of simulated concurrent submissions (= # R.p.Cli x # Cli.).
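As a quick sanity check, the last column of the table can be recomputed from the two client columns. The following Python sketch is only an illustration: the `Experiment` record and its field names are made up here, and the rows are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Experiment:
    """One row of the experiment table (field names are illustrative)."""
    name: str
    students: int                # "# Stud."
    clients: int                 # "# Cli."
    runs_per_client: int         # "# R.p.Cli"
    concurrent_submissions: int  # "Sim. cc. Subm." as listed in the table

    def computed_concurrency(self) -> int:
        # Simulated concurrent submissions = runs per client x clients.
        return self.runs_per_client * self.clients


# Rows copied from the table of "P" experiments.
EXPERIMENTS = [
    Experiment("P1", 800, 8, 4, 32),
    Experiment("P2", 832, 8, 8, 64),
    Experiment("P3", 800, 8, 4, 32),
    Experiment("P4", 800, 8, 4, 32),
]

for exp in EXPERIMENTS:
    # The listed value should match the product of the two client columns.
    assert exp.computed_concurrency() == exp.concurrent_submissions
    print(exp.name, exp.computed_concurrency())
```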
All “P” experiments were performed using Phusion Passenger; all “M” experiments were Mongrel based. M13-M15 vary only the number of students in a class. M12-M20 were basically a repeat of experiments M1-M9. M10-M20 had configuration in place to make each Mongrel log to its own copy of production.log. M1-M9 shared one production.log, and those logs were useless due to interleaved log output. P3 and P4 are interesting. For P3, almost no SVN interaction was achieved by running the same experiment twice without deleting the repositories and dropping the database of the previous run. Because of this, submissions were not accepted: one would have to explicitly replace files, as opposed to resubmitting them, in order to get them accepted by MarkUs. This is expected behaviour. Hence, this repeated submission resulted in no changed submissions being recorded (i.e. almost no SVN interaction). P4 is an experiment with students’ repositories created prior to the actual run of the experiment. However, the repositories were empty, so submissions as issued by request #7 were recorded.
The requests/URL mapping is shown in the following table:
These are the requests (in order) each call to post_submissions.sh performs: get the log-in page, log in (POST), follow the resulting 2 redirects to the student’s dashboard, go to the student interface of the first assignment, open the file manager (Submissions link), and submit files (POST).
Raw logs and tables are available in a Git repository which I’ve created for this purpose. Logs have been analyzed by using the elapsed time (as reported by /usr/bin/time) per request on client machines. Sanity checks have been performed by also analyzing server-side logs (production.log) via request-log-analyzer. Server-side generated and client-side generated response time numbers matched with a small margin of error.
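The per-request averaging described above could look roughly like the following Python sketch. The input format (one `request_number elapsed_seconds` pair per line) and the sample values are assumptions made for illustration; the actual logs in the repository may be laid out differently.

```python
from collections import defaultdict
from statistics import mean


def average_per_request(lines):
    """Return {request_number: mean elapsed seconds} for the given log lines."""
    samples = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        request_no, elapsed = line.split()
        samples[int(request_no)].append(float(elapsed))
    # Sort by request number so the output follows the request order.
    return {req: mean(values) for req, values in sorted(samples.items())}


# Made-up sample data, NOT measurements from the experiments.
SAMPLE_LOG = """\
1 0.5
5 3.0
7 2.0
1 1.5
5 5.0
7 4.0
"""

averages = average_per_request(SAMPLE_LOG.splitlines())
for request_no, avg in averages.items():
    print(f"request {request_no}: {avg:.2f}s")
```

The same function could be fed server-side numbers (for example, durations extracted via request-log-analyzer) to cross-check the client-side averages.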
The above graph illustrates that the response time grows fairly slowly as the number of students in a course increases: going from 400 to 1400 students (3.5 times as many) raises the response time from 2.2 to 2.8 seconds per request, an increase of roughly 27%.
The above two graphs try to show whether there is a correlation between the ratio of simultaneous requests to the number of mongrels and the average response time. In general, the more overloaded a single Mongrel is, the slower the overall response times. More mongrels may bring the response times down a little, but not by as much as one would have hoped (see M19 and M9). There may be some performance gain if mongrels run on different machines than the Apache reverse proxy and the PostgreSQL server. In most experiments these ran on one machine. M20 had 6 mongrels running on the main server and 6 mongrels on a different machine. Perhaps more gain could be achieved if the database server is on one machine, the reverse proxy on another, and the mongrels distributed among a set of other machines sharing the file system that contains the Subversion repositories.
This graph shows the average response times per request. Please refer to the table above to see which request number maps to which URL. The numbers in the legend are the total numbers of simulated concurrent requests. We now take a closer look at the experiment labelled 32 in the above graph (experiment P4). Note that 32 should be compared to 14, because the number of simultaneous requests alone is not what matters. The number of mongrels running on the server is a factor as well. Thus, the ratio of concurrent requests to the number of mongrels seems to be a good heuristic for comparing experiments. Note that this ratio is closer between 14 and 32 than between 32 and 35 (see graph below). Since experiment P4 had Subversion repositories already created prior to the experiment, it is not surprising to see that the bump at request 5 is absent.
This is the exact same graph as the preceding one; the only difference is the labelling. Instead of the number of concurrent requests, it shows the ratio of the number of concurrent requests to the number of mongrels.
Conclusion and Future Work
So what are the lessons learned?
- Under heavy load and in a poor setup, response times of 20 seconds and more can happen.
- Adding mongrels is fairly cheap and may bring some performance gain.
- Distributing mongrels among a set of application servers may improve performance even more.
- Subversion interactions are expensive. I recommend getting students to log in and have a look at the assignment (if it’s a single student assignment and the SVN repository has not yet been created) at some off-peak time in order to reduce IO when the deadline of an assignment is approaching.
- Logging to production.log may be a source of IO on the system.
- 12 mongrels seem to perform better than 6. The performance gain from 3 to 6 mongrels is less significant. I’m not sure why…
- I recommend that users estimate the expected number of concurrent submissions based on the number of students in the course and historical data. Based on the expected number of concurrent submissions at peak time, try to keep the ratio of concurrent submissions to the number of mongrels at < 5.
- It is a good idea to add configuration to the environment so that mongrel instances log to separate log files. This way production log files can be used for further analysis with regards to performance bottlenecks.
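The sizing rule of thumb from the list above (keep the number of concurrent submissions per mongrel below 5) can be turned into a small calculation. The following Python sketch is illustrative only: the function names are hypothetical, and the threshold of 5 is just the heuristic suggested above, not a hard limit.

```python
def load_ratio(concurrent_submissions: int, mongrels: int) -> float:
    """Ratio of concurrent submissions to the number of mongrels."""
    if mongrels <= 0:
        raise ValueError("need at least one mongrel")
    return concurrent_submissions / mongrels


def min_mongrels(concurrent_submissions: int, max_ratio: float = 5.0) -> int:
    """Smallest number of mongrels that keeps the ratio strictly below max_ratio."""
    if concurrent_submissions < 0:
        raise ValueError("concurrent_submissions must not be negative")
    mongrels = 1
    # Add mongrels until the per-mongrel load drops below the threshold.
    while load_ratio(concurrent_submissions, mongrels) >= max_ratio:
        mongrels += 1
    return mongrels


# Example: 32 and 64 expected concurrent submissions at peak time.
for expected in (32, 64):
    needed = min_mongrels(expected)
    print(expected, needed, round(load_ratio(expected, needed), 2))
```

For 32 expected concurrent submissions this suggests 7 mongrels, since 6 mongrels would give a ratio of about 5.3, slightly above the suggested bound.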
Where to go from here? It would be interesting to see if a different version control system as a back-end would change any of the above results. Moreover, the assumption is that the IO is coming from Subversion, but what if it’s just simple logging, or logging plus Subversion plus IO from PostgreSQL? Perhaps there is some better way to inspect low-level IO. This may help with reasoning about where said IO is coming from. iotop and top should be a good start but may be too coarse-grained. I’d also be interested to see if a more distributed production setup would be capable of processing more concurrent requests in less time.
It’s been fun working on this project. Please do let me know (in the comments) if there is something I’ve missed or if you have other thoughts on this topic. Thanks!
- UCOSP is beginning to wrap up for the term.
- Did rounds as usual.
- Severin is redoing most experiments that were done throughout the term to collect logs while also having started on the final write up for his work for the term.
- Severin mentioned possible future work relating to the project he has been working on.
- Alex finished up all the changes to the working file uploads and just needs to set up all the changes for review.
- Alex still wants to fix up some routing errors related to file download and upload and intends to be done within two weeks.
- Luke has been waiting on some reviews and translations and intends to square away things before the holidays.
- Erik has started looking at the groups page bugs and hopes to get those fixed by the end of next week.
- Razvan has started working on adding functionality to the sections page and hopes to get things wrapped up by the end of next week.
- There will be a post-mortem wrap-up next Wednesday, 3-4.
- Received funding to bring someone from the current UCOSP team to Vancouver next term for the code sprint in order to help setup go more smoothly.
- Re-ran some experiments with a separate production.log per mongrel
- Got started on final wrap-up blog post for my performance analysis work.
- As before: do a screencast of running tests
- Finish up experiments repeat and blog post
- Other assignments
- Added the required functional test to grade_entry_form for the CSV multiple encoding fix. Issue 284.
- Ensure that my outstanding fixes are reviewed and shipped.
- No real work done due to end of term and finals.
- Waiting on review for issue 527 (Add delete action for sections)
- Waiting for translations on 501 (Decay penalty localization)
- Same as next steps
- Still waiting for somebody to review the fix for bug 439 (review id #1137)
- Working on the last of the group page issues (the buttons in the table are broken)
- Finish up the group related bugs
- Tried putting in a review for issue 528
- Looking for a couple more issues to do before the term ends
- Can’t submit a review due to a “UnicodeDecodeError” when running the submit script
- Started off by doing rounds.
- Luke put up a review fixing issue #527 (delete action for sections), waiting for it to be checked before continuing.
- Luke is planning to begin working on issue #501 (error messages for missing translations), as well as fix some spelling mistakes.
- Severin has been trying to get Passenger to use a separate production.log per Ruby worker.
- He has been unsuccessful so far because log messages are interleaved.
- Severin ran an experiment using a memory repository, which confirms that the performance bottleneck is SVN/file IO.
- Erik has published a blog post about Firebug and fixed issue #439 (admins cannot create groups).
- Luke suggested we close our reviews after they have been pulled in.
- Razvan was not able to work on MarkUs this week; he was busy with projects.
- Alex was able to add some functional tests to his multiple encoding file uploading fix.
- He ran into some trouble with the grades page. There seem to be errors in the file upload and the tests for this page are a little more complex than the previous pages.
- First he will try to figure out why the page is giving an error when a file is uploaded before figuring out the tests.
- Focus for next week is wrapping up issues and making sure the code is ready for next semester’s students.
Here is the current state of the MarkUs test framework. We will describe what is working for each view (Administrator, Grader and Student).
Create a new assignment with test
When we create a new assignment, we can enable the test framework. If we do so, we can upload a build.xml and build.properties, and add some test files (public or private).
This part is now working well, since the routing bugs have been fixed.
Due date passed
The “Release Marks” and “Unrelease Marks” buttons lead to a 404 error.
“Collect All Submissions” leads to a RecordNotFound error in SubmissionsController#collect_all_submissions.
For each submission, the “Collect Submission” link leads to a 404 error (issue https://github.com/MarkUsProject/Markus/issues/441).
We can’t test if the Admin can run the test because of the errors posted in our last post on this blog.
We assume the administrator has applied this configuration:
- Tests have been uploaded
- Tests are public
- Tokens are available
- The return date has passed
First, the student has to create a working group or indicate that he wants to work alone.
The Administrator assigns a Grader to the Student.
Then, the Grader is supposed to be able to use the testing framework:
- Click on one of the current assignments.
- “Can begin grading” is True
- On clicking a group name, the Marking State changes from “Not Collected” (a blue square) to “In progress” (a yellow pencil) and the confirmation message “The submission has been given collection priority. It will be ready soon.” appears on screen.
Then there are problems with collecting all assignments, applying tests or getting results:
- On clicking “Collect all your assigned submissions”, an error page appears: “ActiveRecord::RecordNotFound in SubmissionsController#collect_ta_submissions”.
- No test can be run and no results loaded.
- The button “see and comment the result details” is still unavailable.
We assume:
- Tests have been uploaded by the administrator on the administrator view
- Tests are public
- Tokens are available
- The return date has not passed
Steps to use the testing framework as a student:
First, the student has to create a working group or indicate that he wants to work alone.
Then, he can post his work (1. Click on submissions, 2. Click on “add new” and find the file to upload, 3. Click on “submit”).
Then, he can use the testing framework:
- Click on “assignments”
- On first use, the button “collect and prepare tests” is available, so he can click it.
- Click on the button “run the test”
What happens?
At this point we can launch the tests and they run, but no results are collected, so no results are available.
Previously there were some display issues, but now everything is displayed correctly (11/30/2011 patch by Guillaume).
If we change some configurations:
- We suppose that there is no token available or tests are private.
We have the same situation as previously. (Shouldn’t we have no button available to run the tests in those cases?)
- If the due date has passed: no problem appears. Indeed, the student cannot launch the tests.