View Full Version : MS SQLserver guru..... HELP !!!!
Kbee
January 2, 2003, 09:01 PM
I need help with sql server 2000. I have a torn page[s] and the stupid thing refuses to start because the DB is marked suspect.
I have tried to issue a sp_resetstatus [after doing the usual sp_configure 'allow updates', 1 ] and then tried to do "DBCC CHECKDB ('Tacoda_Main', REPAIR_ALLOW_DATA_LOSS) WITH ALL_ERRORMSGS" but dbcc complains that the DB is "STILL MARKED SUSPECT !!!!"
I thought of using 'alter datase' and turning off torn_page_detection BUT THE DB MUST BE LOADED TO DO THIS!!!
i am at my wits end [did not take long] and this is really pissing me off now this is why i use UNIX ONLY !!!
I realy need to recover this DB [please dont say "do it from backup"], this is a new install, about 1 week old [ok week and .5] , however 400GB of targeting data is in this [of which 80GB is new]. so what i need is a way to recover data, I know there must be a registry option to turn off this torn page detec... so when sqlserver manager starts it does not try to do a torn page check on all the DBs.
-Kurt
jamrock
January 2, 2003, 11:43 PM
Kbee,
Don't have much experience with torn pages. Will have to do some research.
Questions:
Are you running in full recovery mode? That is do you have a transaction log?
Do you have a test server that we can experiment with?
jamrock
January 2, 2003, 11:46 PM
Two more questions.
Do you have a backup of all databases just after SQL Server installation?
How have you been bringing the data into SQL Server?
Kbee
January 3, 2003, 12:07 AM
tape backups exist of the very old database, I do it by using sp_addumpdevice 'tape', 'some_dumb_name','path_to_dev' and then issuing the BACKUP DATABASE statment to the named dev.
I have been creating a stored proc to do this automatically but with the upgrade of the system last week i have put that off.
However before the database upgrade [that is moving from the tacoda 2.1to 2.5 DB] the DB was stopped and the MDFs backed up.
-Kurt
P.s. I know that a backup can be set in the enterprise manager , however I have always prefered the cli and query analyzer is a close to cli as it gets in windows.
jamrock
January 3, 2003, 12:12 AM
What recovery mode are you using? If you have a transaction log we might be able to do something.
Kbee
January 3, 2003, 12:24 AM
transaction logs will only give point in time recovery, needless to say they will be very ineffective in restoring 80 GB worth of data
jamrock
January 3, 2003, 12:31 AM
Okay,
I just had a look at SQL. Here is something you can try.
I would really feel more comfortable if you could do an install of SQL on a machine somewhere and test this.
Whatever you do, you NEED to stop the SQL services and backup the directories where SQL is installed. If all else fails, this should allow you to restore the SQL files from a backup. Remember a test environment is best!!!
Start up Enterprise Manager and drill down to the server and then to the databases. Right click on the database in question and choose properties. Choose options. You can then remove torn page detection from that database.
If your database is damaged, removing torn page detection won't fix it. You may have to dump the data out, create a new database and get the data back in.
jamrock
January 3, 2003, 12:42 AM
transaction logs will only give point in time recovery, needless to say they will be very ineffective in restoring 80 GB worth of data
Not too sure about that... Restoring 80 GB of data is the same as restoring 1 MB. It just takes longer. My main concern is the integrity of the database.
Option one (my preference) is to do an install SQL on another machine. Restore the last good backup and apply the transaction log to it. This would mean that your data would not be damaged.
I accept your point re: the amount of time involved.
I would be using non-SQL backup and restore (Ntbackup.exe, Veritas etc.) to move the data across to the new server. That way I wouldn't run the risk of changing the status of any files.
Once you are back on your feet, give me a shout if you want to discuss some backup/recovery options. Hope this helps.
jamrock
January 3, 2003, 12:45 AM
Going to catch some zzzzzzzzzz's. I will check back in the morning.
Kbee
January 3, 2003, 12:58 AM
Thanks for the advise, I think I am going to try something [somewhat risky] ,but I will tell more tomorrow.
-Kurt
Kbee
January 3, 2003, 06:20 AM
Not too sure about that... Restoring 80 GB of data is the same as restoring 1 MB. It just takes longer. My main concern is the integrity of the database.
I have never really investigated sql server but I have extensive knowledge of postgresql and mysql [innoDB tables] and you cannot recreate an entire db from a transaction log but you can do segment rollbacks and correct corruption[minor] with it. however I guess If i were to backup ever transaction written to a log file it may be possible.
Option one (my preference) is to do an install SQL on another machine. Restore the last good backup and apply the transaction log to it. This would mean that your data would not be damaged.
The last backup is of the old 2.1 DB which is not compatable with the table structure and stored proc. of 2.5 , so doing that is not possible ,as you would have to run all the upgrade scripts to migrate the existing data to 2.5 [which tacoda has removed from the servers after installing the new one] + finding another terabyte system to handle old + new is not possible.
I accept your point re: the amount of time involved.
I would be using non-SQL backup and restore (Ntbackup.exe, Veritas etc.) to move the data across to the new server. That way I wouldn't run the risk of changing the status of any files.
Once you are back on your feet, give me a shout if you want to discuss some backup/recovery options. Hope this helps.
This will be considered , especially i am very fimilar with Veritas Netbackup [well the Unix edition]
Whatever you do, you NEED to stop the SQL services and backup the directories where SQL is installed. If all else fails, this should allow you to restore the SQL files from a backup. Remember a test environment is best!!!
I had done that long ago ,as i said in my earlier message, however this makes no difference because upon load the DB is still corrupt.
Start up Enterprise Manager and drill down to the server and then to the databases. Right click on the database in question and choose properties. Choose options. You can then remove torn page detection from that database.
Sorry jamrock , but this would never work with sql server 2k because the DB is maked suspect and therefore all operations are suspended , therefore you would not be able to do this.
What works [sometimes] is to do this :
/* unlock master to allow updates */
USE master
GO
sp_configure 'allow updates', 1
GO
RECONFIGURE WITH OVERRIDE
GO
/* next step is to reset the status from 256+ to 0 */
UPDATE master..sysdatabases SET status = status ^ 256 WHERE name = <dbname>
/* or even better/cleaner */
sp_resetstatus dbname
/* then Lock the master DB records */
sp_configure 'allow updates', 0
GO
RECONFIGURE WITH OVERRIDE
GO
and do a DBCC check as I stated earlier in this thread
however if your have badly corrupted tables then it will not work like this so the risky operation is what i decided to do, but it makes logical sense [except that logic does not apply to any MS app] , I decided to :
1) Create a new, empty database (call it Tacoda_Main_Recover)
2) Put Tacoda_Main into Emergency (or Bypass) Mode
3) Do a full DB copy from Tacoda_Main to Tacoda_Main_Recover using bcp out then in
4) Do a DBCC consistency check on this recovered DB
5) Alter TALoader setup (through TALControl) to point at Tacoda_Main_Recover and see if all is good
I found that if you do UPDATE master..sysdatabases SET status=-32768 WHERE name='dbname' it will set the DB in emergency bypass mode, this is a offline read only mode and from this I will do steps 3-5. BUT THIS IS STILL NASTY , ORACLE DOES NOT DO THIS, POSTGRESQL,DB2 even THE LOWLY MYSQL is FAR better than this CRAP !!!
But i bet the marketing muscle of Micro$oft is selling tons of this in JA. So why am I using this stuff, well 1) it is a third party product that is built around sql server 2)My boss knows the owner of tacoda and they have some deal going, so it was his call.
Well time to get another Sqlserver book, I will say this much , If this procedure fails I am calling microsoft to help fix it, because pulling a 24hr on DB recovery [especially when i do not know it so well] is nasty.
-Kurt
jamrock
January 3, 2003, 08:18 PM
Hmmm,
I don't know...
Our databases aren't as big as yours. Let me state that up front. Still we really haven't had any problems with SQL Server 2000.
The trick from what I can see, is to have a solid data recovery plan. With a good plan you can pretty much have full point in time recovery.
We do a full backup every night and 4 transaction log backups each day. The backup of the logs takes place while all the users are busy working away. No-one complains about reduced speed.
When the nightly backup starts taking too long we will switch to a full backup on Sundays, differential backups on week nights and 4 log backups during the day.
The backups are automated via Enterprise Manager and are done to disk. An automated backup from disk to tape is scheduled about an hour after the estimated end of each backup to disk. This tape backup is done using a standard backup application. You can even use Ntbackup.exe.
We have configured Enterprise Manager to delete all backups older than x days.
It works for us and we can do point in time restore. We schedule periodic test restores to a test server just to make sure that everything is working well.
I recommend having the backups on one (hardware) raid mirror, the logs on another and the data on a raid 5 array.
By the way, microsoft.com says that torn pages usually occur when there is some kind of disk failure when SQL server is writing. (Something to that effect). What caused yours? I need to see how our data recovery plan would deal with this.
Peace.
Kbee
January 3, 2003, 09:13 PM
Just as they said "a failure during write"
<RANT>
DUDE ....WE BOUGHT A DELL....WERE SCREWED !!!, there is something very wrong with the dell poweredge 2550 using a perc controller, this is the second time we have had torn pages because of some weirdity with the controller. First off dell ships the server [pre-configured] with 2 IRQ conflicts with the controller and the other internal cards [THAT THEY CONFIGURED], this was my first encounter with the dreaded torn page, took me sometime but that was easily fixed with the method stated below.
This time, (now i know you will find this hard to believe), one member of the raid5 array goes offline [now remember raid5] , this happens during the datawarehouse maintainance cycle, which is scheduled at 1am and is very IO intensive, the "degraded" state of the array led to write time outs with the DB and later serious corruption of the NTFS filesystem [after recovery i got found.0000 directories], but how is it, that windows handles this situation so badly, we have unix servers with just as much load handle an array rebuild gracefully without corruption of the FS. To think that microsoft has the gaul to call this product Windows 2000 *Advance Server*.
</RANT>
Anyway your backup policy is a good one, which is what we have employed on our unix servers[using legato network backup], however this is our only Winblows box that we as SAs have to deal with, and I am the only one who was willing to get dirty and learn MS Sqlserver.
I hope that this is something that i never have to repeat, and believe me, backing this up is frequently is going to be a major priority.
As for your system being able to handle this , i am sure it will as you have multi levels of backups , however to restore 400gb of data from tape is really slow , so i will place 4 more 73GB SCA disks in the array [make it 1.3TB] and backup to the free sapce created, for fast [semi redundant restores] and then backup the semi reduntand [disk backups] once a day.
Thanks for the brain storm last night, I have restored used the "risky method" and will let you know how it goes when i complete the restore tomorrow evening [it would be earlier but one 24hr run is all i can do]
Well, I guess that I must admit that we could have been more proactive with the backups of this system, especially after we started to test out the new 2.5 system [and load 80GB of new data], this is a lesson that i do not want to repeat.
jamrock
January 4, 2003, 09:56 AM
there is something very wrong with the dell poweredge 2550 using a perc controller, this is the second time we have had torn pages because of some weirdity with the controller.
That is unusual. Most people I talk to have found the Dell servers to be quite reliable. Perhaps you could ask Dell support to send a technician to look at the controller and the hard drives.
If the issue is not resolved, the torn page is likely to happen again.
we have unix servers with just as much load handle an array rebuild gracefully without corruption of the FS
That does not surprise me. Windows has been trying for some time to duplicate the Unix scalability and reliability. They still have work to do.
however to restore 400gb of data from tape is really slow , so i will place 4 more 73GB SCA disks in the array [make it 1.3TB] and backup to the free sapce created, for fast [semi redundant restores] and then backup the semi reduntand [disk backups] once a day.
My only concern here is redundancy. I strongly believe in having the backups and logs on different raid mirrors. See previous post.
If anything happens to the data on the raid 5 (spindle dies), you still have access to the logs and the backups. You can load SQL Server on another machine, restore the last full backup, differential backups, log backups and be up and running in reasonable time. Of course with such a large database a "reasonable time" may be quite a while.
It may be a good idea to have some kind of test server around. This could be a less powerful machine that you can use temporarily if you have hardware failure. Something that you could use for a few days until you can replace a board or controller etc. This test server can be used to test Microsoft's many patches, hot fixes, service packs, etc. before you roll them out to your production environment.
Looking forward to hearing the outcome.
Kbee
January 4, 2003, 04:53 PM
dell has offered support and gave the all clear on the box,including upgrading the drivers.
However this time it was the drive that failed , but reading about how SQL server works [especially when doing heavy IO] while the raid controller is rebuilding the array , i can see why this happens, so i guess the blame is on the design of sqlserver and not dell.
Just so you know, with the perc [power edge raid controller] card you can set multiple levels of raid on the same card. When i said i was going to place 4 more drives in the array i meant , placing 4 drives in the power vault storage array, and not the raid5 array [the PV has a raid 5,0and1 set] i will just be adding another raid0 set.
While a stripe is not redundant by any means , it is fast and since it is a temp, short-term recovery area [before sending it to tape].
The best solution to be fully redundant [and offer less downtime] is to replicate the DB, however, i have to justify to my boss the expenditure of $20k + purchase of a second SQLserver .
-Kurt
P.s. Setting up a sime test server is not possible since the requirements to load and test this DB are very high [2x xeon 8 GB of ram]
jamrock
January 5, 2003, 08:11 AM
with the perc [power edge raid controller] card you can set multiple levels of raid on the same card
Now that you mention it... I have always wanted a better understanding of how hardware raid works. Do you know where I can find some documentation? The folks at Dell weren't able to provide me with anything.
the PV has a raid 5,0and1 set
What do you have on each set? There are so many possible options. I have read of people who have even put the swap file on a mirror all by itself. This is supposed to make SQL faster.
The argument here is that SQL Server is constantly reading from or writing to the database, log files, backups, swap file and Windows files. If each of these is on its own raid set, SQL can read/write everything simultaneously. If not, it has to wait for one task to be completed before it can carry out the next.
Of course, all of this requires a lot of drives (and money).
Kbee
January 5, 2003, 10:42 AM
OK , so after 26.5 hours I now have completed the bulk copy of data from the old DB[corrupted] to the new. I will begin testing, but based on my query on the importjobhistory table i guess i have lost everything after the 31st [but that is the night that the system went dead anyway].
I am assuming that you are fimilar with software raid, and the differences in raid levels, well if so , just remember that hardware raid offers less overhead while the system has to maintain parity or rebuild a member of the set. If you upgrade your OS or something goes wrong with the OS you have nothing to worry about WRT losing your raid config or configuring the OS[ok, i once upgraded my linux kernel without putting MD back in] to read from the device , as this is stored in the bios on the raid card.
I do however , find that raid0-1 seem very fast in software raid under linux and Solaris ,will rebuild with little system overhead. However during heavy IO [like the maintaince cycle that my datawarehouse has] a rebuild can lead to IO starvation/timeouts and general disk head contention, with both hardware and software raid but software adds the extra cpu overhead to that problem. If I am incorrect and you know nothing of raid levels, please check this doc, it was the first i could find online;
http://www.acnc.com/04_01_00.html
This site gives extensive background in the different raid levels [pros and cons]
WRT my filesystem layout ,one set has the main DB [raid5] the other has the tempdb and T-logs[raid0 ] the other has the importDB[raid1], swap is generaly kept on a software raid mirror [aka raid 1]. The reason for this config is that i do not want to deal with head contention during the various operational phases of the datawarehouse.
-Kurt
jamrock
January 5, 2003, 02:41 PM
I am assuming that you are fimilar with software raid, and the differences in raid levels
Yes.
However during heavy IO [like the maintaince cycle that my datawarehouse has] a rebuild can lead to IO starvation/timeouts and general disk head contention, with both hardware and software raid
This is not surprising. The regular read and write processes work more slowly when you are rebuilding raid. It is therefore logical for timeouts to occur during heavy I/O activity. In my environment, I would have the luxury of doing the rebuild at night when users are off the system. I would just have to remember not to run any scheduled activities.
it was the first i could find online;
http://www.acnc.com/04_01_00.html
I will have a look. I am quite comfortable with the theory behind raid and all that. I have used software raid on Windows 2000 before.
It is just that I have never opened up a Dell server and worked out the logic of the perc. All the ones I have used have come pre-configured from the manufacturer. Strangely enough, I have never had to deal with the failure of a disk on a hardware raid array.
I guess one day I will just have to jump in and configure some kind of raid device so I can get the full understanding of how it works.
Setting up a sime test server is not possible since the requirements to load and test this DB are very high [2x xeon 8 GB of ram]
Here is the deal as I see it. You have to be careful how you apply Microsoft's service packs, etc. There have been cases of machines not restarting after patches have been applied.
I received notification some months ago of issues with Windows 2000 service pack 3. In some environments, the Windows installer didn't work after it was applied. Microsoft has had to publish quite a bit of documentation re: how to apply it.
Their official position is that admins need to test these service packs before applying them. In other words, if it don't work, don't blame us. We told you to test the thing before you use it.
jamrock
January 5, 2003, 02:56 PM
OK , so after 26.5 hours I now have completed the bulk copy of data from the old DB[corrupted] to the new.
26.5 hours??? Okay...
I will begin testing, but based on my query on the importjobhistory table i guess i have lost everything after the 31st
It could have been a whole lot worse. If the integrity of your database is okay, then you have come out a lot better than many. Is there any way for you to identify the missing data? In some environments the database consists of information that users have entered from paper documents.
Kbee
January 5, 2003, 03:20 PM
;D yep 26+ hours but i am so tired [i am also oncall] that i have not started to test the working state of the application.
The missing data is not someting that i need to worry about , I just have to reprocess the access logs from the webservers.
Tacoda is an audience targeting system [costing about 250k, just for software ], so it allows you to target users/visitors of your site[and much more], more info below :
http://tacoda.com/product_index.html
raid config is easy, between the various adaptec,Mylex[DEC960],perc and 3ware cards i have used , it really does not seem to differ that much amoung the brands,
All it take is for you to define a raid set and type and select the physical disks to add to the set(s).If you are fimilar with software raid under linux or Solaris then you will have no problem with this.
Kbee
January 8, 2003, 12:43 PM
I forgot to update you , THE DB IS BACK !! ;D with *minimal* loss. and I learnt more than i wanted to about sqlserver, however I will chalk it up to experience [one i hope never to see again].
-Kurt
P.s. beware of bcp and autoincremented columns remember -E is your friend.
jamrock
January 8, 2003, 08:41 PM
THE DB IS BACK !! with *minimal* loss
Glad to hear that. A rather scary experience if you ask me.
however I will chalk it up to experience
So will I. If I ever have to rebuild a raid array, I will disable all scheduled events until the rebuild is complete.
Congrats!!!
Powered by vBulletin® Version 4.2.0 Copyright © 2013 vBulletin Solutions, Inc. All rights reserved.